[DISCUSSION] General folder structure for reco, cv, and forecasting repos

miguelgfierro commented 5 years ago

I'm setting this discussion public in case any of our users or customers want to provide feedback.

Context

We are building repos around computer vision and time series forecasting. We would like to homogenise the structure between them and the recommenders repo. The CV repo is still starting and the forecast repo has been running for some time internally and it is focused on benchmarks.

The idea is to have a common structure (and user experience) between the 3 repos. Trying to have the best of each: nice examples and utilities from recommenders, nice benchmarks from forecasting repo and support for CV, as well as the other solutions in reco and forecast.

Question

What will be the optimal structure that will help our users and us to build better solutions in recommendations, CV and forecasting?

Please provide answers in detail ways, example: e1) I would take the recommenders structure (notebooks, reco_utils, tests) and rename the folders to X, Y, Z... e2) I would take the recommenders structure (notebooks, reco_utils, tests) and add a folder for benchmarks... e3) ...

heatherbshapiro commented 5 years ago

I think one of the other main differences between the recommenders and forecast repo structure is the idea of broad vs industry specific. I think we want a mix of both mixed throughout the repos, as there is always going to be someone looking for a specific example, like fraud detection, that offers a concrete example of how to do a thing. The more broad examples are useful if you’re looking for ideas like “anomaly detection” in general. We should also figure out how to have a progression between general > specific workflows.

I like the general structure of the recommenders repo but with some tweaks. I think we should have some sort of algo cheat sheet explaining why a user might choose one algorithm over the other (similar to the Studio cheat sheet , and that could also include the benchmarks similar to the time series repo.

@bethz and I were discussing, and we thinkg this repo scaffolding could work:

Notebooks
- local (might need a better name)
  - prepare data, models, ...
- AzureML
  - prepare data, models, ...
Examples by Industry Vertical
utils (should be named utils instead of reco_utils)
scripts
tests
- Moves Tests.md here
.gitignore
Contributing
Contributors (A better name might be AUTHORS/ CODE OWNERS)
License
SETUP.md
README
- Name
- Overview
- Getting Started
- Notebooks
- Algorithm Comparison (cheat sheet)
  - Benchmarks
- Why AzureML?
- Releases
- Contributing
- Build Status

gramhagen commented 5 years ago

I like this setup, most repos also have a code_of_conduct file. I'm not sure we need a separate azureml directory under notebooks, we might just have examples scattered through there on how to utilize azure ml?

miguelgfierro commented 5 years ago

Great suggestion @heatherbshapiro and @bethz. I like the structure.

I'm just going to comment on the changes that you propose:

Notebooks
- local (might need a better name)
  - prepare data, models, ...
- AzureML
  - prepare data, models, ...
Examples by Industry Vertical

In my mind notebooks and examples by industry is the same concept, so maybe, to reduce the folders we can have 1 folder called examples. So we will have 4 big folders: examples, utilities, tests and scripts.

I think it is very interesting to have information about industry vertical. I think we need to think this well to not repeat code. This is related to how we do the transition of general to specific. Maybe we can add documentation that explains what would be the best algorithms if you are an ecommerce, and then link a notebook with an example. Or we can mention, algorithm X, Y, Z is very good for this kind of business. This could also be related to the idea of recommending an algorithm.

I agree with @gramhagen, I wouldn't add an AzureML folder. I think we should add AzureML is a less explicit way and in the places where it add value. In some situations, like the test or for computing large networks is very valuable. In that situations we use AzureML because it makes a significant difference.

utils (should be named utils instead of reco_utils)
scripts
tests
- Moves Tests.md here
.gitignore
Contributing
Contributors (A better name might be AUTHORS/ CODE OWNERS)
License
SETUP.md

I agree with all this. In the past the name we had was utilities. Also, about renaming Contributors, I think it makes sense, I've seen other repos using the name AUTHORS.

README
- Name
- Overview
- Getting Started
- Notebooks
- Algorithm Comparison (cheat sheet)
  - Benchmarks
- Why AzureML?
- Releases
- Contributing
- Build Status

Here I would remove releases. Another change I would suggest is to give more importance to the benchmarks, and this is something that the forecasting guys are doing really well. If we target professional DS, they would be super interesting on seeing algos results on different datasets. An idea on how to do this would be to have a BENCHMARKS.md, and then have code that computes the algos in the folder examples. Ideally, we would like the benchark tables to be generated automatically (the idea of a live benchmark was proposed some time ago but not implemented yet https://github.com/Microsoft/Recommenders/issues/282).

heatherbshapiro commented 5 years ago

I think we want to be very explicit on the value add of AzureML and how it enhances the existing solutions. If they are intertwined in the same folder, then it might be difficult to differentiate what I can do without it and what I can do with it and what the actual value add is. We could call it out in the readme for descriptions of each notebook, but we do need to differentiate them somehow.

wutaomsft commented 5 years ago

local->generic AzureML->azure (have a preference of using lower case in first letter of directory so user does not need to press the shift key). also, we used reco_utils (instead of utils) because it avoids confusion with other utlis that users may have on PYTHONPATH. With upcoming ts and cv repos, we should try avoid generic package names.

bethz commented 5 years ago

A few items:

can we use utils_rec instead of reco_utils? makes it easier for users to recognize it's a utils folder. TLA is more common than 4 letters. And by having all our stuff be utils_xxx and utils_yyy it's easier to find them in the path
I need to think on/incorporate other comments

heatherbshapiro commented 5 years ago

I don't think that we would want to only break it down by industry verticals. This will exclude users who do not fit into these categories. I think we want to have both a broad view and a specific industry view so that users can benefit from both.

wutaomsft commented 5 years ago

@bethz rec is already widely used to be the shorthand of record. Changing names like this requires a lot of work, so let's make sure there is strong reason in doing that.

wutaomsft commented 5 years ago

@heatherbshapiro how about we list algos as the primary view, then add a folder called "use_cases" or similar to surface different use cases?

chenhuims commented 5 years ago

@heatherbshapiro @wutaomsft It would be nice to accommodate both views and creating two separate folders like 'algorithms' and 'use_cases' sounds a good way to me. This may incur some redundancy though and I'm thinking how to avoid the overlap between these two folders. Maybe we can add a toy example under 'algorithms' or simply document each algorithm there with links referring to the implementations in 'use_cases' folder?

heatherbshapiro commented 5 years ago

I agree, i like the idea of having 'algorithms' with a toy example and then 'use_cases' if there is not too much work - that was similar to what I originally proposed with notebooks and industry verticals. Would use_cases be a sub directory of algos or another top level one?

wutaomsft commented 5 years ago

What I had in mind was a list algorithms (no need for an "algorithms" top level folder). There would be a folder called "use_cases" that describe how one would solve a particular use case. I don't think we need to duplicate any content in algorithms notebooks here; instead we will just present some technical discussions of the use case and then make references to the appropriate algorithm notebooks (which could use that particular use case as an example anyway). In this sense, the "use cases" folder is more like a "soft link" to other materials in the repo.

anargyri commented 5 years ago

I am not sure what is the situation with forecasting, but for recommendations there are not so many public data sets available to cover all the use cases that arise in practice. So this restricts how many notebooks / code we can write, whereas we can write documentation for more use cases.

yueguoguo commented 5 years ago

It would be great to simplify the top-level folder layout like "examples, utilities, tests and scripts" and place them in parallel. Under examples we may have want to have notebooks for

quick-start (or anything else we want to call it) that contains toy-examples for demonstrating particular algorithms, together with necessary data prep, feature engineering, etc., for actual use cases.
deep-dive which explains what each algorithms/data-prep/benchmarks do in a theoretical way.
azure-deploy that shows how things can be augmented by using Azure services (e.g., AML, Databricks, Cosmos DB, etc.).

To elaborate more, in the reco repo, we have "quick-start" notebooks that demonstrate how algorithms can be developed into an e2e pipeline (in local environment). Probably we can extend these notebooks with use case examples to show the efficacy of each algorithm in resolving a particular problem. The "quick-start" notebooks reference the "deep-dive" notebooks for technical details about data-prep/feature-engineering/algorithm and "azure-deploy" notebooks for about Azure services and products.

The idea of cheat sheet is great. Alternatively we can tabulate the offerings in the repo and list their pros & cons as guidance for users to choose wisely for their actual use cases.

Generally, IMHO it would be great to make the things in each repo "modular", "generic" and "reusable".

miguelgfierro commented 5 years ago

i like the idea of having 'algorithms' with a toy example and then 'use_cases' if there is not too much work

algorithms looks pretty similar to what we call quick_start. Then we have in model the deep dives which explain in detail each algo, then we have operationalization that focuses on production (and assume that if you want to know in detail about an algo, you would go to the deep dives).

@chenhuims I would like to understand better the usecases you guys have. I see that you have 2 industries: sales forecasting and energy forecasting, and also you only have 1 dataset per each. Do you have a list of all the usecases you would like to add and the datasets that you would use?

same question for @PatrickBue

PatrickBue commented 5 years ago

Hi, not sure this answers your question @miguelgfierro fully, but at the very least we will have ~10 internal datasets for testing and parameter tuning. In addition, there will be at least one tiny dataset for each use case which is used in the demo notebooks, plus eventually real-life examples for different verticals. We do not currently plan to show how to train using research datasets such as MNIST (too small resolution and hence meaningless) or ImageNet/Coco (too big and pre-trained models exist anyway). Just ping me if you have any follow-up questions..

gramhagen commented 5 years ago

https://github.com/zalandoresearch/fashion-mnist is an interesting alternative to mnist

chenhuims commented 5 years ago

@wutaomsft I think it is a good idea to use a soft link to bridge the gap between use cases and algorithms. This echoes with my thoughts about doing this in the other way by referring to the use cases in the 'algorithms' folder.

recommenders-team / recommenders

[DISCUSSION] General folder structure for reco, cv, and forecasting repos #481

Context

Question