khuyentran1401 / data-science-template

Template for a data science project
679 stars 197 forks

Refactorize Cookie-cutter command prompt to create a heavily modularized data science template #22

Closed: tapyu closed this issue 5 months ago

tapyu commented 5 months ago

Hi @khuyentran1401 !

I just saw in https://github.com/cookiecutter/cookiecutter/pull/1881 that cookiecutter finally has the feature of adding human-readable prompts to the different variables. This enables us to create a more sophisticated data science template.
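For illustration, here is a minimal sketch of what such prompts could look like. It assumes the `__prompts__` key described in the cookiecutter documentation for recent releases; the variable names (`project_name`, `data_versioning`) and the tool options are hypothetical and not taken from this template. The script only builds the content of a `cookiecutter.json` and prints it.

```python
import json

# Hypothetical variables and options, chosen only for illustration.
# A list value makes cookiecutter show a choice prompt; the first item is the default.
context = {
    "project_name": "my-project",
    "data_versioning": ["dvc", "none"],
    # Assumption: human-readable prompts are declared under "__prompts__",
    # as documented for recent cookiecutter releases.
    "__prompts__": {
        "project_name": "What is the name of your project?",
        "data_versioning": {
            "__prompt__": "Which data versioning tool do you want to use?",
            "dvc": "DVC",
            "none": "No data versioning",
        },
    },
}

# Serialize to the text that would be saved as cookiecutter.json.
cookiecutter_json = json.dumps(context, indent=4)
print(cookiecutter_json)
```

Each category of tools would become one choice variable like `data_versioning`, so the number of questions grows with the number of categories rather than with the number of tools.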

My initial thought is to go a step further than what I did in #18 (I haven't checked exactly what it looks like now, since you made some modifications). The idea is to group this giant universe of machine learning tools into categories by functionality (logging, orchestration, data storage, Python linting and code formatting, etc.) and then list the tools in each category so that the user can choose one. In short: create a heavily modularized data science template whose final structure depends on the tools the user picks, while ensuring that the directory structure doesn't vary too much.
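One way this could work, sketched under stated assumptions: the template ships the files for every tool, and a step that runs after generation deletes the files belonging to tools the user did not pick, leaving the shared directories untouched. The mapping `TOOL_PATHS` and the function `prune_unselected` are hypothetical names, not part of this template or of cookiecutter.

```python
import shutil
from pathlib import Path
from typing import List

# Hypothetical mapping: tool option -> paths that only make sense for that tool.
TOOL_PATHS = {
    "dvc": [".dvc", ".dvcignore", "dvc.yaml"],
}


def prune_unselected(project_dir: Path, selected: str) -> List[str]:
    """Delete files of tools that were not selected; return the removed paths."""
    removed = []
    for tool, paths in TOOL_PATHS.items():
        if tool == selected:
            continue  # keep everything that belongs to the chosen tool
        for rel in paths:
            target = project_dir / rel
            if target.is_dir():
                shutil.rmtree(target)
                removed.append(rel)
            elif target.exists():
                target.unlink()
                removed.append(rel)
    return removed
```

In a cookiecutter template, a function like this could be called from `hooks/post_gen_project.py`, which runs inside the generated project directory, with the rendered value of the choice variable passed as `selected`. Because only tool-specific paths are listed in the mapping, common directories such as `data/` stay in place whatever the user chooses.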

However, I am not sure whether you share this goal. I just saw that you removed DVC, so you may have some reservations about it.

What do you think about it?

khuyentran1401 commented 5 months ago

Great suggestions! We can try that. Let's start with DVC only and see if it works well before proceeding with adding more tools. Would you mind creating a PR for this?

tapyu commented 5 months ago

No, not at all. I just need some time :)

tapyu commented 5 months ago

> Let's start with DVC only and see if it works well before proceeding with adding more tools.

Can you please explain what you mean by "starting with DVC only"?

tapyu commented 5 months ago

My idea is, first, to define the categories. Can you help me classify all the tools you've used into categories?

When it comes to MLOps, you are much more of an expert than I am, so you probably have a more accurate answer. My first definitions would be:

The names in parentheses would be the list of options I would create for each category. In this context, I really didn't understand what "starting with DVC only" means.

khuyentran1401 commented 5 months ago

Upon further consideration, implementing the idea of providing templates for all machine learning tools and their functionalities would require a significant amount of work. Additionally, maintaining and updating these templates as the packages evolve would require substantial effort. Moreover, there are already other libraries like ZenML that offer support for these functionalities.

Given these factors, I plan for this repository to remain focused on providing a flexible framework for any data science project. Therefore, I believe it's best to maintain the current approach. However, you're welcome to fork this template and pursue your idea separately.