khuyentran1401 / data-science-template

Template for a data science project

Add `aws-sagemaker` branch #14

Closed tapyu closed 8 months ago

tapyu commented 9 months ago

Hi Khuyen!

Do you have any interest in getting acquainted with Amazon Web Services (AWS) and its Sagemaker ML framework? I am currently working with it, and I extended your repo to fit your template into the AWS Sagemaker ecosystem. This is a WIP. There is a lot to do, as AWS is quite complex and I am still learning. However, the core concepts are already done :)

I will probably make many commits before I finally open it for merging. My goal in opening this PR beforehand is twofold:

khuyentran1401 commented 9 months ago

Pretty cool. So this integration makes it possible to use the template on Sagemaker? Could you simply run the cookiecutter command in the Sagemaker console to use the template in Sagemaker?

tapyu commented 9 months ago

> So this integration makes it possible to use the template on Sagemaker?

Yes.

> Could you simply run the cookiecutter command in the Sagemaker console to use the template in Sagemaker?

It is much better, indeed. I am feeling pretty stupid that I didn't think of it before, hahaha. I will fix it. However, the AWS console is not where we should run cookiecutter. Rather, we run it in the Code Editor terminal, within a domain.

khuyentran1401 commented 9 months ago

Got it. I haven't used Code Editor terminal, but I assume that we can just run cookiecutter on the Code Editor terminal. Is that correct?

tapyu commented 9 months ago

> Is that correct?

Yes, it is as simple as you said.

khuyentran1401 commented 9 months ago

Would you prefer to integrate all these modifications into a new branch, or are there certain elements you'd like to adopt while excluding others? I want to avoid any potential confusion for users that might arise from creating an additional branch if we can accomplish the same goals within the current one.

tapyu commented 9 months ago

> are there certain elements you'd like to adopt while excluding others?

I am not sure what you mean by "elements". However, I totally agree on the idea of avoiding unnecessary branches. At the moment, we are using branches to organize the approaches¹. The more approaches we have, the more branches we will create. It is not great! To solve it, I could suggest the following idea:

1- Have only one branch (main or whatever).
2- Put all templates in the root directory of main, or within a directory such as ./templates/, e.g.,

templates
├── {{cookiecutter.template1}}
├── {{cookiecutter.template2}}
.
.
.
└── {{cookiecutter.templaten}}

3- Create a prompt dialog to help the user choose which template they want to use. It would be something like:

1- Which approach do you want to use for your Data Science project?
  [ ] DVC + pip
  [ ] DVC + poetry
  [x] AWS Sagemaker
2- Do you want to work with AWS Sagemaker on the cloud or on your local machine?
  [ ] Local
  [x] Cloud
3- Do you want to use Amazon S3 to store your artifacts?
  [x] Yes
  [ ] No

and so on... In that way, we can help the user choose the right set of tools interactively, and in the end, we call cookiecutter to create a directory template tailored to the user's needs.

I am not sure if cookiecutter can provide such a prompt dialog (I don't think so). If not, we should think about what the best way out would be.

Let me know if you think it is a good idea or if it is too different from what you are thinking of :)

¹: By "approach", I mean a set of tools that are used together: DVC+pip, DVC+poetry, AWS Sagemaker...

khuyentran1401 commented 9 months ago

@tapyu I've reviewed the changes in this PR and it looks like some of them are related to AWS Sagemaker, while others are not. Would it be possible to create a separate branch for the changes that are not related to Sagemaker? I'm still undecided about whether or not to have a different template/branch for Sagemaker, but I think some of the other changes you made are worth incorporating. Let me know what you think, thanks!

khuyentran1401 commented 9 months ago

I like your idea of a prompt dialog to help the user choose which template they want to use. We can use hooks to customize a template dynamically according to user preferences and use Choice Variables to offer users a selection of predefined options.
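
For illustration, here is a minimal sketch of what such a hook could look like. It assumes a choice variable named `dependency_manager` and made-up file names; none of this exists in the template yet. Cookiecutter renders hook scripts with Jinja before running them, and runs them from the root of the generated project.

```python
"""Sketch of a hooks/post_gen_project.py script (all names are hypothetical)."""
from pathlib import Path

# In a real hook, cookiecutter fills this in via Jinja, so the line would read:
#   DEPENDENCY_MANAGER = "{{ cookiecutter.dependency_manager }}"
# It is hard-coded here so the sketch runs on its own.
DEPENDENCY_MANAGER = "pip"

# Files that only make sense for one choice (assumed names, for illustration).
FILES_BY_MANAGER = {
    "pip": ["requirements.txt", "requirements-dev.txt"],
    "poetry": ["pyproject.toml"],
}


def remove_unused_files(project_dir: Path, chosen: str) -> list[str]:
    """Delete the files that belong to the dependency managers not chosen."""
    if chosen not in FILES_BY_MANAGER:
        raise ValueError(f"Unknown dependency manager: {chosen!r}")
    removed = []
    for manager, files in FILES_BY_MANAGER.items():
        if manager == chosen:
            continue  # keep the files of the selected manager
        for name in files:
            path = project_dir / name
            if path.exists():
                path.unlink()
                removed.append(name)
    return removed


# In the hook itself, the last line would be something like:
#   remove_unused_files(Path.cwd(), DEPENDENCY_MANAGER)
```

With this approach, a single template could ship the files for every option and the hook would prune whatever the user did not select.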

However, I'm concerned that the current question we're using might be a bit confusing for some users. For example, the question "Do you want to use DVC + pip or DVC + poetry?" might not be immediately clear for users who are new to these tools.

I think it would be better to rephrase the question to make it more straightforward. For example:

Select dependency manager:
1 - pip
2 - poetry
Choose from 1, 2 [1]:

This way, users can easily understand the choice they're making and select the option that best fits their needs.
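
As far as I know, a prompt along these lines is what cookiecutter shows for a choice variable, i.e. a variable whose value in `cookiecutter.json` is a list, with the first item used as the default. Here is a small sketch that generates such a file's content (the variable names and values are assumptions for illustration, not the template's actual settings):

```python
import json

# Hypothetical context for cookiecutter.json; names are illustrative only.
context = {
    "project_name": "my-project",
    # A list value is treated as a choice variable; the first item is the default.
    "dependency_manager": ["pip", "poetry"],
}


def render_cookiecutter_json(ctx: dict) -> str:
    """Return the text that would be saved as cookiecutter.json."""
    return json.dumps(ctx, indent=4)


print(render_cookiecutter_json(context))
```

Inside the template files, the selected value would then be available as `{{ cookiecutter.dependency_manager }}`.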

What do you think?

tapyu commented 9 months ago

> @tapyu I've reviewed the changes in this PR and it looks like some of them are related to AWS Sagemaker, while others are not. Would it be possible to create a separate branch for the changes that are not related to Sagemaker?

Now I got it. I will remove the non-AWS related stuff ASAP.

> I'm still undecided about whether or not to have a different template/branch for Sagemaker

Indeed, I am not sure whether it is worth having one branch for AWS either. In the end, we may not need another branch for AWS, and this PR would be closed instead of merged.

tapyu commented 9 months ago

> What do you think?

Amazing idea! This is much better. So if we implement it, would we delete the current branches?

khuyentran1401 commented 9 months ago

Great! I look forward to your PR.

If we successfully implement the proposed changes and verify their effectiveness through rigorous testing, we can consider deleting the existing branches (with the exception of the Prefect branch, which will be retained for purposes of the article that has been written about it).