ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0

making the CI more reliable #59

Closed: edublancas closed this issue 2 years ago

edublancas commented 2 years ago

We have a bunch of integration tests that fetch data and notebooks using Kaggle's API. However, they fail sometimes (probably because the Kaggle API isn't very reliable). See https://github.com/ploomber/soorgeon/issues/58

One solution is to use the Kaggle API once, upload the files to an S3 bucket, and then have the CI download them from S3 instead of Kaggle.
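A minimal sketch of the CI-side download with boto3 (the bucket and key names are hypothetical, and AWS credentials are assumed to be available in the CI environment):

```python
# Sketch: fetch a test asset from S3 instead of Kaggle during CI.
# BUCKET and the key below are hypothetical placeholders.
import boto3

BUCKET = "soorgeon-test-assets"  # hypothetical bucket name

def download_fixture(key: str, dest: str) -> None:
    """Download one test asset from S3 to a local path."""
    s3 = boto3.client("s3")
    s3.download_file(BUCKET, key, dest)

if __name__ == "__main__":
    download_fixture("titanic/train.csv", "train.csv")
```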

idomic commented 2 years ago

@edublancas what about including them in one of the repos?

edublancas commented 2 years ago

Including the files? The problem is the data files; some of them are a few MBs in size. I think we should store them somewhere else.

Wxl19980214 commented 2 years ago

OK, I think storing them in an S3 bucket would be good. I'll list my thoughts here:

  1. Create an S3 bucket for the notebooks and data.
  2. Write a script to download from Kaggle and then upload to S3 (sketched below).
  3. Run it manually once to upload everything we currently have in index.yaml to S3.
  4. Change the CI to retrieve the data from S3 instead of downloading it again.
  5. Add another CI workflow triggered by changes to index.yaml, or add a check to the regular CI that reruns the download-and-upload script if index.yaml has changed.

Do you guys think this is a viable approach?
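A rough sketch of the script in step 2, assuming the kaggle and boto3 packages; the bucket name and the index.yaml layout (a name-to-dataset mapping) are assumptions, since the real file may be structured differently:

```python
# Sketch: download each Kaggle dataset listed in index.yaml and upload it to S3.
# Assumptions: index.yaml maps a name to a Kaggle dataset handle, the bucket
# name is hypothetical, and Kaggle/AWS credentials are set in the environment.
from pathlib import Path

import boto3
import yaml
from kaggle.api.kaggle_api_extended import KaggleApi

BUCKET = "soorgeon-test-assets"  # hypothetical

def sync(index_path: str = "soorgeon/_kaggle/index.yaml") -> None:
    index = yaml.safe_load(Path(index_path).read_text())
    api = KaggleApi()
    api.authenticate()
    s3 = boto3.client("s3")
    for name, dataset in index.items():  # assumed layout: {name: "owner/dataset"}
        local = Path("data") / name
        api.dataset_download_files(dataset, path=str(local), unzip=True)
        for f in local.rglob("*"):
            if f.is_file():
                s3.upload_file(str(f), BUCKET, f"{name}/{f.relative_to(local)}")

if __name__ == "__main__":
    sync()
```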

idomic commented 2 years ago

Yeah, it is. I don't think you need step 2; just do it manually, since there are only 6-7 notebooks. We have an automated script in the CI, but it fails since the Kaggle API isn't stable. I didn't get step #5 though, can you elaborate?

You'd probably also need to make this bucket public, and we shouldn't have anything in there besides those notebooks.

I'm also thinking of opening a git repo just for those files and consuming it; that way we don't need to pay for S3 GET requests every time (try this approach first).
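If we go the git route, the tests could pull files straight from GitHub's raw endpoint; a minimal sketch (the assets repo name and file path are hypothetical):

```python
# Sketch: fetch a test notebook from a dedicated assets repo instead of S3.
# The repo name and file path are hypothetical placeholders.
import urllib.request

RAW_BASE = "https://raw.githubusercontent.com/ploomber/soorgeon-test-assets/main"

def fetch(path: str, dest: str) -> None:
    """Download one file from the assets repo via the raw endpoint."""
    urllib.request.urlretrieve(f"{RAW_BASE}/{path}", dest)

fetch("notebooks/titanic.ipynb", "titanic.ipynb")
```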


Wxl19980214 commented 2 years ago

For number 5: if someone contributes by adding a new URL to soorgeon/_kaggle/index.yaml, then we need to upload it to S3, right? We can of course do that manually, but I'm thinking we could write a workflow triggered by changes to index.yaml (which I don't know is possible to achieve) that runs the script to upload the newly downloaded notebook.
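Something like this might work: GitHub Actions does support a paths filter on the push trigger, so a minimal sketch could look like the following (the script path and secret names are assumptions):

```yaml
# Sketch: re-run the Kaggle-to-S3 sync only when index.yaml changes.
# The script path and secret names are hypothetical.
name: sync-test-assets
on:
  push:
    branches: [main]
    paths:
      - soorgeon/_kaggle/index.yaml
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install boto3 kaggle pyyaml
      - run: python scripts/sync_kaggle_to_s3.py  # hypothetical script path
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
          KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
```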

Wxl19980214 commented 2 years ago

If we do this manually, I think all we need to change is the tests, right? Instead of downloading again from Kaggle, we read directly from our storage repo. And if in the future someone adds new URLs, we also manually put them into our storage repo?

Would this work?

idomic commented 2 years ago

Yes, that's correct! I think we should skip 5 for now; if the number of notebooks doubles or triples, then we can do that.


idomic commented 2 years ago

Let's try this line of thought:

  1. Retry the steps up to 3 times.
  2. If that keeps breaking, try via git with sample data.
  3. If that still fails, put the full data on S3.
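A minimal sketch of the first step, wrapping the flaky Kaggle fetch in a plain-Python retry (download_from_kaggle is a placeholder for the existing fetch logic; pytest-rerunfailures' --reruns flag would be a test-level alternative):

```python
# Sketch: retry a flaky operation up to 3 times with a short delay.
# `download_from_kaggle` below is a placeholder for the existing fetch logic.
import time

def with_retries(func, attempts: int = 3, delay: float = 5.0):
    """Call func(); on failure, sleep and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)

# usage (placeholder function):
# with_retries(lambda: download_from_kaggle("owner/dataset"))
```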

edublancas commented 2 years ago

👍