We have a bunch of integration tests that fetch data and notebooks using Kaggle's API. However, they sometimes fail (probably because the Kaggle API isn't very reliable). See https://github.com/ploomber/soorgeon/issues/58

One solution is to use the Kaggle API once, upload the files to an S3 bucket, and then have the CI download them from S3 instead of Kaggle.
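As a rough illustration (the bucket and object names below are assumptions, not the real setup), the CI could fetch a cached copy over plain HTTPS from a public bucket:

```python
# Hypothetical sketch: download a cached notebook from a public S3 bucket
# instead of calling the Kaggle API. The bucket/key names are made up.
import urllib.request

URL = "https://soorgeon-notebooks.s3.amazonaws.com/titanic/titanic.ipynb"
urllib.request.urlretrieve(URL, "titanic.ipynb")
```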
@edublancas what about including them in one of the repos?
Including the files? The problem is the data files; some of them are a few MBs in size. I think we should store them somewhere else.
Ok, I think storing them in an S3 bucket would be good. I'm going to list my thoughts here:

1. Create an S3 bucket for the notebooks and data.
2. Write a script to download from Kaggle and then upload to S3 (see the sketch below).
3. Run it manually once to store what we currently have in index.yaml in S3.
4. Change the CI to retrieve data from S3 instead of downloading it again.
5. Add another CI workflow triggered by changes in index.yaml, or in the regular CI add a check for whether index.yaml has changed, and run the download-and-upload script again.

You guys think this is a viable approach?
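A minimal sketch of what the script in step 2 could look like; the bucket name, the index.yaml layout, and the `kernel` field are assumptions:

```python
# Hypothetical step 2: pull each Kaggle notebook listed in index.yaml
# and mirror the files to S3. Bucket name and YAML layout are assumed.
import subprocess
from pathlib import Path

import boto3
import yaml

BUCKET = "soorgeon-notebooks"  # assumed bucket name


def sync_notebooks(index_path="index.yaml"):
    entries = yaml.safe_load(Path(index_path).read_text())
    s3 = boto3.client("s3")
    for entry in entries:  # assuming entries like {"kernel": "user/notebook"}
        kernel = entry["kernel"]
        out_dir = Path("downloads", kernel.replace("/", "-"))
        out_dir.mkdir(parents=True, exist_ok=True)
        # "kaggle kernels pull" fetches the notebook's .ipynb file
        subprocess.run(
            ["kaggle", "kernels", "pull", kernel, "-p", str(out_dir)],
            check=True,
        )
        for path in out_dir.iterdir():
            s3.upload_file(str(path), BUCKET, f"{kernel}/{path.name}")


if __name__ == "__main__":
    sync_notebooks()
```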
Yeah, it is. I don't think you need #2; just do it manually, since there are only 6-7 notebooks. We have an automated script in the CI, but it fails since the Kaggle API isn't stable. I didn't get step #5 though, can you elaborate?
You'd probably also need to make this bucket public, and we shouldn't have anything in there besides those notebooks.
I'm also thinking of opening a git repo just for those and consuming it; that way we don't need to pay for S3 GET requests every time (try this approach first).
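A rough sketch of how the tests could consume such a repo (the repo name and file layout are assumptions):

```python
# Hypothetical: fetch a cached notebook from a public git repo instead of
# hitting the Kaggle API on every CI run. Repo name and path are made up.
import urllib.request

RAW_BASE = "https://raw.githubusercontent.com/ploomber/soorgeon-examples/main"


def fetch_notebook(name: str, dest: str) -> None:
    # raw.githubusercontent.com serves files for free, unlike S3 GET requests
    urllib.request.urlretrieve(f"{RAW_BASE}/{name}", dest)


fetch_notebook("titanic.ipynb", "titanic.ipynb")
```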
For number 5: if someone contributes by adding a new URL to soorgeon/_kaggle/index.yaml, then we need to upload it to S3, right? We can of course do that manually, but I'm thinking of maybe writing a workflow triggered by changes in index.yaml (which I don't know is possible to achieve) that runs the script to upload the newly downloaded notebook.
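For what it's worth, GitHub Actions does support triggering on changes to specific paths, so something along these lines could work (a hedged sketch; the file path, script name, and secret names are assumptions):

```yaml
# Hypothetical workflow: re-run the download/upload script whenever
# index.yaml changes. Paths and the script name are assumptions.
name: sync-notebooks
on:
  push:
    branches: [master]
    paths:
      - "soorgeon/_kaggle/index.yaml"
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - run: pip install boto3 pyyaml kaggle
      # sync_notebooks.py is the hypothetical script sketched earlier
      - run: python scripts/sync_notebooks.py
        env:
          KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
          KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```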
If we do this manually, I think all we need to change is the test, right? Instead of downloading again from Kaggle, we read directly from our storage repo. And if in the future someone adds new URLs, we also manually put them into our storage repo?

Does this work?
Yes, that's correct! Let's skip #5 for now; if we see the number of notebooks double or triple, then we can do that.
Let's try this line of thought:

1. Retry the steps up to 3 times.
2. If that keeps breaking, try via git with sample data.
3. If that fails, full data on S3.
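A minimal sketch of step 1, assuming the flaky call is a `kaggle kernels pull` invocation (the retry count comes from the comment above; the backoff timing is an assumption):

```python
# Hypothetical retry wrapper around the flaky Kaggle CLI call.
import subprocess
import time


def pull_with_retries(kernel: str, attempts: int = 3) -> None:
    for i in range(1, attempts + 1):
        try:
            subprocess.run(["kaggle", "kernels", "pull", kernel], check=True)
            return
        except subprocess.CalledProcessError:
            if i == attempts:
                raise  # give up after the third failure
            time.sleep(2 ** i)  # brief exponential backoff between attempts
```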
👍