Sunset existing examples in favor of new ones with better datasets

ibm-granite-community / granite-timeseries-cookbook

Granite Time Series Cookbook

Creative Commons Attribution 4.0 International

Sunset existing examples in favor of new ones with better datasets #18

Closed rawkintrevo closed 1 month ago

rawkintrevo commented 1 month ago

recipes/Time_Series/Preprocessor_Use_and_Performance_Evaluation.ipynb and recipes/Time_Series/Time_Series_Getting_Started.ipynb

use datasets from Kaggle's walled garden. Forcing users to sign up for a third-party service is undesirable, and unsustainable when it comes to testing.

Options going forward: someone can talk to Kaggle about not blocking the dataset, or we just rewrite the example using open data. I have one that pulls from the Chicago Transit Authority, there are hundreds on Hugging Face, and there's an entire working group about datasets in the AI Alliance.

fayvor commented 1 month ago

@rawkintrevo the energy dataset is in the public domain. Ok if we just check it in? It's 6.27MB, but we could truncate it if that's too big.

rawkintrevo commented 1 month ago

If you can find a version that isn't blocked by a Kaggle login, then it would be OK as is. As @bjhargrave and @hickeyma have often pointed out, making the user sign up for an additional service is an undesirable path if avoidable.

rawkintrevo commented 1 month ago

@fayvor I would be cautious and double-check things before just lifting Kaggle's dataset.

wgifford commented 1 month ago

I found a few suspiciously similar datasets here: https://huggingface.co/datasets/vitaliy-sharandin/energy-consumption-weather-hourly-spain https://huggingface.co/datasets/vitaliy-sharandin/energy-consumption-hourly-spain

Tomorrow I will check if they are consistent with the original ones.

wgifford commented 1 month ago

I have updated the Bike Sharing notebook and will push soon.

wgifford commented 1 month ago

I checked and the above datasets are identical to the original ones from Kaggle. @adampingel are we comfortable using these instead? One obvious risk is that vitaliy-sharandin may choose to remove these at some point, breaking the recipe.
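
For readers who want to repeat this kind of check, here is a minimal sketch of one way to compare two local CSV copies of a dataset. It is not the procedure actually used in this thread, which is not described; the function names `table_digest` and `same_table` and the file paths are hypothetical, and it assumes both copies have already been downloaded as UTF-8 CSV files.

```python
import csv
import hashlib


def table_digest(path):
    """Return a SHA-256 digest of the parsed rows of a CSV file.

    Rows are parsed with the csv module, so differences in line endings
    (\r\n vs \n) do not affect the digest. Cell values are compared as
    strings, so "1.5" and "1.50" count as different.
    """
    digest = hashlib.sha256()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # repr() of the list gives an unambiguous encoding of the row.
            digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def same_table(path_a, path_b):
    """True if both CSV files contain the same rows in the same order."""
    return table_digest(path_a) == table_digest(path_b)


# Hypothetical usage, with placeholder file names:
#   same_table("kaggle_copy.csv", "huggingface_copy.csv")
```

A row-level digest is stricter than comparing summary statistics and cheaper than loading both files into memory, but it treats reordered rows as a mismatch.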

rawkintrevo commented 1 month ago

Works for me if works for @adampingel

wgifford commented 1 month ago

@rawkintrevo In the meantime -- this notebook should be clean enough (I think) for your automated test: https://github.com/ibm-granite-community/granite-timeseries-cookbook/blob/main/recipes/Time_Series/Bike_Sharing_Finetuning_with_Exogenous.ipynb

wgifford commented 1 month ago

Also -- since the original data is CC0, I don't know why we couldn't host our own copy in the IBM and/or IBM-Granite org on HF. It would make sense to run it by legal first, though.

rawkintrevo commented 1 month ago

I agree, but I know everyone gets real persnickety about it, and it's above my pay grade.

rawkintrevo commented 1 month ago

@wgifford your notebook looks good, can you open a PR or how can we update to that?

adampingel commented 1 month ago

@wgifford @rawkintrevo That's great. Yes, let's move forward with switching the notebook to use the dataset as hosted in HF.

Acknowledged that there is some risk that it will be un-published. We can continue to think about that after we make this change. And we will always have the previous version in our commit history if we need to roll back.

The tests only run when the notebook changes, so we won't get immediate notification if the data goes away, but that's an issue we should solve generally for all cookbooks in a separate effort.
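
One possible shape for such a general check is a small script run on a schedule that only asks whether a dataset URL still responds. The sketch below is an assumption, not something the cookbook's CI is known to do; the function name `dataset_is_reachable` and the `opener` parameter (included so the check can be exercised without network access) are hypothetical.

```python
import urllib.error
import urllib.request


def dataset_is_reachable(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` succeeds.

    `opener` defaults to urllib.request.urlopen; passing a different
    callable allows testing without touching the network.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            # urlopen follows redirects, so this is the final status code.
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError):
        # HTTPError (e.g. 404) is a subclass of URLError.
        return False


# Hypothetical usage in a scheduled job, with one of the URLs from this thread:
#   dataset_is_reachable(
#       "https://huggingface.co/datasets/vitaliy-sharandin/energy-consumption-hourly-spain"
#   )
```

A check like this only shows that the page still exists, not that the data is unchanged, so it complements the notebook tests rather than replacing them.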

wgifford commented 1 month ago

@rawkintrevo I more or less have updated versions of all 4 notebooks that I can push shortly (or have already merged). These would all be using easily accessible datasets. 3 are using the HF datasets above and 1 is using the UCI dataset.

This one is already updated and merged in main: https://github.com/ibm-granite-community/granite-timeseries-cookbook/blob/main/recipes/Time_Series/Bike_Sharing_Finetuning_with_Exogenous.ipynb

wgifford commented 1 month ago

The two remaining notebooks to be merged are fixed in this PR: https://github.com/ibm-granite-community/granite-timeseries-cookbook/pull/19

@rawkintrevo @fayvor Does one of you want to review?

wgifford commented 1 month ago

Closing because I believe this has been resolved. Please re-open if you see some issue.