Closed by rawkintrevo 1 month ago
@rawkintrevo the energy dataset is in the public domain. Ok if we just check it in? It's 6.27MB, but we could truncate it if that's too big.
if you can find a version that isn't blocked by a Kaggle login then it would be ok as is. as @bjhargrave and @hickeyma have often pointed out, making the user sign up for an additional service is an undesirable path if avoidable.
@fayvor i would be cautious and double check things before i just lifted kaggle's dataset
I found a few suspiciously similar datasets here: https://huggingface.co/datasets/vitaliy-sharandin/energy-consumption-weather-hourly-spain https://huggingface.co/datasets/vitaliy-sharandin/energy-consumption-hourly-spain
Tomorrow I will check if they are consistent with the original ones.
I have updated the Bike Sharing notebook and will push soon.
I checked and the above datasets are identical to the original ones from Kaggle. @adampingel are we comfortable using these instead? One obvious risk is that vitaliy-sharandin may choose to remove these at some point, breaking the recipe.
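For anyone who wants to repeat that comparison, here is a minimal sketch of one way to check that two CSV copies carry the same rows. It is not necessarily how the check above was done. The helper names `csv_fingerprint` and `same_dataset` and the sample column names are hypothetical, and it assumes both copies have already been downloaded and read into strings.

```python
import csv
import hashlib
import io


def csv_fingerprint(text: str) -> str:
    """Return a SHA-256 digest of the parsed CSV rows.

    Parsing first means differences in line endings (LF vs CRLF)
    do not change the digest; any difference in cell values or row order does.
    """
    digest = hashlib.sha256()
    for row in csv.reader(io.StringIO(text, newline="")):
        # Unit/record separator characters keep cell and row boundaries unambiguous.
        digest.update("\x1f".join(row).encode("utf-8"))
        digest.update(b"\x1e")
    return digest.hexdigest()


def same_dataset(text_a: str, text_b: str) -> bool:
    """True when both CSV texts contain identical rows in identical order."""
    return csv_fingerprint(text_a) == csv_fingerprint(text_b)


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the real energy dataset.
    kaggle_copy = "time,load\n2015-01-01,100\n"
    hf_copy = "time,load\r\n2015-01-01,100\r\n"
    print(same_dataset(kaggle_copy, hf_copy))
```

Recording the fingerprint alongside the recipe would also make it possible to notice later if the hosted copy changes.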
Works for me if works for @adampingel
@rawkintrevo In the meantime -- this notebook should be clean enough (I think) for your automated test: https://github.com/ibm-granite-community/granite-timeseries-cookbook/blob/main/recipes/Time_Series/Bike_Sharing_Finetuning_with_Exogenous.ipynb
Also -- since the original data is CC0, I don't know why we couldn't host our own copy in the IBM and/or IBM-Granite org on HF. It would make sense to run it by legal first, though.
I agree, but I know everyone gets real persnickety about it, and it's above my pay grade.
@wgifford your notebook looks good, can you open a PR or how can we update to that?
@wgifford @rawkintrevo That's great. Yes, let's move forward with switching the notebook to use the dataset as hosted in HF.
Acknowledged that there is some risk that it will be un-published. We can continue to think about that after we make this change. And we will always have the previous version in our commit history if we need to roll back.
The tests only run when the notebook changes, so we won't get immediate notification if the data goes away, but that's an issue we should solve generally for all cookbooks in a separate effort.
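As a rough illustration of what such a general check could look like, the sketch below probes a dataset URL with a HEAD request. It is an assumption about one possible approach, not something that exists in the repo: the function name `dataset_is_reachable` and the injectable `opener` parameter are hypothetical, and a scheduled job (rather than the notebook-change trigger) would have to call it.

```python
import urllib.request


def dataset_is_reachable(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` succeeds with a 2xx/3xx status.

    `opener` is injectable so the check can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    # Dataset page mentioned earlier in this thread.
    url = "https://huggingface.co/datasets/vitaliy-sharandin/energy-consumption-hourly-spain"
    print(url, "reachable:", dataset_is_reachable(url))
```

Some hosts reject HEAD requests or require authentication, so a failure here would be a prompt to look, not proof that the data is gone.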
@rawkintrevo I more or less have updated versions of all 4 notebooks that I can push shortly (or have already merged). These would all be using easily accessible datasets. 3 are using the HF datasets above and 1 is using the UCI dataset.
This one is already updated and merged in main: https://github.com/ibm-granite-community/granite-timeseries-cookbook/blob/main/recipes/Time_Series/Bike_Sharing_Finetuning_with_Exogenous.ipynb
The two remaining notebooks to be merged are fixed in this PR: https://github.com/ibm-granite-community/granite-timeseries-cookbook/pull/19
@rawkintrevo @fayvor Does one of you want to review?
Closing because I believe this has been resolved. Please re-open if you see some issue.
recipes/Time_Series/Preprocessor_Use_and_Performance_Evaluation.ipynb and recipes/Time_Series/Time_Series_Getting_Started.ipynb
use datasets from Kaggle's walled garden. Forcing users to sign up for a third-party service is undesirable, and unsustainable when it comes to testing.
Options forward: someone can talk to Kaggle about not blocking the dataset, or we can just rewrite the example using open data. I have one that pulls from the Chicago Transit Authority, there are hundreds on Hugging Face, and there's an entire working group about datasets in the AI Alliance.