huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add time series data - stock market #4104

Open INF800 opened 2 years ago

INF800 commented 2 years ago

Adding a Time Series Dataset


INF800 commented 2 years ago

Can I use the instructions in the link below for a time series dataset as well? https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md

julien-c commented 2 years ago

cc'ing @kashif and @NielsRogge for visibility!

kashif commented 2 years ago

@INF800 happy to add this dataset! I will try to open a PR by the end of the day. Could you kindly point me to the dataset? Also, note that we already have several time series datasets checked in, e.g. electricity_load_diagrams and monash_tsf, and ideally this dataset would be in a similar format.

INF800 commented 2 years ago

Thank you. This is what the raw data looks like before cleaning, for individual stocks:

  1. https://github.com/INF800/marktech/tree/raw-data/f/data/raw
  2. https://github.com/INF800/marktech/tree/raw-data/t/data/raw
  3. https://github.com/INF800/marktech/tree/raw-data/rdfn/data/raw
  4. https://github.com/INF800/marktech/tree/raw-data/irbt/data/raw
  5. https://github.com/INF800/marktech/tree/raw-data/hll/data/raw
  6. https://github.com/INF800/marktech/tree/raw-data/infy/data/raw
  7. https://github.com/INF800/marktech/tree/raw-data/reli/data/raw
  8. https://github.com/INF800/marktech/tree/raw-data/hdbk/data/raw

Scraping is automated using GitHub Actions, so every day a new file is added at each of the links above.

I can rewrite the cleaning scripts to make sure the data fits HF dataset standards. (P.S. I am very new to HF datasets.)

The dataset above can be converted into a univariate regression, multivariate regression, or sequence-to-sequence generation dataset, among others. So, do we have some kind of transformation module that reads the data as a generic type (e.g. GenericTimeData) and converts it into another dataset suited to a specific ML task? With such a transformation module, I would only have to add the data once and could apply the transformation whenever necessary.

Additionally, some kind of versioning for the dataset would be really helpful, because it will keep updating. This matters especially for time series datasets.
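As an illustration of the transformation idea raised in this comment, here is a minimal sketch in plain Python. It is not an existing `datasets` API: the function names, the long-format row layout (`ticker`, `date`, plus value columns), and the output keys (`item_id`, `start`, `target`) are all assumptions chosen for the example, loosely in the spirit of the time series datasets mentioned earlier in the thread.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Assumed (hypothetical) input: one dict per observation, with a "ticker",
# an ISO-formatted "date" string, and one or more numeric value columns.
Row = Dict[str, object]


def to_univariate(rows: Sequence[Row], value_col: str) -> List[dict]:
    """Build one record per ticker holding a single date-ordered series."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["ticker"]].append((row["date"], float(row[value_col])))

    records = []
    for ticker, points in sorted(grouped.items()):
        points.sort(key=lambda p: p[0])  # ISO dates sort chronologically as strings
        records.append({
            "item_id": ticker,
            "start": points[0][0],
            "target": [value for _, value in points],
        })
    return records


def to_multivariate(rows: Sequence[Row], value_cols: Sequence[str]) -> List[dict]:
    """Build one record per ticker holding one date-ordered series per column."""
    grouped = defaultdict(list)
    for row in rows:
        values = [float(row[col]) for col in value_cols]
        grouped[row["ticker"]].append((row["date"], values))

    records = []
    for ticker, points in sorted(grouped.items()):
        points.sort(key=lambda p: p[0])
        # Transpose so that target[i] is the full series for value_cols[i].
        target = [[values[i] for _, values in points] for i in range(len(value_cols))]
        records.append({"item_id": ticker, "start": points[0][0], "target": target})
    return records
```

With helpers along these lines, the raw data would be stored once in the generic layout, and each task-specific view would be derived on demand.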

kashif commented 2 years ago

thanks @INF800 I'll have a look. I believe it should be possible to incorporate this into the time-series format.

INF800 commented 2 years ago

Referencing https://github.com/qingsongedu/time-series-transformers-review

kashif commented 2 years ago

@INF800 yes, I am aware of the review repository and paper, which is more or less a collection of abstracts. I am working on a unified library of implementations of these papers, together with datasets, in order to compare and contrast them and build upon the research, but I am not ready to share it publicly just yet.

In any case, regarding your dataset: from looking at the CSV files, it seems to be a mixture of textual and numerical data, sometimes in the same column. As you know, for time series models we need purely numeric data, so I would need your help in disambiguating the dataset you have collected, and perhaps in starting with just the numerical data.

Do you think you can make a version with just numerical data?
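A numeric-only version could be produced along the following lines. This is a minimal sketch under stated assumptions, not the cleaning script used for this dataset: the thousands separators and the `K`/`M`/`B` magnitude suffixes are guesses about what the scraped cells might contain, and `parse_number` and `numeric_only` are hypothetical helper names.

```python
import math
from typing import Dict, List, Optional

# Assumed magnitude suffixes (hypothetical; adjust to the actual scraped data).
_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_number(cell: str) -> Optional[float]:
    """Return the cell as a float, or None if it is not a finite number."""
    text = cell.strip().replace(",", "")  # drop thousands separators
    if not text:
        return None
    multiplier = 1.0
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS:
        multiplier = _MULTIPLIERS[suffix]
        text = text[:-1]
    try:
        value = float(text) * multiplier
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def numeric_only(rows: List[Dict[str, str]]) -> List[Dict[str, float]]:
    """Keep only the columns in which every cell parses as a number."""
    if not rows:
        return []
    keep = [
        col for col in rows[0]
        if all(parse_number(row.get(col, "")) is not None for row in rows)
    ]
    return [{col: parse_number(row[col]) for col in keep} for row in rows]
```

Columns that mix text and numbers are dropped entirely here. Splitting such columns into separate textual and numerical fields would be an alternative, but that requires knowing what each column means.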

INF800 commented 2 years ago

I will share the numeric data and the conversion script by the end of this week.

I am currently on a business trip, and the data is on my desktop.

kashif commented 1 month ago

thanks @INF800 kashif.rasul@gmail.com should work

INF800 commented 1 month ago

It should be in your inbox!