utdemir / distributed-dataset

A distributed data processing framework in Haskell.
BSD 3-Clause "New" or "Revised" License

Support more open datasets #27

Open utdemir opened 4 years ago

utdemir commented 4 years ago

This is an umbrella issue to gather useful datasets which can be used freely.

Things to consider:

Useful Collections:

utdemir commented 4 years ago

Amazon Customer Reviews Dataset

utdemir commented 4 years ago

CommonCrawl

utdemir commented 4 years ago

Global Database of Events, Languages and Tone

utdemir commented 4 years ago

It might be an interesting experiment to implement a dataset of Bitcoin transactions, if we have a way to process them in a partitioned way.

aycanirican commented 4 years ago

https://www.ligo.caltech.edu/page/ligo-data

utdemir commented 4 years ago

https://www.data.gov/ has a lot of open datasets. They tend to be small in size, but there are probably exceptions.

utdemir commented 4 years ago

Tons of taxi trips; the partitioning seems ideal for distributed-dataset. Will have to investigate how performant the website is.

https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

arobertn commented 3 years ago

Not mentioned (as far as I could see) in Google's list:

http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html

Googling for "google ngram data" will turn up lots of scripts that various people have developed for munging this data.