i2mint / py2store

Tools to create simple and consistent interfaces to complicated and varied data sources.
MIT License

py2store datasets #54

Open thorwhalen opened 4 years ago

This would be a separate, py2store-dependent repository.

The objective of this project is to offer easy and consistent access to various datasets.

We'll start with dataset providers that have a lot of data (so that we can get a lot out of the py2store wrapper we'll make for each).

The interface should start off like other hierarchical explorers, such as those for files (folders, subfolders, files) or DBs (e.g. mongo host > dbs > collections, or sql connection > dbs > tables). The first level would list the data providers or other named groups (with a misc group as the catch-all for unclassified datasets). For example:

>>> list(data_malls)
['kaggle', 'who', 'roda', 'misc']
>>> datasets = data_malls['kaggle']
>>> list(datasets)
['us_food_habits', 'covid_19', ...]
>>> dataset = datasets['covid_19']
# etc.
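
One way to get the listing behavior shown above is to nest read-only mappings: the top level maps provider names to dataset collections, and each collection maps dataset names to data. The following is only a minimal sketch under that assumption. `DataMalls`, `_kaggle_datasets`, and the placeholder dataset values are hypothetical names for illustration, not part of py2store.

```python
from collections.abc import Mapping


class DataMalls(Mapping):
    """Read-only mapping: provider name -> mapping of dataset name -> dataset."""

    def __init__(self, providers):
        # providers: {provider_name: zero-argument callable returning a Mapping}
        self._providers = providers

    def __getitem__(self, name):
        # The provider's dataset mapping is only built when it is asked for.
        return self._providers[name]()

    def __iter__(self):
        return iter(self._providers)

    def __len__(self):
        return len(self._providers)


def _kaggle_datasets():
    # Hypothetical stand-in for a real listing call to the provider.
    return {'us_food_habits': b'...', 'covid_19': b'...'}


data_malls = DataMalls(
    {'kaggle': _kaggle_datasets, 'who': dict, 'roda': dict, 'misc': dict}
)
datasets = data_malls['kaggle']
dataset = datasets['covid_19']
```

Because `DataMalls` subclasses `collections.abc.Mapping`, it gets `keys`, `items`, `in`, and `get` for free, so each level can be explored the same way as a dict.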

Check whether there's already a Python lib to connect to the data provider (mall), and check out the provider's API. If the API is easy to use with py2request (all we need is listing and download capabilities), use the raw API. If not, use the Python lib, if one is available.

Caching

We want to use caching smartly and automatically (with automatic refreshes on a schedule, and/or warnings when a refresh hasn't happened for a while). We want to cache listings as well as metadata and data.
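
As a rough illustration of the refresh-or-warn idea, here is a minimal sketch of a cached listing that re-fetches once its contents are older than a maximum age, or only warns if asked to. It refreshes lazily on access rather than through a real scheduler, and `CachedListing`, `fetch`, `max_age`, `warn_only`, and `clock` are all hypothetical names, not py2store API.

```python
import time
import warnings
from collections.abc import Mapping


class CachedListing(Mapping):
    """Caches the result of `fetch`, refreshing (or warning) when it gets old."""

    def __init__(self, fetch, max_age=3600.0, warn_only=False, clock=time.time):
        self._fetch = fetch          # zero-argument callable returning a dict
        self._max_age = max_age      # seconds before the cache counts as stale
        self._warn_only = warn_only  # if True, warn instead of auto-refreshing
        self._clock = clock          # injectable so staleness can be tested
        self._data = None
        self._fetched_at = None

    def refresh(self):
        self._data = dict(self._fetch())
        self._fetched_at = self._clock()

    def _current(self):
        if self._data is None:
            self.refresh()
        elif self._clock() - self._fetched_at > self._max_age:
            if self._warn_only:
                warnings.warn("cache is stale; call refresh()", stacklevel=3)
            else:
                self.refresh()
        return self._data

    def __getitem__(self, key):
        return self._current()[key]

    def __iter__(self):
        return iter(self._current())

    def __len__(self):
        return len(self._current())
```

The same wrapper could sit at each level (provider listings, dataset metadata, dataset contents), with a different `max_age` per level, since listings and data are likely to go stale at different rates.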

Depending on the context, the cache could work in many ways. For example:

Dataset providers

More links

https://www.freecodecamp.org/news/https-medium-freecodecamp-org-best-free-open-data-sources-anyone-can-use-a65b514b0f2d/