Dataset Finder (Data Discovery Tool)

victordibia commented 1 year ago

What

Data analysis and exploration typically begins with the assumption that the right dataset exists. For many scenarios, this assumption holds (e.g., organizational data already exists in a tidy CSV or JSON file). However, for other use cases, the right dataset may not be at hand and needs to be found.

The high-level goal of this functionality is to provide a set of approaches for finding data given a query or some other representation of the user's intent.

How

Supported approaches may include the following:

- Heuristic strategy: define a workflow for identifying datasets that may be relevant. For example, support fixed providers such as:
  - data.gov
  - GHO (https://www.who.int/data/gho/info/gho-odata-api)
  - GitHub, to find CSV or JSON files relevant to a query.
- Live agent strategy: define a mechanism that uses web search to identify relevant datasets.
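As a rough illustration of the heuristic strategy for one fixed provider, the sketch below builds a search URL and keeps only CSV/JSON resources. Everything here is an assumption for discussion: the endpoint (data.gov's catalog is assumed to expose a CKAN-style `package_search` action at this URL), the helper names `build_search_url` and `keep_tabular`, and the shape of the resource dictionaries (`format` and `url` keys).

```python
from urllib.parse import urlencode

# Assumed endpoint for a CKAN-style search on data.gov; verify before use.
DATA_GOV_SEARCH = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query: str, rows: int = 5, base_url: str = DATA_GOV_SEARCH) -> str:
    """Return a search URL for `query`, limited to `rows` results."""
    return f"{base_url}?{urlencode({'q': query, 'rows': rows})}"


def keep_tabular(resources, formats=("csv", "json")):
    """Keep resources whose declared format is one of `formats`.

    `resources` is assumed to be a list of dicts with "format" and "url" keys.
    """
    wanted = {f.lower() for f in formats}
    return [r for r in resources if str(r.get("format", "")).lower() in wanted]


if __name__ == "__main__":
    print(build_search_url("air quality", rows=3))
```

No network request is made here; fetching and parsing the response would be the provider-specific part of the workflow.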

Possibly start off with a base DataFinder class (with a find method), plus HeuristicsDataFinder and AgentDataFinder subclasses.
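One possible shape for that class layout, as a minimal sketch for discussion rather than a settled design. Only the three class names and the `find` method come from the description above; the `DatasetCandidate` type, the `limit` parameter, the in-memory `catalog`, the keyword-overlap scoring, and the injected `search_fn` are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DatasetCandidate:
    """Hypothetical result type: a pointer to a dataset, not the data itself."""
    title: str
    url: str
    source: str


class DataFinder(ABC):
    """Base class: every finder maps a query to candidate datasets."""

    @abstractmethod
    def find(self, query: str, limit: int = 5) -> List[DatasetCandidate]:
        ...


class HeuristicsDataFinder(DataFinder):
    """Ranks a fixed catalog by keyword overlap between query and title."""

    def __init__(self, catalog: Iterable[DatasetCandidate]):
        self.catalog = list(catalog)

    def find(self, query: str, limit: int = 5) -> List[DatasetCandidate]:
        terms = set(query.lower().split())
        scored = []
        for candidate in self.catalog:
            overlap = len(terms & set(candidate.title.lower().split()))
            if overlap:  # drop candidates that share no words with the query
                scored.append((overlap, candidate))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [candidate for _, candidate in scored[:limit]]


class AgentDataFinder(DataFinder):
    """Delegates to an injected search function (e.g., a web-search agent)."""

    def __init__(self, search_fn: Callable[[str], List[DatasetCandidate]]):
        self.search_fn = search_fn

    def find(self, query: str, limit: int = 5) -> List[DatasetCandidate]:
        return self.search_fn(query)[:limit]
```

Injecting `search_fn` keeps the agent-based finder testable without network access; the actual search mechanism is left open.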

P.S. If you are interested in working on this, please share your thoughts on a general approach for discussion and comment.

0xaaiden commented 1 year ago

Is the goal here to allow users to upload their own datasets or to offer a platform for data analysis from a bank of "pre"-provided ready datasets?

victordibia commented 1 year ago

Thanks Aiden. I am leaning more towards supporting discovery of data as opposed to hosting data (we probably can assume the user is able to do this already).

I updated the initial description to add more information.

nathantetro commented 1 year ago

Cool! This is what we’re doing at wobby.ai. We ingest tons of public data, enrich it with AI, and let you analyze it.

Right now we’re working with journalists, making it easy for them to find data stories in public data.

Would be cool to see this in LIDA. Love this project :)

Check us on:

https://wobby.ai/

microsoft / lida

Dataset Finder (Data Discovery Tool) #10
