Closed. atopal closed this issue 1 year ago.
sounds neat 😉
And it would be really neat to have machine learning applications in mind when tackling this. Machine learning is a very hot topic and requires a lot of data. Furthermore, for scientific machine learning, reproducibility is essential: processes that produce a well-known output for a specific input help a lot. So tools that deterministically identify specific versions of datasets (i.e. the inputs and outputs of machine learning models) would be very beneficial.
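To illustrate the idea, here is a minimal sketch of deterministic dataset identification via content hashing, which is essentially what IPFS CIDs generalize. The `dataset_fingerprint` function is purely illustrative, not an existing API; it just shows that a canonical encoding plus a hash yields a stable identifier for a specific dataset version:

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Deterministically fingerprint a dataset: identical records yield an identical ID."""
    # Canonical encoding (sorted keys, no whitespace) so the hash is stable
    # regardless of dict ordering or formatting.
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

train = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
fingerprint = dataset_fingerprint(train)
```

Any change to the records changes the fingerprint, so a model run can record exactly which dataset version it consumed.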
I'm a natural language processing researcher and work a lot with https://github.com/huggingface/datasets. This tool plus its collection of scripts provides a promising way to easily integrate a wide variety of data into the machine learning model of your choice. However, versioning is a pain, the original data is usually stored on a single server, and creating one's own datasets, or deriving new ones by (automatically) annotating existing datasets, is not transparently modeled, which often makes the dataset creation process non-reproducible. I really like the concepts around IPFS and think there are a lot of potential synergies with the field of machine learning.
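A content-addressed provenance record could make such derivations transparent: the derived dataset's identifier links back to its parent and the transform that produced it. A hypothetical sketch, assuming a simple JSON-based encoding (`cid` here is a stand-in for a real IPFS CID, and `derive` is an illustrative helper, not an existing library function):

```python
import hashlib
import json

def cid(obj):
    # Stand-in for an IPFS CID: sha256 over a canonical JSON encoding.
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def derive(parent, transform_name, transform):
    """Apply a transform to a dataset and emit a provenance record
    linking the derived dataset back to its parent."""
    child = [transform(record) for record in parent]
    provenance = {
        "parent": cid(parent),
        "transform": transform_name,
        "child": cid(child),
    }
    return child, provenance

source = [{"text": "good movie"}, {"text": "bad movie"}]
annotated, prov = derive(
    source,
    "keyword-sentiment",
    lambda r: {**r, "label": int("good" in r["text"])},
)
```

Because both identifiers are derived from content, anyone can re-run the transform on the parent and verify they obtained the same child, which is exactly the reproducibility property missing from ad-hoc annotation pipelines.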
Note: this is part of the 2021 IPFS project planning process. Feel free to propose other potential 2021 themes for the IPFS project by opening a new issue, or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others' proposed themes and leave feedback here!
Theme description
Create a delightful experience for storing and working with datasets on IPFS by building awesome application stacks that include storage, replication, retrieval, and computation, and by improving the necessary parts of the core implementations to enable these use cases. The goal: ~20% of the world's important datasets stored on these systems.
Hypothesis
IPFS’s ability to enable accessibility, portability, and extensibility of data is a great fit for many dataset applications, solving many of the problems that dataset storage and retrieval face in web2 models. Current IPFS implementations are not far from being able to address almost all of these problems and capture these use cases.
Vision statement
There is a rich ecosystem of IPFS-based applications that support onboarding, versioning, and utilizing major datasets, and they are the premier places to store and interface with the world’s most important datasets. This data has gravity, with a rich, budding application ecosystem being built on top of these stored datasets to address many end use cases. Tooling improves from the feedback loop generated by building these products on IPFS.
Why focus this year
This is a use case where IPFS can likely provide a ton of value, even at its current level of maturity (e.g., before read/write privacy). The use case is large and important. Further, the value Filecoin provides as a backup medium and the momentum from its ecosystem make this a great opportunity to focus on in 2021.
Example workstreams
- Development of IPFS-based applications to store, replicate, serve, and process many types and applications of datasets
- Improvement of core implementations to handle large datasets (e.g., scaling the ability to handle large volumes of provider records, increasing throughput on transport connections and saturating those connections, connecting to multiple providers)
- Addition of maintainers of key datasets to the IPFS ecosystem