togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars, 335 forks

Unlock open science for dataset generation #66

Open AbcSxyZ opened 11 months ago

AbcSxyZ commented 11 months ago

Hello everyone,

While exploring openness in AI, I ended up here and was wondering which open science sources you use for this kind of tool. I found only arXiv listed. Have you considered including more open science sources?

I'm not sure whether this comes from a lack of familiarity with how open access publication works, but I thought some explanation might help with developing tools to extract text from millions of scientific articles. Such a tool surely takes time to build, and I don't expect to develop it myself; I'm just trying to help open a useful discussion.

For context, I work on education about open models (open science, open education, open software, open hardware...), and I'm doing a bit of that here.

Quick landscape of Open Science

Open science is becoming mainstream in science policy: the White House declared 2023 the Year of Open Science. Publishing in open access is increasingly mandatory for researchers working on public funds, and countries are adopting open science policies, a trend fuelled by crises like COVID.

Universities and organisations are themselves involved in this shift, since it serves scientific dissemination, quality, equity...

Organisations are setting up open access repositories where they store their content. It is called DASH at Harvard and DSpace at MIT; CERN hosts a shared platform called Zenodo, and so on. Many universities have their own OA repository.

Explore open access repositories worldwide

All of these repositories are decentralised, so you need a way to query many of them at once to search effectively. There are open science search engines like CORE, which gives access to a large number of organisations (~10,000).

They do have an API, but it may not be the most suitable way to perform this kind of task.

Two things:

OAI-PMH queries could potentially retrieve all the metadata about a repository's content; these may be paths worth exploring. I hope this helps with digging into open science.

Shell example with Zenodo (I'm not sure what percentage of the resource metadata this command extracts):
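To make the idea of an OAI-PMH query more concrete, here is a minimal Python sketch of how request URLs are put together. The verbs (`Identify`, `ListRecords`) and the `metadataPrefix` argument come from the OAI-PMH 2.0 specification; the helper name `oai_url` is hypothetical, and the base URL is the Zenodo endpoint mentioned in this post. It only builds URLs and does not contact any server.

```python
from urllib.parse import urlencode


def oai_url(base_url, verb, params=None):
    """Build an OAI-PMH request URL (hypothetical helper, illustration only).

    `params` is an optional dict of protocol arguments such as
    metadataPrefix, set, from, until or resumptionToken.
    Arguments whose value is None are dropped.
    """
    query = {"verb": verb}
    for key, value in (params or {}).items():
        if value is not None:
            query[key] = value
    return f"{base_url}?{urlencode(query)}"


# Assumption: the Zenodo OAI-PMH endpoint quoted in this post.
ZENODO = "https://zenodo.org/oai2d"

# Ask the repository to describe itself.
print(oai_url(ZENODO, "Identify"))

# Ask for records in Dublin Core (oai_dc), the format every
# OAI-PMH repository is required to support.
print(oai_url(ZENODO, "ListRecords", {"metadataPrefix": "oai_dc"}))
```

The same helper should work for any repository that exposes an OAI-PMH endpoint, since only the base URL changes from one repository to the next.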

pip install oaiharvest
oai-harvest https://zenodo.org/oai2d -d oai_dc
ls oai_dc
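As a rough illustration of what the harvested metadata could be used for, here is a small Python sketch that reads an OAI-PMH `ListRecords` response in Dublin Core and pulls out the identifier, title and rights statements of each record, plus the `resumptionToken` used to request the next page. The XML namespaces are the ones defined by OAI-PMH 2.0 and Dublin Core; the sample document, its values and the function name `parse_list_records` are made up for the example. The files written by `oai-harvest` may be laid out differently, so treat this as a sketch under those assumptions.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response, shortened for illustration (not real Zenodo data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical paper</dc:title>
          <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            # A record may carry several rights statements.
            "rights": [e.text for e in rec.iterfind(".//dc:rights", NS)],
        })
    # A missing or empty token means there is no further page.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


records, token = parse_list_records(SAMPLE)
print(records)
print(token)
```

The rights field is the interesting part for dataset building, since it is where a repository can state the access conditions or licence of each item, although how consistently it is filled in probably varies between repositories.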