metaphacts / semopenalex

obtaining complete RDF dataset #84

Closed balhoff closed 9 months ago

balhoff commented 10 months ago

If I want to download the complete dataset for the current release, do I need all the different files for each named graph, or just the latest one? For example under semopenalex / authors, there are files like:

Should I merge all of these? Thanks!

davidlamprecht commented 9 months ago

Hey @balhoff,

If you want to download the complete dataset for a specific release, you need ALL the different files for each named graph.

For the current SemOpenAlex version 4.0.0, that is 141 files for semopenalex/authors. You should merge all of these.

FYI: We follow the structure of the OpenAlex Data Snapshot for the provided folder structure. At the moment we also provide the two previous SemOpenAlex Data Dumps from 2022-11-21/ and 2023-04-24/. However, if you only want to download the latest version, you can ignore these folders and only download all .trig.gz files of the following named graphs:

authors/ (141 files for version 4.0.0)
concepts/ (1 file for version 4.0.0)
funders/ (1 file for version 4.0.0)
institutions/ (1 file for version 4.0.0)
publishers/ (1 file for version 4.0.0)
sources/ (1 file for version 4.0.0)
works/ (470 files for version 4.0.0)
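
Once the files are downloaded, the counts above can double as a completeness check. The following is a minimal sketch, not part of the SemOpenAlex tooling: the function names and the local folder layout (one subfolder per named graph under a root directory chosen by you) are assumptions made for illustration; only the expected counts come from the list above.

```python
from pathlib import Path

# File counts per named graph for version 4.0.0, taken from the list above.
EXPECTED_COUNTS = {
    "authors": 141,
    "concepts": 1,
    "funders": 1,
    "institutions": 1,
    "publishers": 1,
    "sources": 1,
    "works": 470,
}


def count_trig_files(root):
    """Count .trig.gz files found under each named-graph folder of `root`.

    Assumption: `root` is a local download directory containing one
    subfolder per named graph (authors/, works/, ...).
    """
    root = Path(root)
    counts = {}
    for name in EXPECTED_COUNTS:
        folder = root / name
        # A folder that was never downloaded counts as zero files.
        counts[name] = len(list(folder.rglob("*.trig.gz"))) if folder.is_dir() else 0
    return counts


def missing_files(root):
    """Return {folder: (found, expected)} for folders whose count differs."""
    found = count_trig_files(root)
    return {
        name: (found[name], expected)
        for name, expected in EXPECTED_COUNTS.items()
        if found[name] != expected
    }
```

Calling `missing_files("path/to/download")` returns an empty dict when every folder holds the expected number of files; the path here is a placeholder.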

balhoff commented 9 months ago

Thanks for the info @davidlamprecht.

VladimirAlexiev commented 3 months ago

> You should merge all of these.

A small correction: you could concat them, but I think it's faster to just load each in a semantic repo. For GraphDB, the easiest is to import them as server files in Workbench.
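
For anyone who does prefer concatenation, here is a minimal sketch using only the Python standard library. It relies on the fact that gzip files appended byte for byte form a valid multi-member gzip stream, so nothing has to be decompressed. The function names are hypothetical, and whether the resulting TriG is valid depends on assumptions not verified for these dumps: each part must be a self-contained TriG document with its own prefix declarations, and blank-node labels must not clash across parts.

```python
import gzip
import shutil
from pathlib import Path


def concat_gzip(parts, out_path):
    """Append gzip files byte for byte into one multi-member gzip file."""
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def merge_named_graph(folder, out_path):
    """Concatenate all .trig.gz files in `folder` (sorted by name) into `out_path`.

    Returns the number of parts merged. The output file itself is skipped
    in case it lives in the same folder.
    """
    out_path = Path(out_path)
    parts = [
        p for p in sorted(Path(folder).glob("*.trig.gz"))
        if p.resolve() != out_path.resolve()
    ]
    concat_gzip(parts, out_path)
    return len(parts)
```

The merged file can be read back with `gzip.open(out_path, "rt")`, which decompresses all members in sequence. Loading each file separately, as suggested above, avoids the validity caveats entirely.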