inveniosoftware / cookiecutter-invenio-rdm

Cookiecutter template for a new InvenioRDM instance.

New ROR data dump #213

Open fenekku opened 2 years ago

fenekku commented 2 years ago

Is your feature request related to a problem? Please describe.

ROR has released a new data dump (funnily enough, on Zenodo): https://zenodo.org/record/6347575. This probably means the ROR vocabulary dump here should be updated (or at least reviewed).

As was mentioned in a telecon, the ROR list in this module is a filtered one. Perhaps the filtering process can be shared too.

Describe the solution you'd like

An updated affiliations_ror.yaml.

tmorrell commented 2 years ago

I'm planning to try updating this. I think invenio vocabularies convert -v funders -o "/path/to/ror-data-dump.json.zip" -t affiliations_ror.yaml is probably close, but I'm happy to use another script if one is available.

fenekku commented 2 years ago

In recent imports at NU, we've noticed that YAML loads very slowly. If you can use a .jsonl file instead, it would be drastically faster to load. Example: loading our 72MB+ of MeSH terms from YAML took ~240s, while the same data in .jsonl format took 1s (we saw a 149x speedup for LCSH terms too). Loading happens when invenio-cli services setup executes, so this greatly improves the installation flow.
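
For reference, a minimal sketch of the .jsonl layout, using only the Python standard library. Each line is one self-contained JSON object, so a loader can parse entries one at a time instead of building the whole document first. The entry contents and the helper names (write_jsonl, read_jsonl) are made up for illustration and are not part of InvenioRDM.

```python
import io
import json


def write_jsonl(entries, stream):
    """Write one JSON object per line (the JSON Lines layout)."""
    for entry in entries:
        stream.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield entries one line at a time, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical vocabulary entries, for illustration only.
entries = [
    {"id": "term-1", "title": {"en": "First term"}},
    {"id": "term-2", "title": {"en": "Second term"}},
]

# An in-memory buffer stands in for a real .jsonl file here.
buffer = io.StringIO()
write_jsonl(entries, buffer)
buffer.seek(0)
loaded = list(read_jsonl(buffer))
```

With a real file, the same functions work on the handle returned by open(path, encoding="utf-8").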

tmorrell commented 2 years ago

The invenio vocabularies result is close, but it's in a different order, so it will make a mess of a diff. ROR is also moving to monthly releases, so updating a static file in the cookiecutter is not going to be sustainable. It makes more sense to move the affiliation vocabulary to a datastream. I've been able to get it partially working, but I'm still having issues getting the writers registered. I will update as I get more time to work on it.

karkraeg commented 3 months ago

Hi, I don't know if this is still relevant for you, but we hacked together a script that formats the ROR dump into the YAML that InvenioRDM wants. If anyone is interested, I could clean it up and share it. It uses pandas, so filtering would be fairly easy to do.
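
Until such a script is shared, here is a rough sketch of what the conversion and filtering step might look like. It is not the script mentioned above: it uses plain Python rather than pandas, and it assumes ROR records carry "id", "name", "acronyms" and "status" keys and that the vocabulary entries want "id", "name", "title", "identifiers" and an optional "acronym". Both shapes are assumptions to check against the actual dump and the existing affiliations_ror.yaml. The sample records are invented.

```python
def ror_to_entry(record):
    """Map one ROR record (assumed shape) to a vocabulary entry (assumed shape)."""
    # ROR IDs are full URLs; keep only the trailing identifier.
    ror_id = record["id"].rsplit("/", 1)[-1]
    entry = {
        "id": ror_id,
        "name": record["name"],
        "title": {"en": record["name"]},
        "identifiers": [{"identifier": ror_id, "scheme": "ror"}],
    }
    acronyms = record.get("acronyms") or []
    if acronyms:
        entry["acronym"] = acronyms[0]
    return entry


def is_active(record):
    """Example filter: keep only records marked active (assumed field)."""
    return record.get("status") == "active"


def convert(records, keep=is_active):
    """Filter the records, then map the survivors to vocabulary entries."""
    return [ror_to_entry(r) for r in records if keep(r)]


# Invented sample records, for illustration only.
sample = [
    {"id": "https://ror.org/0example01", "name": "Example University",
     "acronyms": ["EXU"], "status": "active"},
    {"id": "https://ror.org/0example02", "name": "Example Withdrawn Institute",
     "acronyms": [], "status": "withdrawn"},
]

entries = convert(sample)
```

The resulting list of dicts can then be serialized to YAML, or to .jsonl as suggested earlier in the thread. Passing a different keep function to convert is where any custom filtering would go.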