Open fenekku opened 2 years ago
I'm planning on trying to update this. I think invenio vocabularies convert -v funders -o "/path/to/ror-data-dump.json.zip" -t affiliations_ror.yaml is probably close, but I'm happy to use another script if one is available.
In recent imports at NU, we've noticed that YAML loads very slowly. If you can use a .jsonl file instead, it would be drastically faster to load. Example: loading our 72MB+ worth of MeSH terms with YAML took ~240s, while the same data in .jsonl format took 1s (we saw a 149x speedup for LCSH terms too). This loading happens when invenio-cli services setup executes, so it greatly improves the installation flow.
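For anyone who wants to try the switch, here is a minimal sketch of the .jsonl layout: one JSON object per line, so a loader can parse records one at a time instead of building the whole document first. The helper names write_jsonl and read_jsonl are hypothetical, not part of invenio-vocabularies, and the sketch assumes the vocabulary entries are already available as Python dicts.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line.

    Hypothetical helper for illustration; not an invenio-vocabularies API.
    """
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps non-ASCII labels readable in the file.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield one dict per non-empty line, parsing each line independently."""
    with Path(path).open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because each line stands alone, the reader never has to hold or parse the full file at once, which is one plausible reason for the gap in the timings above.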
The invenio vocabularies result is close, but it's in a different order, so it will make a mess of a diff. ROR is also moving to monthly releases, so updating a static file in the cookiecutter is not going to be sustainable. It makes more sense to transfer the affiliation vocabulary to a datastream. I've been able to get it partially working, but am still having issues getting the writers registered. Will update as I get more time to work on it.
Hi, I don't know if this is still relevant for you, but we hacked together a script that formats the ROR dump into the YAML that InvenioRDM wants. If someone is interested, I could clean it up and share it. It uses pandas, so filtering would be fairly easy to do.
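To make the shape of such a conversion concrete, here is a rough standard-library sketch. It is not the pandas script mentioned above. It assumes ROR records carry "id" (a full https://ror.org/... URL), "name" and an optional "acronyms" list, and that the target entries use id, name, title, acronym and identifiers keys; both are assumptions to check against the actual dump and the existing affiliations_ror.yaml. The function names and the keep predicate are hypothetical. Sorting by id is one way to keep the output order stable between releases, which would address the diff concern raised earlier in the thread.

```python
def ror_to_affiliation(record):
    """Map one ROR record (assumed schema) to a vocabulary entry (assumed schema)."""
    # Assumption: "id" is a URL such as https://ror.org/<short-id>.
    ror_id = record["id"].rsplit("/", 1)[-1]
    entry = {
        "id": ror_id,
        "name": record["name"],
        "title": {"en": record["name"]},
        "identifiers": [{"scheme": "ror", "identifier": ror_id}],
    }
    acronyms = record.get("acronyms") or []
    if acronyms:
        # Only the first acronym is kept in this sketch.
        entry["acronym"] = acronyms[0]
    return entry


def convert(records, keep=lambda record: True):
    """Filter records with `keep`, convert them, and sort by id for stable diffs."""
    entries = [ror_to_affiliation(r) for r in records if keep(r)]
    return sorted(entries, key=lambda entry: entry["id"])
```

The keep predicate is where any filtering rule would plug in, for example convert(records, keep=lambda r: bool(r.get("acronyms"))) to keep only organisations that have an acronym. The actual filter used for the shipped list is not documented in this thread.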
Is your feature request related to a problem? Please describe.
ROR has released a new data dump (funnily on Zenodo): https://zenodo.org/record/6347575 . This probably means the ROR vocabulary dump here should be updated (or at least reviewed).
As was mentioned in a telecon, the ROR list in this module is a filtered one. Perhaps the filtering process can be shared too.
Describe the solution you'd like
An updated affiliations_ror.yaml.