Beginning with release v1.45 on 11 April 2024, data releases contain JSON and CSV files formatted according to both schema v1 and schema v2. Version 2 files have _schema_v2 appended to the end of the filename, e.g., v1.45-2024-04-11-ror-data_schema_v2.json. In order to maintain compatibility with previous release, version 1 files have no version information in the filename, e.g., v1.45-2024-04-11-ror-data.json.
:bomb: This breaks the funders convert script, which goes through both v1 and v2 files:
$ invenio vocabularies convert -v funders -o v1.46-2024-05-02-ror-data.zip -t output.yaml
[...]
RORTransformer: Name not found in ROR entry.
RORTransformer: Name not found in ROR entry.
RORTransformer: Name not found in ROR entry.
Vocabulary funders converted. Total items 218710.
109355 items succeeded
109355 contained errors
0 were filtered.
:adhesive_bandage: This pull request:
- Adds a negative lookbehind assertion for `_schema_v2` before `.json`, so that the v2 files are no longer matched.
- Escapes the `.` in all the datastream regexes, since an unescaped `.` matches any character.
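As a rough illustration of the two changes, the sketch below compares an unescaped pattern with an escaped pattern that also carries the negative lookbehind. The `loose` pattern is an assumption about what the previous regex looked like, and the file list is made up from the naming convention in the ROR quote; neither is copied from the repository.

```python
import re

# Hypothetical archive members following the ROR naming convention.
names = [
    "v1.45-2024-04-11-ror-data.json",
    "v1.45-2024-04-11-ror-data_schema_v2.json",
    "v1.45-2024-04-11-ror-data.csv",
]

# Assumed shape of the old pattern: the unescaped "." matches any character.
loose = re.compile(r".json$")

# Escaped dot plus negative lookbehind: the name must end in a literal
# ".json" that is not immediately preceded by "_schema_v2".
strict = re.compile(r"(?<!_schema_v2)\.json$")

loose_matches = [n for n in names if loose.search(n)]
strict_matches = [n for n in names if strict.search(n)]

print(loose_matches)   # both JSON files
print(strict_matches)  # only the v1 JSON file
```

Python's `re` module only accepts fixed-width lookbehinds, which is fine here because `_schema_v2` is a literal string.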
:heavy_check_mark: The funders convert script then works as expected:
Partially fixes #305
Description
:books: The opening paragraph quotes the ROR data dump documentation.
:information_source: Remark: if and when we move to v2, the regex can easily be changed to `"regex": "_schema_v2\\.json$"`.
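For comparison, a minimal sketch of how the v2-only pattern from the remark would behave. The file names are hypothetical examples built from the ROR naming convention. In a Python raw string the dot needs a single backslash; the doubled backslash in the remark is the escaping needed inside a quoted configuration string.

```python
import re

# Hypothetical archive members following the ROR naming convention.
names = [
    "v1.46-2024-05-02-ror-data.json",
    "v1.46-2024-05-02-ror-data_schema_v2.json",
    "v1.46-2024-05-02-ror-data_schema_v2.csv",
]

# Select only names that end in the literal suffix "_schema_v2.json".
v2_only = re.compile(r"_schema_v2\.json$")

v2_matches = [n for n in names if v2_only.search(n)]
print(v2_matches)
```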