openaire / iis

Information Inference Service of the OpenAIRE system
Apache License 2.0

Implement WileyML XML records parser producing DocumentText datastore #1440

Open marekhorst opened 9 months ago

marekhorst commented 9 months ago

Originally requested in: https://support.openaire.eu/issues/8896#note-98

This parser should be responsible for:

Currently the input is a DocumentText datastore with the file name set as id and the WileyML record as text.

We could start with id extraction and, in the beginning, propagate the full XML record as text. Instead of relying on the file name, which currently identifies Wiley XMLs in the DocumentText Avro datastore, we should identify those XML records with something better, such as a DOI. DOIs are available in the XMLs, so it should be possible to extract them. A single WileyML record defines multiple DOIs (e.g. identifying the journal or issue in addition to the article), so we should carefully pick the right DOI and pick a replacement whenever the article DOI is not available.
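
To make the DOI selection concrete, here is a rough Python sketch of the idea, not the IIS implementation. It assumes that each DOI sits in a `doi` element directly under a `publicationMeta` element, and that a `level` attribute value of `unit` marks the article-level entry. Those element and attribute names are assumptions to be verified against real WileyML records. The function name `extract_id` is hypothetical, and falling back to the file name is just one possible replacement strategy.

```python
import xml.etree.ElementTree as ET

# Assumption: the article-level publicationMeta carries level="unit";
# journal- and issue-level entries use other values and are ignored here.
ARTICLE_LEVEL = "unit"


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}doi' -> 'doi'."""
    return tag.rsplit("}", 1)[-1]


def extract_id(file_name, wileyml_text):
    """Return the article DOI if one is found, otherwise the file name.

    Hypothetical helper: the fallback to file_name is an assumed
    replacement strategy, not something the issue prescribes.
    """
    try:
        root = ET.fromstring(wileyml_text)
    except ET.ParseError:
        # Malformed record: keep the existing file-name-based id.
        return file_name

    for elem in root.iter():
        if local_name(elem.tag) != "publicationMeta":
            continue
        if elem.get("level") != ARTICLE_LEVEL:
            continue
        for child in elem:
            if local_name(child.tag) == "doi" and child.text and child.text.strip():
                return child.text.strip()

    # No article-level DOI found: fall back to the file name.
    return file_name
```

The record text itself would be passed through unchanged in this first step; only the id changes. Whether an issue-level or journal-level DOI is ever an acceptable replacement is left open here, since such DOIs would not uniquely identify an article.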