inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

workflows: report malformed and missing IDs in authors XML files #3033

Open michamos opened 6 years ago

michamos commented 6 years ago

Context

When an author has not been assigned an INSPIRE ID yet, collaborations fill the ID field with all kinds of placeholders, like None or ???, or leave it empty.

Current Behavior

Because of these placeholders, after extracting author information from the authors XML file, the record might be invalid, or some authors might be lacking an ID without us noticing.

Expected Behavior

The invalid authors are ignored, and an RT ticket is created with information about the record and the authors having invalid or missing IDs.
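
A minimal sketch of the check described above, not tied to the current pipeline. It assumes that a well-formed ID is `INSPIRE-` followed by eight digits and that authors arrive as plain dicts with `full_name` and `inspire_id` keys. The helper names (`classify_inspire_id`, `split_authors`, `build_ticket_body`) are hypothetical, and the actual RT ticket creation is left out.

```python
import re

# Assumption: a well-formed INSPIRE ID is "INSPIRE-" followed by 8 digits.
INSPIRE_ID_RE = re.compile(r"INSPIRE-\d{8}")

# Placeholders mentioned in this issue (compared case-insensitively),
# plus the empty string for fields that were left blank.
MISSING_PLACEHOLDERS = {"", "none", "???"}


def classify_inspire_id(raw_id):
    """Return 'valid', 'missing' or 'malformed' for a raw ID value."""
    value = (raw_id or "").strip()
    if value.lower() in MISSING_PLACEHOLDERS:
        return "missing"
    if INSPIRE_ID_RE.fullmatch(value):
        return "valid"
    return "malformed"


def split_authors(authors):
    """Group author dicts by the status of their 'inspire_id' field."""
    buckets = {"valid": [], "missing": [], "malformed": []}
    for author in authors:
        status = classify_inspire_id(author.get("inspire_id"))
        buckets[status].append(author)
    return buckets


def build_ticket_body(record_id, buckets):
    """Build the text of a report listing the problematic authors."""
    lines = ["Record: {}".format(record_id)]
    for status in ("malformed", "missing"):
        for author in buckets[status]:
            lines.append(
                "{} ID: {} (value: {!r})".format(
                    status, author.get("full_name"), author.get("inspire_id")
                )
            )
    return "\n".join(lines)


# Hypothetical sample data for illustration only.
authors = [
    {"full_name": "Doe, Jane", "inspire_id": "INSPIRE-00123456"},
    {"full_name": "Roe, Richard", "inspire_id": "???"},
    {"full_name": "Poe, Ann", "inspire_id": "INSPIRE-123"},
    {"full_name": "Moe, Al"},
]
buckets = split_authors(authors)
ticket_body = build_ticket_body(1234, buckets)
```

Only the authors in `buckets["valid"]` would be kept in the record, while `ticket_body` holds the text that would go into the ticket. Keeping "missing" and "malformed" in separate buckets makes it easy to report only one of the two kinds.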

Note

It might make sense to rewrite the authors XML extraction using parsel (the library powering scrapy XML parsing) and the SignatureBuilder instead of bolting this behavior on top of the current XSLT+dojson pipeline.

cc @hoc3426 @annetteholtkamp

michamos commented 6 years ago

Besides, we would need to have a list of mappings similar to the one in inspirehep/inspire#351 to automatically fix incorrect INSPIRE IDs.

hoc3426 commented 6 years ago

I think we'd want to know about malformed IDs. Missing ones would probably generate too much work at first.