inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

Duplicated code that generates 'acquisition_source' #175

Closed iulianav closed 7 years ago

iulianav commented 7 years ago

@david-caro

Expected Behavior

To ensure the schema is respected, the 'acquisition_source' should be generated with the LiteratureBuilder defined in inspire-schemas, and the 'submission_number' field should not be populated for records ingested by hepcrawl. See http://inspire-schemas.readthedocs.io/en/latest/schemas/records/elements/acquisition_source.html#acquisition-source-json

Current Behavior

The 'acquisition_source' is generated correctly via the LiteratureBuilder in some places (https://github.com/inspirehep/hepcrawl/blob/e749a26ca9b77f61c5abb20b63e295a4f75a6508/hepcrawl/tohep.py#L229-L231), and in other places (https://github.com/inspirehep/hepcrawl/blob/e749a26ca9b77f61c5abb20b63e295a4f75a6508/hepcrawl/tohep.py#L142-L149) by a separate function that does a similar thing. That function also incorrectly populates the 'submission_number' field of hepcrawl records with the SCRAPY_JOB id.
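To illustrate the intent, here is a minimal sketch of a single code path for building the field. It is not the inspire-schemas LiteratureBuilder API nor the existing hepcrawl code: `build_acquisition_source` is a hypothetical helper, and the key names ('datetime', 'method', 'source') are assumed from the acquisition_source schema linked above. The point is only that every caller goes through one function and that 'submission_number' is never set from SCRAPY_JOB.

```python
import datetime


def build_acquisition_source(source, method='hepcrawl', now=None):
    """Hypothetical helper: the one place that builds 'acquisition_source'.

    Key names are assumed from the inspire-schemas acquisition_source
    element. 'submission_number' is deliberately left out, since it
    should not be populated for hepcrawl-ingested records.
    """
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return {
        'datetime': now.isoformat(),
        'method': method,
        'source': source,
    }


# Every code path would call the same helper instead of its own copy.
record = {'acquisition_source': build_acquisition_source('arXiv')}
```

In the real fix the helper body would presumably delegate to the LiteratureBuilder, so that schema validation stays in inspire-schemas.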

Steps to Reproduce (for bugs)

  1. Trigger a crawler in order to ingest an arXiv record.
  2. Check the XML format of the record, especially the '541_e' field.
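The check in step 2 can be scripted. The sketch below assumes the record is available as a MARCXML string and that '541_e' means subfield `e` of datafield `541`; the sample record and its job id value are made up for illustration, and `get_541_e_values` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}datafield'."""
    return tag.rsplit('}', 1)[-1]


def get_541_e_values(marcxml):
    """Return the text of every 541 $e subfield found in the record."""
    root = ET.fromstring(marcxml)
    values = []
    for element in root.iter():
        if local_name(element.tag) != 'datafield':
            continue
        if element.get('tag') != '541':
            continue
        for subfield in element:
            is_subfield = local_name(subfield.tag) == 'subfield'
            if is_subfield and subfield.get('code') == 'e':
                values.append(subfield.text or '')
    return values


# Made-up sample record, for illustration only.
SAMPLE = """<record>
  <datafield tag="541" ind1=" " ind2=" ">
    <subfield code="a">arXiv</subfield>
    <subfield code="c">hepcrawl</subfield>
    <subfield code="e">hypothetical-scrapy-job-id</subfield>
  </datafield>
</record>"""

# A non-empty result means 'submission_number' was populated.
populated = get_541_e_values(SAMPLE)
```

For a hepcrawl record that follows the expected behavior, the returned list would be empty.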


iulianav commented 7 years ago

@michamos now that the 'submission_number' field is going to be populated for hepcrawl records as well later on, neither the schema nor the name of the field is accurate anymore. Right?

michamos commented 7 years ago

You are right. We could rename it to holdingpen_record and let it be a json_reference to the holdingpen record.

david-caro commented 7 years ago

Related to https://github.com/inspirehep/inspire-next/issues/2687