inspirehep / invenio-grobid

Invenio package for integration of the Grobid metadata extraction service
GNU General Public License v2.0

Map TEI format to data model #2

Closed: jalavik closed this issue 8 years ago

jalavik commented 9 years ago

We need to convert TEI format to our data model. Here are the proposed steps:

  1. Parse XML (e.g. using PyXB https://gist.github.com/jacquerie/88c69a844abe276546ec)
  2. Map parsed XML into a dictionary for dojson to consume (similar to https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/utils.py#L27)
  3. Create dojson mappings inside inspire-next/dojson/tei to map the dict from (2) to our data model, following the same structure and schema as the inspire-next/dojson/hep mappings. These would be similar to https://github.com/inspirehep/inspire-next/blob/master/inspire/dojson/hep/fields/bd1xx.py and would use a new processor like https://github.com/inspirehep/inspire-next/blob/master/inspire/dojson/processors.py#L25-L44
  4. Integrate this data conversion into grobid in a generic way (no mention of inspire), perhaps as a config variable GROBID_RECORD_PROCESSOR = "some.dojson.processor:convert_tei", with logic similar to https://github.com/inveniosoftware/invenio-records/blob/master/invenio_records/manage.py#L55-L58
  5. The final JSON mapping can then be returned to the grobid interface, either as a JSON dump or, better yet, using jsoneditor with jsonschema to generate a form with the data prefilled right from the start.
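
Steps 1, 2 and 4 above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses the standard library's `xml.etree.ElementTree` instead of PyXB, a plain function instead of real dojson rules, and `importlib` as a stand-in for whatever import helper the package ends up using. The names `tei_to_dict`, `convert_tei`, `load_processor`, the sample TEI snippet and the output keys (`titles`, `authors`, `full_name`) are all hypothetical and not taken from the actual implementation.

```python
import importlib
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed TEI header, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">A sample title</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Lovelace</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def tei_to_dict(tei_xml):
    """Steps 1 and 2: parse the TEI XML and flatten it into a plain dict."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forenames = [
            f.text.strip() for f in pers.findall("tei:forename", TEI_NS) if f.text
        ]
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        authors.append({"forenames": forenames, "surname": surname.strip()})
    return {"title": title.strip(), "authors": authors}


def convert_tei(parsed):
    """Stand-in for the dojson mapping of step 3 (hypothetical output keys)."""
    record = {}
    if parsed.get("title"):
        record["titles"] = [{"title": parsed["title"]}]
    authors = []
    for author in parsed.get("authors", []):
        given = " ".join(author["forenames"])
        full_name = ", ".join(p for p in (author["surname"], given) if p)
        if full_name:
            authors.append({"full_name": full_name})
    if authors:
        record["authors"] = authors
    return record


def load_processor(import_string):
    """Step 4: resolve a 'package.module:callable' string to the callable."""
    module_name, _, attr = import_string.partition(":")
    return getattr(importlib.import_module(module_name), attr)


if __name__ == "__main__":
    record = convert_tei(tei_to_dict(SAMPLE_TEI))
    # "json:dumps" only demonstrates the lookup; a real deployment would point
    # GROBID_RECORD_PROCESSOR at something like
    # "some.dojson.processor:convert_tei".
    serialize = load_processor("json:dumps")
    print(serialize(record, indent=2))
```

Keeping the processor behind an import string means the grobid package never has to import anything from inspire directly; each installation supplies its own converter through configuration.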

jalavik commented 8 years ago

Done by @jacquerie