Code for processing xDD files. This is only useful if you have access to the xDD data.
Processing involves several steps, some of them done in this repository:
To see how to integrate all this processing, see the xdd-integration repository.
See https://github.com/lapps-xdd/xdd-docstructure.
Use the script ner.py in this repository, which requires spaCy to run.
$ pip install spacy==3.5.1
$ python -m spacy download en_core_web_sm
To run the script:
$ python ner.py --doc DIR1 --pos DIR2 --ner DIR3 [--limit N]
The input in DIR1 should contain the output files from the document structure parser. Part-of-speech data is written to DIR2 and named entities to DIR3. If --limit is given, no more than N files are processed.
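The overall shape of this step can be sketched as a loop over the input directory that writes one POS file and one NER file per input document. This is a minimal sketch, not the actual ner.py: the per-file analysis below is a placeholder, where the real script runs spaCy's en_core_web_sm pipeline, and all field names are illustrative assumptions.

```python
import json
from pathlib import Path


def process_documents(doc_dir, pos_dir, ner_dir, limit=None):
    """Sketch of the ner.py processing loop: read document-structure
    output from doc_dir, write POS data to pos_dir and entities to
    ner_dir, stopping after `limit` files if a limit is given."""
    pos_path, ner_path = Path(pos_dir), Path(ner_dir)
    pos_path.mkdir(parents=True, exist_ok=True)
    ner_path.mkdir(parents=True, exist_ok=True)
    files = sorted(Path(doc_dir).glob("*.json"))
    if limit is not None:
        files = files[:limit]
    for infile in files:
        text = infile.read_text()
        # Placeholder analysis; a real run would load a spaCy model
        # (nlp = spacy.load("en_core_web_sm")) and apply it to the text.
        pos_data = {"name": infile.stem, "tokens": len(text.split())}
        ner_data = {"name": infile.stem, "entities": []}
        (pos_path / infile.name).write_text(json.dumps(pos_data))
        (ner_path / infile.name).write_text(json.dumps(ner_data))
    return len(files)
```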
See https://github.com/lapps-xdd/xdd-terms.
Requires output from previous processing stages as well as a file with metadata.
$ python merge.py --scpa DIR1 --doc DIR2 --ner DIR3 --trm DIR4 --meta FILE --out DIR5 [--limit N]
For input we have ScienceParse results (DIR1), document parser results (DIR2), named entities (DIR3), and terms (DIR4), plus a metadata file. Output is written to DIR5. See merge.py for example usage.
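Conceptually, the merge combines the per-document results of the four processing stages with the shared metadata, keyed by document identifier. The sketch below takes dictionaries rather than directories, and its field names are illustrative assumptions, not the actual schema of merge.py.

```python
def merge_documents(scpa, doc, ner, trm, meta):
    """Sketch of the merge step: for every document in the document
    parser results, collect the matching entries from the other stages
    into a single record. Missing entries come out as None."""
    merged = {}
    for name in doc:
        merged[name] = {
            "name": name,
            "scienceparse": scpa.get(name),   # ScienceParse results
            "docstructure": doc.get(name),    # document parser results
            "entities": ner.get(name),        # named entities
            "terms": trm.get(name),           # extracted terms
            "metadata": meta.get(name),       # from the metadata file
        }
    return merged
```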
Created from the merged data with prepare_elastic.py:
$ python prepare_elastic.py -i DIR1 -o DIR2 [--domain DOMAIN] [--limit N]
Takes merged files from DIR1 and creates a file elastic.json in DIR2. The file contains pairs of lines as required by Elasticsearch's bulk format (the second line of the pair is shown spread over several lines for readability; in the actual file it must be a single line, otherwise Elasticsearch fails to load it):
{"index": {"_id": "54b4324ee138239d8684aeb2"}}
{
"domain": "biomedical",
"name": "54b4324ee138239d8684aeb2",
"year": 2010,
"title": "Nanomechanical properties of modern and fossil bone",
"authors": ["Sara E. Olesiak", "Matt Sponheimer", "Jaelyn J. Eberle", "Michelle L. Oyen"],
"abstract": "Relatively little is known about how diagenetic processes affect ...",
"url": "http://www.sciencedirect.com/science/article/pii/S0022283609014053",
"text": "...",
"summary": "...",
"terms": [...],
"entities": {...}
}
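Writing these pairs of lines can be sketched as below: for each document, an action line naming the `_id`, then the document itself, each serialized as exactly one line of JSON. This is a hedged illustration of the output format, not the actual prepare_elastic.py; the function name and the assumption that the `name` field supplies the `_id` are taken from the example above.

```python
import json


def write_bulk_file(documents, outfile):
    """Sketch: write documents in Elasticsearch bulk format, one
    action line plus one single-line source document per entry."""
    with open(outfile, "w") as fh:
        for doc in documents:
            # Action line: tells Elasticsearch the _id to index under.
            fh.write(json.dumps({"index": {"_id": doc["name"]}}) + "\n")
            # Source line: the whole document on a single line.
            fh.write(json.dumps(doc) + "\n")
```

A file in this shape can then be posted to an Elasticsearch `_bulk` endpoint.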