lapps-xdd / xdd-processing

Processing xDD Data

Code for processing xDD files. It is only useful if you have access to the xDD data.

Processing involves several steps, some of them done in this repository:

  1. Document structure parsing (done using the code in https://github.com/lapps-xdd/xdd-docstructure).
  2. Extracting named entities with spaCy.
  3. Generating term lists (done using the code in https://github.com/lapps-xdd/xdd-terms).
  4. Merging document structure, named entities, terms and metadata.
  5. Preparing the file that will be imported into the database.

To see how these processing steps are integrated, see xdd-integration.

1. Document structure parsing

See https://github.com/lapps-xdd/xdd-docstructure.

2. Named entity extraction

Use the script ner.py in this repository, which requires spaCy to run.

$ pip install spacy==3.5.1
$ python -m spacy download en_core_web_sm

To run the script:

$ python ner.py --doc DIR1 --pos DIR2 --ner DIR3 [--limit N]

DIR1 should contain the output files from the document structure parser. Part-of-speech data is written to DIR2 and named entities to DIR3. If --limit is used, then no more than N files will be processed.
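
The core of this step can be sketched as follows. This is a minimal illustration and not the actual contents of ner.py: the function names (select_files, extract_entities, load_pipeline) and the output record layout are assumptions made for the example. Only spacy.load and the entity attributes text, label_, start_char and end_char are standard spaCy API.

```python
from pathlib import Path


def select_files(directory, limit=None):
    """Return the files in `directory` in sorted order, at most `limit` of them."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return files if limit is None else files[:limit]


def extract_entities(nlp, text):
    """Run a spaCy pipeline over `text` and return its entities as plain dicts.

    The record layout is hypothetical, not the format written by ner.py.
    """
    doc = nlp(text)
    return [
        {"text": ent.text, "label": ent.label_,
         "start": ent.start_char, "end": ent.end_char}
        for ent in doc.ents
    ]


def load_pipeline(model="en_core_web_sm"):
    """Load the spaCy model installed above (spaCy is imported lazily)."""
    import spacy
    return spacy.load(model)
```

With the model installed, extract_entities(load_pipeline(), text) returns one record per entity found in text, and select_files(DIR1, limit=N) approximates the documented effect of --limit N.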

3. Term extraction

See https://github.com/lapps-xdd/xdd-terms.

4. Merging

Merging requires the output of the previous processing stages as well as a file with metadata.

$ python merge.py --scpa DIR1 --doc DIR2 --ner DIR3 --trm DIR4 --meta FILE --out DIR5 [--limit N]

The inputs are ScienceParse results (DIR1), document parser results (DIR2), named entities (DIR3), terms (DIR4) and a metadata file. Output is written to DIR5. See merge.py for example usage.
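
Conceptually, merging combines the per-document outputs of the earlier stages into one record. The sketch below only illustrates that idea and is not how merge.py is implemented. It assumes that every stage writes one JSON file per document under the same file name, and that the metadata has already been loaded into a dictionary keyed by document name. The names load_json and merge_document, and the keys of the merged record, are hypothetical.

```python
import json
from pathlib import Path


def load_json(path):
    """Return the parsed JSON at `path`, or None if the file is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def merge_document(name, stage_dirs, metadata):
    """Combine all stage outputs for one document into a single dict.

    `stage_dirs` maps a stage label (e.g. "scpa", "doc", "ner", "trm") to
    its directory; `metadata` maps document names to metadata dicts.
    Both layouts are assumptions made for this sketch.
    """
    merged = {"name": name, "meta": metadata.get(name, {})}
    for stage, directory in stage_dirs.items():
        # A stage without a file for this document contributes None.
        merged[stage] = load_json(Path(directory) / f"{name}.json")
    return merged
```

Recording a missing stage as None, instead of raising an error, lets a caller decide whether an incomplete document should be skipped or kept.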

5. Preparing the database file

The database file is created from the merged data with prepare_elastic.py:

$ python prepare_elastic.py -i DIR1 -o DIR2 [--domain DOMAIN] [--limit N] 

The script takes merged files from DIR1 and creates a file elastic.json in DIR2. The file consists of pairs of lines, as required by Elasticsearch. In the example below the second line of the pair is spread over several lines for clarity; in the actual file it must be a single line, otherwise Elasticsearch fails to load it:

{"index": {"_id": "54b4324ee138239d8684aeb2"}}
{
  "domain": "biomedical",
  "name": "54b4324ee138239d8684aeb2",
  "year": 2010,
  "title": "Nanomechanical properties of modern and fossil bone",
  "authors": ["Sara E. Olesiak", "Matt Sponheimer", "Jaelyn J. Eberle", "Michelle L. Oyen"],
  "abstract": "Relatively little is known about how diagenetic processes affect ...",
  "url": "http://www.sciencedirect.com/science/article/pii/S0022283609014053",
  "text": "...",
  "summary": "...",
  "terms": [...],
  "entities": {...}
}
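
The two-line layout shown above can be produced with the standard json module. The following is a minimal sketch, not the code in prepare_elastic.py: the function name write_bulk_file is hypothetical, and reusing the "name" field as the "_id" is an assumption based on the example above.

```python
import json


def write_bulk_file(records, path):
    """Write `records` as action/source line pairs for the bulk API.

    Each record must carry a "name" field, which is reused as the "_id"
    (an assumption based on the example above). Calling json.dumps
    without `indent` keeps every object on a single line.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            # Action line: tells Elasticsearch to index the next line.
            fh.write(json.dumps({"index": {"_id": record["name"]}}) + "\n")
            # Source line: the document itself, serialized on one line.
            fh.write(json.dumps(record) + "\n")
```

Because json.dumps escapes newlines inside strings, a multi-line abstract or text field still ends up on one physical line, and the file ends with the trailing newline that the bulk format expects.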