Code for processing xDD files. This is only useful if you have access to the xDD data.
Processing involves several steps, some of them done in this repository:
To see how to integrate all this processing, see the xdd-integration repository.
See https://github.com/lapps-xdd/xdd-docstructure.
Use the script ner.py in this repository, which requires spaCy to run.
$ pip install spacy==3.5.1
$ python -m spacy download en_core_web_sm
To run the script:
$ python ner.py --doc DIR1 --pos DIR2 --ner DIR3 [--limit N]
The input in DIR1 should contain the output files from the document structure parser. Part-of-speech data is written to DIR2 and named entities to DIR3. If --limit is given, no more than N files are processed.
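The overall shape of this step can be sketched as a loop over the input directory that writes one POS file and one NER file per input document. This is a minimal sketch, not the actual ner.py: the per-file analysis below is a placeholder, where the real script runs spaCy's en_core_web_sm pipeline, and all field names are illustrative assumptions.

```python
import json
from pathlib import Path


def process_documents(doc_dir, pos_dir, ner_dir, limit=None):
    """Sketch of the ner.py processing loop: read document-structure
    output from doc_dir, write POS data to pos_dir and entities to
    ner_dir, stopping after `limit` files if a limit is given."""
    pos_path, ner_path = Path(pos_dir), Path(ner_dir)
    pos_path.mkdir(parents=True, exist_ok=True)
    ner_path.mkdir(parents=True, exist_ok=True)
    files = sorted(Path(doc_dir).glob("*.json"))
    if limit is not None:
        files = files[:limit]
    for infile in files:
        text = infile.read_text()
        # Placeholder analysis; a real run would load a spaCy model
        # (nlp = spacy.load("en_core_web_sm")) and apply it to the text.
        pos_data = {"name": infile.stem, "tokens": len(text.split())}
        ner_data = {"name": infile.stem, "entities": []}
        (pos_path / infile.name).write_text(json.dumps(pos_data))
        (ner_path / infile.name).write_text(json.dumps(ner_data))
    return len(files)
```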
See https://github.com/lapps-xdd/xdd-terms.
Requires output from previous processing stages as well as a file with metadata.
$ python merge.py --scpa DIR1 --doc DIR2 --ner DIR3 --trm DIR4 --meta FILE --out DIR5 [--limit N]
For input we have ScienceParse results (DIR1), document parser results (DIR2), named entities (DIR3), and terms (DIR4), plus a metadata file. Output is written to DIR5. See merge.py for example usage.
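Conceptually, the merge combines the per-document results of the four processing stages with the shared metadata, keyed by document identifier. The sketch below takes dictionaries rather than directories, and its field names are illustrative assumptions, not the actual schema of merge.py.

```python
def merge_documents(scpa, doc, ner, trm, meta):
    """Sketch of the merge step: for every document in the document
    parser results, collect the matching entries from the other stages
    into a single record. Missing entries come out as None."""
    merged = {}
    for name in doc:
        merged[name] = {
            "name": name,
            "scienceparse": scpa.get(name),   # ScienceParse results
            "docstructure": doc.get(name),    # document parser results
            "entities": ner.get(name),        # named entities
            "terms": trm.get(name),           # extracted terms
            "metadata": meta.get(name),       # from the metadata file
        }
    return merged
```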
Created from the merged data with prepare_elastic.py:
$ python prepare_elastic.py -i DIR1 -o DIR2 [--domain DOMAIN] [--limit N]
Takes merged files from DIR1 and creates a file elastic.json in DIR2. The file contains pairs of lines as required by Elasticsearch's bulk format (the second line of the pair is shown spread over several lines for readability; in the actual file it must be a single line, otherwise Elasticsearch fails to load it):
{"index": {"_id": "54b4324ee138239d8684aeb2"}}
{
"domain": "biomedical",
"name": "54b4324ee138239d8684aeb2",
"year": 2010,
"title": "Nanomechanical properties of modern and fossil bone",
"authors": ["Sara E. Olesiak", "Matt Sponheimer", "Jaelyn J. Eberle", "Michelle L. Oyen"],
"abstract": "Relatively little is known about how diagenetic processes affect ...",
"url": "http://www.sciencedirect.com/science/article/pii/S0022283609014053",
"text": "...",
"summary": "...",
"terms": [...],
"entities": {...}
}
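Writing these pairs of lines can be sketched as below: for each document, an action line naming the `_id`, then the document itself, each serialized as exactly one line of JSON. This is a hedged illustration of the output format, not the actual prepare_elastic.py; the function name and the assumption that the `name` field supplies the `_id` are taken from the example above.

```python
import json


def write_bulk_file(documents, outfile):
    """Sketch: write documents in Elasticsearch bulk format, one
    action line plus one single-line source document per entry."""
    with open(outfile, "w") as fh:
        for doc in documents:
            # Action line: tells Elasticsearch the _id to index under.
            fh.write(json.dumps({"index": {"_id": doc["name"]}}) + "\n")
            # Source line: the whole document on a single line.
            fh.write(json.dumps(doc) + "\n")
```

A file in this shape can then be posted to an Elasticsearch `_bulk` endpoint.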