medvidov / IMaSC



IMaSC

Intelligent Mission and Scientific Instrument Classification: applying novel NLP approaches to improve information extraction from scientific papers. Part of the Foundry A-Team Studies.

The Data

Available datasets can be found in the data directory. The microwave_limb_sounder dataset is a dump of an Elasticsearch index containing documents and their parsed text (PDFMiner was used to extract the text from the PDF documents). The dataset also contains some, but not all, of the source PDFs: there are 1109 JSON documents but only 604 PDFs. The PDFs could be run through an alternative means of text extraction, if desired, to generate new machine-readable data for use in modeling.
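Since only 604 of the 1109 JSON documents have a matching source PDF, it can be useful to check which documents a re-extraction could cover. A minimal sketch, assuming JSON and PDF files sit side by side and share a filename stem (the repository's actual layout may differ):

```python
from pathlib import Path

def docs_with_pdfs(data_dir):
    """Return the filename stems of JSON documents that also have a source PDF.

    Assumes JSON and PDF files share a stem (e.g. doc1.json / doc1.pdf);
    adjust to match the actual layout of data/microwave_limb_sounder.
    """
    data_dir = Path(data_dir)
    json_stems = {p.stem for p in data_dir.glob("*.json")}
    pdf_stems = {p.stem for p in data_dir.glob("*.pdf")}
    return sorted(json_stems & pdf_stems)
```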

Generating datasets

To generate training, validation, and testing sets, run `parser.py` with default inputs. This will generate the three files `training_set.jsonl`, `validation_set.jsonl`, and `testing_set.jsonl` in the `data/microwave_limb_sounder` directory.
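Each of these files is in JSON Lines format: one JSON object per line. A minimal reader for inspection or downstream scripts (the fields inside each record are not documented here, so check one line of `training_set.jsonl` for the actual schema):

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. count the records in the generated training split:
# sum(1 for _ in read_jsonl("data/microwave_limb_sounder/training_set.jsonl"))
```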

Using Prodigy

Prodigy will allow you to annotate your datasets. Please note that the Prodigy wheel installation path in requirements.txt is currently specific to my laptop; update it to point at your own Prodigy wheel before installing.

To annotate a dataset:

1. Open `localhost:8080` in your browser's address bar; Prodigy should be running on port 8080 by default.
2. Highlight a span of text and select a label. Prodigy will automatically apply the annotation.
3. To remove an incorrect annotation, click the "X".
4. Once every entity in the text is labeled, click the green check mark. If a piece of text is not appropriate for annotation, click the grey no symbol to skip it.

What to label

Currently, IMaSC supports labeling of scientific instruments (e.g. MLS) and the spacecraft (e.g. the Aura satellite) that carry them. Using the directions above, label all instances of scientific instruments and spacecraft in the text.
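Prodigy represents NER annotations as character-offset spans over the raw text. A hand-written example of what an accepted annotation could look like (the label names INSTRUMENT and SPACECRAFT are assumptions based on the two entity types described above, not necessarily the labels this project uses):

```python
# Hypothetical Prodigy-style NER record: each span is a pair of character
# offsets into "text" plus an entity label.
example = {
    "text": "MLS flies on the Aura satellite.",
    "spans": [
        {"start": 0, "end": 3, "label": "INSTRUMENT"},
        {"start": 17, "end": 31, "label": "SPACECRAFT"},
    ],
}

for span in example["spans"]:
    print(example["text"][span["start"]:span["end"]], span["label"])
# prints:
# MLS INSTRUMENT
# Aura satellite SPACECRAFT
```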

Handling Uncertainties While Annotating

While reading through the data, look out for acronyms, for words containing "Satellite" or something similar (likely spacecraft), and for words ending in "meter" or something similar (likely instruments).
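These heuristics can be sketched as a quick pre-filter for tokens worth checking (this helper is purely illustrative and not part of the repository; the cue words are just the ones mentioned above):

```python
def candidate_type(token):
    """Rough heuristic for flagging tokens worth looking up in annotations.md.

    Illustrative only: "satellite" suggests a spacecraft, an ending like
    "meter" or "sounder" suggests an instrument, and a short all-caps
    token is often an acronym worth a closer look.
    """
    lower = token.lower()
    if "satellite" in lower:
        return "possible spacecraft"
    if lower.endswith(("meter", "sounder")):
        return "possible instrument"
    if token.isupper() and 2 <= len(token) <= 6:
        return "possible acronym"
    return None

# candidate_type("radiometer")     -> "possible instrument"
# candidate_type("Aura satellite") -> "possible spacecraft"
# candidate_type("MLS")            -> "possible acronym"
```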

When you identify a potential token, check the annotations.md file to see whether it is already listed. If the term is listed there as a spacecraft or instrument, annotate it accordingly. If it is listed as something else ("model" or "other"), ignore it.

If you think you've found a token but it isn't in annotations.md, try to classify it on your own based on the context in which it appears.

Training

Train the model with the following command: `prodigy ner.batch-train train_imasc en_core_web_sm -n 100`. To train a model with only one entity type, run `prodigy ner.batch-train train_imasc en_core_web_sm -n 100 -l ENTITY`. A flowchart for how to train your specific model can be found here. About 4000 annotations are needed to train the model.

The API (Application Programming Interface)

Included in this repository is a basic API that runs the model on user input and displays a list of tokens the model found, along with coverage data. To use the API, run `python api.py` from the api_stuff directory. Your terminal will print a link you can open to access the API in your default browser.

In your browser, either enter text or drop a PDF (a recommended PDF is provided in this repository) and click “submit.”

Versioning

Semantic versioning is used for this project. If you are contributing to this project, please use semantic versioning guidelines when submitting your pull request.

Contributing

Please use the issue tracker to report any unexpected behavior or desired features.

If you would like to contribute to development, please fork the repository and submit a pull request.

Tests

When contributing, please run all existing unit tests, and add new tests as needed when adding new functionality. To run the unit tests, use pytest:

```shell
python3 -m pytest --cov=IMaSC
```

License

This project is licensed under the Apache 2.0 license.