lappsgrid-incubator / galaxy-paper-rank

Project home for the paper identification project for Galaxy.
Apache License 2.0

Sentence Segmentation #4

Open ksuderman opened 3 years ago

ksuderman commented 3 years ago

We will need to extract every sentence that uses the word "Galaxy" from an input document. This means we must be able to split the input document on sentence boundaries.

NLTK will be sufficient for testing and development, but it may not be efficient enough (in time or space) for large-scale production. Consider using Stanford CoreNLP, Apache OpenNLP, or something else as a standalone service for common tasks like tokenization and sentence splitting. The Lappsgrid can provide standalone Dockerized services for this that communicate via REST or AMQP.
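For testing and development, the extraction step might look roughly like the sketch below. It is only an illustration under a few assumptions: the function names `split_sentences` and `galaxy_sentences` are hypothetical, the match is case-sensitive on the whole word "Galaxy", and a naive regex splitter is used as a fallback when NLTK or its Punkt data is not installed.

```python
import re

# Assumption: we only want the capitalized, whole-word form "Galaxy".
GALAXY = re.compile(r"\bGalaxy\b")


def split_sentences(text):
    """Split text into sentences, preferring NLTK when it is available."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # NLTK or its Punkt data is missing: fall back to a naive split
        # on whitespace that follows '.', '!' or '?'. This is much less
        # accurate than Punkt (e.g. it breaks on abbreviations).
        parts = re.split(r"(?<=[.!?])\s+", text)
        return [p.strip() for p in parts if p.strip()]


def galaxy_sentences(text):
    """Return the sentences in text that mention the word Galaxy."""
    return [s for s in split_sentences(text) if GALAXY.search(s)]


if __name__ == "__main__":
    sample = (
        "We ran the workflow in Galaxy. The results were stored locally. "
        "Is Galaxy open source? Yes."
    )
    for sentence in galaxy_sentences(sample):
        print(sentence)
```

Keeping the splitter behind a single function means it could later be swapped for a call to a standalone service without changing the filtering code.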

nancyide commented 3 years ago

This task should come after we get the machine learning model up and running, as this information will be used mainly to help Dave determine how accurate the model is on new data he encounters.