mynlp / jigg

Pipeline framework for easy natural language processing
Apache License 2.0

Sentence and document level parallelization #1

Closed hiroshinoji closed 8 years ago

hiroshinoji commented 9 years ago

The SentenceAnnotator trait is designed to abstract sentence-level parallelization for all subclass annotators, but no parallelization is currently implemented. It is also unclear whether the currently implemented components, such as the CCG parser, can actually be parallelized (by calling the newSentenceAnnotation method concurrently). We may need to define rules that a component must follow to enable sentence-level parallelization. CoreNLP's SentenceAnnotator class also handles sentence-level parallelization in its annotate method, and should be consulted when deciding on the rules and implementation details.

Also, some annotators, such as KNPAnnotator with document-level anaphora resolution, require document-level parallelization rather than sentence-level parallelization. Such classes should probably inherit a separate trait, DocumentAnnotator, which abstracts document-level processing.
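
A rough sketch of how the two traits could be split, to make the proposal concrete. This is not jigg's actual code: the `Sentence` and `Document` case classes, the `Annotator` base trait, the method name `newDocumentAnnotation`, and both toy annotators are hypothetical, and jigg itself works on XML nodes rather than case classes. The point is only that the traversal lives in the trait, so a component implements just the per-sentence or per-document step.

```scala
// Hypothetical, simplified data model; jigg itself annotates XML nodes.
final case class Sentence(id: Int, text: String,
                          annotations: Map[String, String] = Map.empty)
final case class Document(id: Int, sentences: Vector[Sentence],
                          annotations: Map[String, String] = Map.empty)

// Hypothetical common interface for all annotators.
trait Annotator {
  def annotate(doc: Document): Document
}

// Sentence-level: a component only says how to annotate ONE sentence.
// The traversal is defined here, so parallelism could be added in one place.
trait SentenceAnnotator extends Annotator {
  def newSentenceAnnotation(sentence: Sentence): Sentence

  final def annotate(doc: Document): Document =
    doc.copy(sentences = doc.sentences.map(newSentenceAnnotation))
}

// Document-level: a component sees the whole document at once
// (e.g. for anaphora across sentences), so only documents can be
// processed independently of each other.
trait DocumentAnnotator extends Annotator {
  def newDocumentAnnotation(doc: Document): Document // hypothetical name

  final def annotate(doc: Document): Document = newDocumentAnnotation(doc)
}

// Toy sentence-level annotator: records the character length of each sentence.
object LengthAnnotator extends SentenceAnnotator {
  def newSentenceAnnotation(s: Sentence): Sentence =
    s.copy(annotations = s.annotations + ("length" -> s.text.length.toString))
}

// Toy document-level annotator: needs every sentence of the document.
object SentenceCountAnnotator extends DocumentAnnotator {
  def newDocumentAnnotation(doc: Document): Document =
    doc.copy(annotations =
      doc.annotations + ("numSentences" -> doc.sentences.size.toString))
}
```

With this shape, a rule such as "newSentenceAnnotation must not touch shared mutable state" is all a sentence-level component would need to follow before the trait's traversal could be made parallel.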

hiroshinoji commented 8 years ago

Sentence-level parallelization for the currently supported annotators is implemented in d3284c83df2bdde6e659b5baafde07835e7b515e.

The changes are small: each sentence-level annotator now modifies the elements in <sentences> in parallel using Scala's par. As long as every annotator is implemented in a thread-safe way, annotation runs in parallel.
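
For illustration, here is a minimal sketch of the same idea, not the code from the commit. With parallel collections the core is roughly `sentences.par.map(f)`; because `.par` lives in a separate module on Scala 2.13 and later, this sketch swaps in `Future.traverse` on a fixed thread pool, which needs only the standard library. The `Sentence` case class, `WhitespaceTokenizer`, `ParallelAnnotation.annotateAll`, and the way `nThreads` is mapped to the pool size are all assumptions made for the example.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical, simplified stand-in for one element of <sentences>.
final case class Sentence(id: Int, text: String,
                          tokens: Vector[String] = Vector.empty)

// A thread-safe annotator: it holds no shared mutable state, and the
// result depends only on the sentence passed in.
object WhitespaceTokenizer {
  def newSentenceAnnotation(s: Sentence): Sentence =
    s.copy(tokens = s.text.split("\\s+").filter(_.nonEmpty).toVector)
}

object ParallelAnnotation {
  // Applies f to every sentence on at most nThreads worker threads.
  // Future.traverse returns the results in the input order.
  def annotateAll(sentences: Vector[Sentence], nThreads: Int)
                 (f: Sentence => Sentence): Vector[Sentence] = {
    val pool = Executors.newFixedThreadPool(nThreads)
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(pool)
    try Await.result(Future.traverse(sentences)(s => Future(f(s))), Duration.Inf)
    finally pool.shutdown() // always release the worker threads
  }
}
```

Calling `ParallelAnnotation.annotateAll(sentences, 4)(WhitespaceTokenizer.newSentenceAnnotation)` gives the same result as a plain `sentences.map(...)`, only computed concurrently. An annotator that wrote to a shared buffer or a single external process inside `f` would not be safe to use this way.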

hiroshinoji commented 8 years ago

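A simple benchmark with knp on the first 100 lines of sentences.txt, first with the default settings and then with -nThreads 1: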

$ java -Xmx4g -cp "target/jigg-assembly-0.4.jar"  jigg.pipeline.Pipeline -annotators ssplit,juman,knp -file =(head -n 100 sentences.txt)
Annotating /tmp/zshK6dtn1 with ssplit, juman, knp {
  ssplit:  [0.1 sec]
  juman:  [0.5 sec]
  knp:  [16.8 sec]
} [18.9 sec]
Writing to /tmp/zshK6dtn1.xml... done [0.3 sec]
$
$ java -Xmx4g -cp "target/jigg-assembly-0.4.jar"  jigg.pipeline.Pipeline -annotators ssplit,juman,knp -file =(head -n 100 sentences.txt) -nThreads 1
Annotating /tmp/zshLijMgI with ssplit, juman, knp {  # nThreads=1 means no parallel annotation
  ssplit:  [0.0 sec]
  juman:  [0.4 sec]
  knp:  [31.9 sec]
} [34.0 sec]
Writing to /tmp/zshLijMgI.xml... done [0.2 sec]

hiroshinoji commented 8 years ago

Probably the only remaining issue is the one related to #29, which concerns the documentation for KNP.