nicolay-r / AREkit

Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML
https://nicolay-r.github.io/arekit-page/
MIT License

Pipelines -- Batching sentences in document parser [ARElight backlog] #535

Closed nicolay-r closed 9 months ago

nicolay-r commented 11 months ago

This originates from the NER application (https://github.com/nicolay-r/ARElight/issues/118). The snippet below shows that we apply the text processing pipeline separately to each sentence (text_parser.run). To improve document processing performance, we need to switch from passing a single sentence to passing a list of sentences, i.e. to support batching.

https://github.com/nicolay-r/AREkit/blob/4c577cb52eb4aabd547c80f939bdf05edb908634/arekit/common/docs/parser.py#L19-L25
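
To make the idea concrete, here is a minimal sketch of the difference between per-sentence and batched parsing. It is not the AREkit implementation: `iter_batches`, `parse_document` and `whitespace_tokenizer` are hypothetical names made up for illustration, and the "pipeline" is assumed to be any callable that takes a list of sentences and returns one result per sentence.

```python
from typing import Callable, Iterable, Iterator, List

Sentence = str
Tokens = List[str]
# Hypothetical stand-in for a text processing pipeline (NOT the AREkit API):
# it receives a batch of sentences and returns one result per sentence.
BatchPipeline = Callable[[List[Sentence]], List[Tokens]]


def iter_batches(sentences: Iterable[Sentence], batch_size: int) -> Iterator[List[Sentence]]:
    """Yield consecutive lists of at most `batch_size` sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[Sentence] = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Last, possibly incomplete batch.
        yield batch


def parse_document(sentences: Iterable[Sentence],
                   pipeline: BatchPipeline,
                   batch_size: int = 1) -> List[Tokens]:
    """Run the pipeline once per batch instead of once per sentence."""
    parsed: List[Tokens] = []
    for batch in iter_batches(sentences, batch_size):
        results = pipeline(batch)
        if len(results) != len(batch):
            raise ValueError("pipeline must return one result per sentence")
        parsed.extend(results)
    return parsed


def whitespace_tokenizer(batch: List[Sentence]) -> List[Tokens]:
    """Toy pipeline used only for this illustration."""
    return [sentence.split() for sentence in batch]


if __name__ == "__main__":
    doc = ["First sentence .", "Second one .", "Third ."]
    print(parse_document(doc, whitespace_tokenizer, batch_size=2))
```

With `batch_size=1` this behaves like the current per-sentence loop; larger values reduce the number of pipeline calls, which is where a model-based step such as NER would be expected to benefit.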

nicolay-r commented 10 months ago

Proposal for the pipeline core refactoring:

[image: diagram of the proposed pipeline core refactoring]