I have created a pipeline like so:

```python
for name, proc in self.model.pipeline:
    stream2 = proc.pipe(stream2)
```
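For context, spaCy-style components chain lazily: each component's `pipe()` wraps the incoming generator, so documents stream through the whole pipeline one at a time rather than being materialised between stages. A minimal pure-Python sketch of that pattern (the component functions here are made up for illustration):

```python
def uppercase(stream):
    """Toy component: upper-cases each 'doc' as it streams through."""
    for doc in stream:
        yield doc.upper()

def exclaim(stream):
    """Toy component: appends an exclamation mark to each 'doc'."""
    for doc in stream:
        yield doc + "!"

pipeline = [("uppercase", uppercase), ("exclaim", exclaim)]

stream = iter(["hello", "world"])
# Mirrors the loop above: each component's pipe wraps the previous generator.
for name, proc in pipeline:
    stream = proc(stream)

print(list(stream))  # -> ['HELLO!', 'WORLD!']
```

This is also why the `pipe` method below checks whether its input is already a generator before wrapping it.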
I call `pipe()` on a stream of documents; the SpanMarker model in this pipeline performs inference on each doc in the stream as if it were a single sentence:
```python
def pipe(self, stream, batch_size=128):
    """Fill `doc.ents` and `span.label_` using the chosen SpanMarker model."""
    if isinstance(stream, str):
        stream = [stream]

    if not isinstance(stream, types.GeneratorType):
        stream = self.nlp.pipe(stream, batch_size=batch_size)

    for docs in minibatch(stream, size=batch_size):
        inputs = [[token.text if not token.is_space else "" for token in doc] for doc in docs]

        # use document-level context in the inference if the model was also trained that way
        if self.model.config.trained_with_document_context:
            inputs = self.convert_inputs_to_dataset(inputs)

        entities_list = self.model.predict(inputs, batch_size=self.batch_size)
        for doc, entities in zip(docs, entities_list):
            ents = []
            for entity in entities:
                start = entity["word_start_index"]
                end = entity["word_end_index"]
                span = doc[start:end]
                span.label_ = entity["label"]
                ents.append(span)

            self.set_ents(doc, ents)
            yield doc
```
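To see why whole-document inference truncates, here is a toy predictor (hypothetical, not the real SpanMarker model) that, like an encoder with a fixed maximum sequence length, only "sees" the first `max_length` words of its input:

```python
MAX_LENGTH = 8  # stand-in for the encoder's max sequence length

def toy_predict(words, max_length=MAX_LENGTH):
    """Toy NER: tags every occurrence of 'Amsterdam', but only within the
    first `max_length` words -- mimicking sequence truncation."""
    entities = []
    for i, word in enumerate(words[:max_length]):
        if word == "Amsterdam":
            entities.append(
                {"word_start_index": i, "word_end_index": i + 1, "label": "LOC"}
            )
    return entities

doc = "I flew to Amsterdam . Later I went back to Amsterdam again .".split()
print(toy_predict(doc))
# Only the first mention is tagged; the second sits past the cutoff.
```

Feeding the whole doc as one sequence means everything after the cutoff is silently dropped, which matches the partially-annotated documents described below.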
So it reaches the max sequence length pretty quickly and only annotates the first part of each document. This is different from the behaviour I expected, where `__call__()` breaks the doc down into sentences and runs inference on each sentence individually.
You're right - well spotted. `pipe` should definitely mirror the behaviour of `__call__`, except with a list of texts. I'll throw this on my TODO list.
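A sentence-aware `pipe` could follow the shape `__call__` is described as having: run the model per sentence, then shift each predicted word index by the sentence's start offset so the spans land back at document level. A hedged sketch with plain Python lists standing in for spaCy docs (`toy_predict` is again a hypothetical per-sentence model, not the real API):

```python
def toy_predict(words):
    """Hypothetical per-sentence model: tags 'Amsterdam' as LOC."""
    return [
        {"word_start_index": i, "word_end_index": i + 1, "label": "LOC"}
        for i, w in enumerate(words)
        if w == "Amsterdam"
    ]

def pipe_by_sentence(doc_sentences):
    """doc_sentences: one document as a list of sentences, each a list of words.
    Predicts per sentence, then maps word indices back to doc level."""
    entities, offset = [], 0
    for sentence in doc_sentences:
        for ent in toy_predict(sentence):
            entities.append(
                {
                    "word_start_index": ent["word_start_index"] + offset,
                    "word_end_index": ent["word_end_index"] + offset,
                    "label": ent["label"],
                }
            )
        offset += len(sentence)  # shift subsequent indices past this sentence
    return entities

doc = [
    "I flew to Amsterdam .".split(),
    "Later I went back to Amsterdam again .".split(),
]
print(pipe_by_sentence(doc))  # both mentions found, at doc-level indices 3 and 10
```

Because each sentence is well under the max sequence length, nothing gets truncated, and the offset arithmetic keeps `doc[start:end]` valid for `set_ents`.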