tomaarsen / SpanMarkerNER

SpanMarker for Named Entity Recognition
https://tomaarsen.github.io/SpanMarkerNER/
Apache License 2.0

spaCy_integration `.pipe()` does not behave as expected #37

Closed: q-jackboylan closed this issue 1 year ago

q-jackboylan commented 1 year ago

I have created a pipeline like so:

self.model = spacy.load(
    "en_core_web_md",
    disable=["tagger", "lemmatizer", "attribute_ruler", "ner"],
)
self.model.add_pipe(
    "span_marker",
    config={"model": span_marker_model_path, "batch_size": batch_size},
)
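
For reference, a single document would normally go through the pipeline via `__call__` like this (a usage sketch; the example text is just an illustration):

doc = self.model("Amelia Earhart flew her plane across the Atlantic Ocean.")
print([(ent.text, ent.label_) for ent in doc.ents])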

In my case, however, I call `pipe()` on a stream of documents:

for name, proc in self.model.pipeline:
    stream2 = proc.pipe(stream2)
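
Looping over the components by hand is roughly what spaCy's own `Language.pipe` does for you; the more common way to stream texts through the whole pipeline would be something like this (a sketch; `texts` is assumed to be an iterable of strings):

for doc in self.model.pipe(texts, batch_size=batch_size):
    # each doc carries the annotations produced by the pipeline components
    print(doc.ents)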

The SpanMarker component in this pipeline runs inference on each doc in the stream as if the whole document were a single sentence:

    def pipe(self, stream, batch_size=128):
        """Fill `doc.ents` and `span.label_` using the chosen SpanMarker model."""
        if isinstance(stream, str):
            stream = [stream]

        if not isinstance(stream, types.GeneratorType):
            stream = self.nlp.pipe(stream, batch_size=batch_size)

        for docs in minibatch(stream, size=batch_size):
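            # each doc is flattened into a single word list, so the whole document is treated as one sentence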
            inputs = [[token.text if not token.is_space else "" for token in doc] for doc in docs]

            # use document-level context in the inference if the model was also trained that way
            if self.model.config.trained_with_document_context:
                inputs = self.convert_inputs_to_dataset(inputs)

            entities_list = self.model.predict(inputs, batch_size=self.batch_size)
            for doc, entities in zip(docs, entities_list):
                ents = []
                for entity in entities:
                    start = entity["word_start_index"]
                    end = entity["word_end_index"]
                    span = doc[start:end]
                    span.label_ = entity["label"]
                    ents.append(span)

                self.set_ents(doc, ents)

                yield doc

So each document hits the model's maximum sequence length pretty quickly, and only the first part of each document is annotated.

This is different from the behaviour I expected, where `__call__()` breaks the doc down into sentences and runs inference on each sentence individually.
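
For comparison, here is a minimal sketch of that per-sentence behaviour, assuming `doc.sents` is available (the parser is left enabled above) and reusing the `word_start_index`/`word_end_index` keys from the snippet above; this is not the library's actual `__call__` implementation:

def predict_per_sentence(doc, model):
    # build one word list per sentence instead of one per document
    sentences = list(doc.sents)
    inputs = [[token.text if not token.is_space else "" for token in sent] for sent in sentences]
    entities_list = model.predict(inputs)
    ents = []
    for sent, entities in zip(sentences, entities_list):
        for entity in entities:
            # shift sentence-local word indices back to document-level token indices
            start = sent.start + entity["word_start_index"]
            end = sent.start + entity["word_end_index"]
            span = doc[start:end]
            span.label_ = entity["label"]
            ents.append(span)
    return ents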

tomaarsen commented 1 year ago

You're right - well spotted. `pipe` should definitely mirror the behaviour of `__call__`, just with a list of texts instead of a single one. I'll throw this on my TODO list.
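
In other words, for the same inputs the two entry points should yield identical annotations; a quick sanity check could look like this (a sketch; `nlp` and `texts` are assumed):

docs_via_pipe = list(nlp.pipe(texts))
docs_via_call = [nlp(text) for text in texts]
for via_pipe, via_call in zip(docs_via_pipe, docs_via_call):
    # the entity annotations should match regardless of the entry point
    assert [(e.text, e.label_) for e in via_pipe.ents] == [(e.text, e.label_) for e in via_call.ents]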