tomaarsen / SpanMarkerNER

SpanMarker for Named Entity Recognition
https://tomaarsen.github.io/SpanMarkerNER/
Apache License 2.0
391 stars, 27 forks

should return same no. of list as of inputs #35

Closed: mallapraveen closed this issue 11 months ago

mallapraveen commented 12 months ago

When the inputs are all the same, predict should return one list per input. Instead, it returns a single list.

    from span_marker import SpanMarkerModel
    inputs = ["Unknown", "Unknown", "Unknown"]
    model = SpanMarkerModel.from_pretrained("model_name")
    model.predict(inputs)

Output: []

Expected Output: [[], [], []]

tomaarsen commented 12 months ago

Hello! When the input is a list of strings without spaces, like in ["Unknown", "Unknown", "Unknown"], then the model thinks that the text is an already tokenized sentence, like ["My", "name", "is", "Tom", "."], instead of three separate texts. If you try to provide the model with a list of strings that do have spaces, like ["My name is Tom", "I'm from the Netherlands", "I work at Argilla"] then it should give the expected output with three lists. I hope that clarifies it somewhat.
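As a rough illustration of the rule described above, the check can be pictured as follows. This is only a sketch: `treated_as_one_sentence` is a hypothetical helper, not SpanMarker's actual code, and the library's real check may differ in its details.

```python
from typing import List


def treated_as_one_sentence(inputs: List[str]) -> bool:
    """Hypothetical helper: True when no string contains a space,
    i.e. the list looks like the words of one pre-tokenized sentence."""
    return all(" " not in text for text in inputs)


# Examples taken from the comment above.
print(treated_as_one_sentence(["Unknown", "Unknown", "Unknown"]))  # True
print(treated_as_one_sentence(["My", "name", "is", "Tom", "."]))  # True
print(treated_as_one_sentence(
    ["My name is Tom", "I'm from the Netherlands", "I work at Argilla"]
))  # False
```

Under this reading, three copies of "Unknown" are indistinguishable from a three-word sentence, which is why only one list comes back.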

mallapraveen commented 12 months ago

Yes @tomaarsen, it does. I am doing batch processing where I send an entire DataFrame column (I replace NaN with "Unknown"). When the column starts with two "Unknown" values, it throws an error. For example, this list: ["Unknown", "Unknown", "Unknown", "Sentence 1", "Sentence 2"].

tomaarsen commented 11 months ago

That makes sense, I should look into a better solution for that. Until then, perhaps you can filter out nan rather than replacing it with "Unknown"? E.g. https://stackoverflow.com/questions/22551403/python-pandas-filtering-out-nan-from-a-data-selection-of-a-column-of-strings
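A minimal sketch of that workaround, written without pandas so that it runs on its own. `predict_skipping_missing` and the stand-in `fake_predict` are hypothetical names; in practice you would pass `model.predict` instead. The idea is to drop missing values before prediction and put an empty list back in their place, so the result still lines up with the original rows.

```python
import math
from typing import Any, Callable, List, Optional


def is_missing(value: Any) -> bool:
    """True for None and float NaN, the usual missing markers in a column."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def predict_skipping_missing(
    predict: Callable[[List[str]], List[list]],
    values: List[Optional[str]],
) -> List[list]:
    """Run `predict` on the non-missing texts only, then realign the output."""
    kept_positions = [i for i, v in enumerate(values) if not is_missing(v)]
    texts = [values[i] for i in kept_positions]
    results: List[list] = [[] for _ in values]  # missing rows keep an empty list
    if texts:
        for position, entities in zip(kept_positions, predict(texts)):
            results[position] = entities
    return results


def fake_predict(texts: List[str]) -> List[list]:
    """Stand-in for model.predict: one (fake) entity list per text."""
    return [[{"span": text}] for text in texts]


column = [float("nan"), None, "Sentence 1", "Sentence 2"]
print(predict_skipping_missing(fake_predict, column))
```

With pandas, `Series.dropna()` performs the filtering step; the realignment is only needed if you want one result per original row. The texts that remain are still subject to the single-word behavior described earlier in this thread.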

mallapraveen commented 11 months ago

Thanks, @tomaarsen. Waiting for the fix.

tomaarsen commented 11 months ago

["Unknown", "Unknown", "Unknown", "Sentence 1", "Sentence 2"] should work now! However, if all sentences are individual words, then it will still consider it all one sentence. I don't picture a great way to avoid that. See also #39 for details.
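If a batch can consist entirely of single-word texts, one possible workaround is to fall back to one call per text in that case, since a plain string is treated as a single text. A minimal sketch, where `predict_one_list_per_text` and `fake_predict` are hypothetical names and the stand-in only mimics the behavior described in this thread:

```python
from typing import Callable, List, Union


def predict_one_list_per_text(predict: Callable, texts: List[str]) -> List[list]:
    """Return one entity list per text, even if every text is a single word."""
    if all(" " not in text for text in texts):
        # The batch would be read as one pre-tokenized sentence,
        # so predict each text separately (slower, but unambiguous).
        return [predict(text) for text in texts]
    return predict(texts)


def fake_predict(inputs: Union[str, List[str]]) -> list:
    """Stand-in mimicking the thread: no entities are ever found."""
    if isinstance(inputs, str) or all(" " not in text for text in inputs):
        return []  # a single sentence yields one flat list
    return [[] for _ in inputs]  # several sentences yield one list each


print(fake_predict(["Unknown", "Unknown", "Unknown"]))  # []
print(predict_one_list_per_text(fake_predict, ["Unknown", "Unknown", "Unknown"]))  # [[], [], []]
```

The per-text fallback gives up batching, so it is best kept for the all-single-word case only.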

mallapraveen commented 11 months ago

@tomaarsen Thanks for the fix.