Open T-Almeida opened 2 weeks ago
I believe the bug is related to document splitting, specifically data.py
line 222:
"text": doc['text'][low_offset: high_offset],
I changed the code so that each chunk's text field contains only the text within its respective offsets. I have not verified this, but when the results text is constructed, it appears to take the text field from the first chunk only.
This corresponds to line 42 in inference.py:
text = documents[doc][0]["text"]
which takes only the first chunk's text as the text of the whole document.
We can fix this either in inference.py or in data.py.
I think it's better to fix it in inference.py.
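To make the suspected failure mode concrete, here is a minimal sketch of what an inference-side fix could look like. It is not the repository's code: `split_document`, `rebuild_text`, the `window`/`stride` parameters, and the per-chunk `"offset"` field are all hypothetical names I am assuming for illustration; only the `"text"` slice and the `documents[doc][0]["text"]` lookup come from the lines quoted above.

```python
def split_document(doc, window, stride):
    """Hypothetical splitter: each chunk keeps only its own slice of the text."""
    chunks = []
    for low_offset in range(0, len(doc["text"]), stride):
        high_offset = min(low_offset + window, len(doc["text"]))
        chunks.append({
            "offset": low_offset,  # assumed field, not confirmed in data.py
            "text": doc["text"][low_offset:high_offset],
        })
        if high_offset == len(doc["text"]):
            break  # the last window already reaches the end of the text
    return chunks


def rebuild_text(chunks):
    """Stitch chunk texts back together, skipping any overlapping prefix."""
    parts = []
    end = 0  # number of characters of the original text rebuilt so far
    for chunk in sorted(chunks, key=lambda c: c["offset"]):
        skip = max(0, end - chunk["offset"])  # characters already covered
        parts.append(chunk["text"][skip:])
        end = max(end, chunk["offset"] + len(chunk["text"]))
    return "".join(parts)


doc = {"text": "abcdefghijklmnopqrstuvwxyz"}
documents = {"doc1": split_document(doc, window=10, stride=6)}

# Current behaviour: only the first chunk's slice is used as the document text.
first_chunk_text = documents["doc1"][0]["text"]

# Sketch of the fix: rebuild the full text from all chunks.
full_text = rebuild_text(documents["doc1"])
```

With this toy input, `first_chunk_text` holds only the first ten characters, so any entity whose span starts beyond that point would have no text to slice from, while `full_text` matches the original document. If the real chunks do not carry their start offset, the same idea would need whatever offset information data.py actually stores.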
I believe merge #2 introduced a bug where some entities are missing their associated text in the output file. Here's a comparison:
New output file:
Old output file (from commit 11da870121cfa398babd5238d2ae698e6caedf11):