Open muazhari opened 2 years ago
Thank you for reporting the issue.
From the sound of it, you used txtai to extract text from a PDF document and are now trying to highlight text as in this example: https://github.com/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb. Is that right?
It's hard to tell exactly, but it's possible that more needs to be done to clean control characters out of the extracted text.
I've found the best way to debug issues like this is to start with a small piece of the result text (like a single word) and then iteratively add text until you reproduce a non-match that should have been a match. That will most likely expose the characters to clean out, which should then allow a full match.
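As a rough illustration of that debugging loop, here is a minimal sketch in plain Python. It does not call txtmarker or txtai: a simple substring check stands in for the highlighter's matching, and the names `clean`, `first_mismatch`, `page_text`, and `query` are hypothetical, chosen only for this example. The real matcher may behave differently, so treat this as a way to locate suspicious characters rather than a reproduction of the library's logic.

```python
import re
import unicodedata


def clean(text):
    """Drop control/format characters and collapse runs of whitespace.

    Assumption: the characters breaking the match are in a Unicode "C"
    category (control, format, ...), e.g. soft hyphens or zero-width spaces.
    """
    kept = []
    for ch in text:
        if ch.isspace():
            # Normalize every whitespace character (newline, tab, ...) to a space.
            kept.append(" ")
        elif unicodedata.category(ch)[0] != "C":
            kept.append(ch)
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def first_mismatch(query, page_text):
    """Grow the query one word at a time.

    Returns the first prefix that fails to match, or None if the whole
    query matches. Substring matching is a stand-in for the real highlighter.
    """
    words = query.split()
    for end in range(1, len(words) + 1):
        prefix = " ".join(words[:end])
        if prefix not in page_text:
            return prefix
    return None


# Hypothetical data: the extracted sentence carries a soft hyphen (U+00AD).
page_text = "Deep learning models are widely used."
query = "Deep learning mod\u00adels are widely used."

failing = first_mismatch(query, page_text)
if failing is not None:
    # repr() makes invisible characters such as '\xad' visible.
    print("First failing prefix:", repr(failing))

print("Matches after cleaning:", first_mismatch(clean(query), page_text) is None)
```

Printing the failing prefix with `repr()` is the useful part: it shows escape sequences for characters that are invisible in normal output, which tells you what the cleaning step needs to remove.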
I tried to highlight the entire document using the list of sentences parsed by the txtai extractor pipeline, but not all of them were highlighted. Since the sentences come from the document itself, every one of them should be highlighted. Can anyone help me?