neuml / txtmarker

Highlight text in documents
Apache License 2.0

Highlighter Can't Highlight All the Text in the Document #8

Open muazhari opened 2 years ago

muazhari commented 2 years ago

I tried to highlight an entire document using the list of sentences parsed by the txtai extractor pipeline, but not all of them were highlighted. Since the sentences come from the document itself, every one of them should be highlighted. Can anyone help me?


import re
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
highlights = [(None, re.escape(sent)) for sent in sent_list]
highlighter.highlight(infile, outfile, highlights)

davidmezzetti commented 2 years ago

Thank you for reporting the issue.

From the sound of it, you used txtai to extract text from a PDF document and are now trying to highlight that text, as in this example: https://github.com/neuml/txtmarker/blob/master/examples/02_Highlighting_with_Transformers.ipynb. Is that right?

It's hard to tell exactly, but it's possible that more needs to be done to clean control characters out of the extracted text.
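
As a rough illustration of that kind of cleanup, here is a minimal sketch that drops Unicode control and format characters and collapses whitespace before the sentences are turned into highlight patterns. The helper name `strip_control_chars` is hypothetical and not part of txtmarker or txtai, and it is only an assumption that stray characters of this kind are what breaks the match in this case.

```python
import unicodedata


def strip_control_chars(text):
    """Remove control/format characters and collapse whitespace (hypothetical helper)."""
    cleaned = []
    for ch in text:
        if ch.isspace():
            # Treat newlines, tabs, form feeds, etc. as a plain space
            cleaned.append(" ")
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Cc = control characters, Cf = format characters (e.g. zero-width space)
            continue
        else:
            cleaned.append(ch)
    # Collapse runs of spaces and trim both ends
    return " ".join("".join(cleaned).split())
```

The cleaned sentences could then be passed through `re.escape` as in the original snippet, e.g. `[(None, re.escape(strip_control_chars(sent))) for sent in sent_list]`.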

I've found the best way to debug issues like this is to start with a small piece of the result text (like a single word) and then iteratively add text until you reproduce a non-match that should have been a match. That will most likely expose the characters that need to be cleaned out to allow a full match.
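
The approach described above can be sketched roughly as follows. This is not how txtmarker matches text internally; it uses plain `re.search` against a string as a stand-in, and both `first_non_matching_word` and `document_text` are hypothetical names introduced only for illustration. The function grows the query one word at a time and reports the first word at which the match stops working.

```python
import re


def first_non_matching_word(sentence, document_text):
    """Return (index, word) of the first word that breaks the match, or None if all match."""
    words = sentence.split()
    for i in range(1, len(words) + 1):
        # Match the first i words literally, allowing any whitespace between them
        pattern = r"\s+".join(re.escape(w) for w in words[:i])
        if re.search(pattern, document_text) is None:
            return i - 1, words[i - 1]
    return None
```

Printing the returned word with `repr()` makes hidden characters visible. For example, a zero-width space shows up as `\u200b` even though the word looks normal on screen.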