chang-zhao opened 2 years ago
I search HTML documents and often get HTML tags, or pieces of them, in the highlighter results, like:
Абсолютная истина.<br />Не-противостояние и растворение напряжений</h1...Абсолютная истина</h4...В этом мире нет
Most often, these tags or tag pieces are:
</p, <br />, </h1, </h2 and so on, </em></strong>
I switched to using SentenceFragmenter (which is also more suitable for my needs):
results.fragmenter = highlight.SentenceFragmenter(maxchars=240, sentencechars='</>.!?', charlimit=None)
so it should filter all that out, but it doesn't work. I even tried to escape those characters like this:
sentencechars='\<\/\>.!?'
Nope. It seems I will have to resort to additional search and replace.
Here's how I clean it: https://gist.github.com/chang-zhao/2a18dcab0b40e3011decefb65c91b4ca
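For readers who can't open the gist, here is a minimal sketch of this kind of post-processing. It is not the code from the gist: `clean_fragment` and its `keep` parameter are hypothetical names, and the regular expressions only assume the fragments look like the examples above (complete tags such as `<br />`, plus closing tags cut off before the `>`).

```python
import re

# A complete tag such as <br />, </h1> or <em>; group 1 is the tag name.
_FULL_TAG = re.compile(r'</?([A-Za-z][A-Za-z0-9]*)[^<>]*>')
# A closing tag that was cut off before its ">", e.g. "</h1" or "</p".
# The lookahead skips closing tags that are actually complete.
_CUT_CLOSING_TAG = re.compile(r'</[A-Za-z][A-Za-z0-9]*(?![^<>]*>)')


def clean_fragment(text, keep=()):
    """Strip HTML tags and truncated closing tags from a highlight fragment.

    `keep` is an assumed convenience: tag names (e.g. "b") that must
    survive, such as the markup added by the highlighter's own formatter.
    """
    kept = {name.lower() for name in keep}

    def _drop(match):
        # Leave whitelisted tags alone, replace everything else with a space.
        return match.group(0) if match.group(1).lower() in kept else ' '

    text = _FULL_TAG.sub(_drop, text)
    text = _CUT_CLOSING_TAG.sub(' ', text)
    # Collapse the whitespace left behind by the removed tags.
    return ' '.join(text.split())


if __name__ == '__main__':
    sample = ('Абсолютная истина.<br />Не-противостояние и растворение '
              'напряжений</h1...Абсолютная истина</h4...В этом мире нет')
    print(clean_fragment(sample))
```

If the formatter wraps matched terms in HTML (for example `<b>` tags), those would be stripped as well unless the tag name is passed in `keep`, e.g. `clean_fragment(fragment, keep=('b',))`. A truncated opening tag such as `<p` is deliberately left alone here, since a bare `<` followed by letters can also be ordinary text.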