Open BobBorges opened 1 month ago
Could we just use this off-the-shelf model? https://huggingface.co/KB/bert-base-swedish-cased-ner It detects named entities and their subclasses, such as person.
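from transformers import pipeline

# Assumed setup, following the usage shown on the model card.
ner = pipeline("ner", model="KB/bert-base-swedish-cased-ner", tokenizer="KB/bert-base-swedish-cased-ner")

def is_signature_block(elem, max_len=100):
    # Collapse runs of whitespace; treat a missing text node as empty.
    text = " ".join((elem.text or "").split())
    # Long paragraphs are assumed to be body text, not signatures.
    if len(text) >= max_len:
        return False
    named_entities = ner(text)
    persons = [ne for ne in named_entities if ne.get("entity") == "PER"]
    return len(persons) >= 1

# Output format of ner(text):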
# [ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'PER' },
Stockholm den <nn> <month> <year>
seems to start the signature block, and nothing else, in every single motion.
Doesn't necessarily come as the first thing in the paragraph.
And it's susceptible to OCR errors.
And not always the case. However, we can start with either approach if we want, and try and improve on that iteratively.
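For reference, the pattern-based approach could be sketched roughly as below. This is only a starting point under stated assumptions: the month names are assumed to be spelled out in full, `find_signature_start` is a hypothetical helper name, and the pattern uses `search` rather than `match` so that the date line is found even when it is not the first thing in the paragraph. It makes no attempt to handle OCR errors.

```python
import re

# Swedish month names; assumed to be spelled out in full in the motions.
MONTHS = (
    "januari|februari|mars|april|maj|juni|juli|"
    "augusti|september|oktober|november|december"
)

# "Stockholm den <nn> <month> <year>", allowing flexible whitespace.
SIGNATURE_START = re.compile(
    rf"Stockholm\s+den\s+(\d{{1,2}})\s+({MONTHS})\s+(\d{{4}})",
    re.IGNORECASE,
)


def find_signature_start(text):
    """Return the offset where the date line starts, or None if absent."""
    match = SIGNATURE_START.search(text)
    return match.start() if match else None
```

A paragraph with no such date line returns `None`, so the NER check could still be used as a fallback for those cases.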
As you know, I generally prefer ML approaches since they are more robust to small random variations in formats, OCR errors etc. But this seems like an issue where we get 90%+ accuracy pretty easily regardless of the approach.
Doesn't necessarily come as the first thing in the paragraph.
No, but it should!
I'd like to see the paragraph end before "Stockholm den ...", and the rest of it "div"ed off as the signature block.
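A minimal sketch of that split, using the standard library's `xml.etree.ElementTree`. The function name `split_signature`, the `<div type="signature">` markup, and the loose date pattern are all assumptions for illustration, not the project's actual schema. It also assumes the paragraph holds plain text with no child elements.

```python
import re
import xml.etree.ElementTree as ET

# Loose pattern for "Stockholm den <nn> <month> <year>".
DATE_LINE = re.compile(r"Stockholm\s+den\s+\d{1,2}\s+\w+\s+\d{4}")


def split_signature(parent, elem):
    """Move everything from the date line onward into a new sibling <div>."""
    text = elem.text or ""
    match = DATE_LINE.search(text)
    if match is None:
        return None
    # The paragraph keeps only what precedes "Stockholm den ...".
    elem.text = text[:match.start()].rstrip()
    # Hypothetical markup for the signature block.
    div = ET.Element("div", {"type": "signature"})
    div.text = text[match.start():].strip()
    # Keep any trailing text after the paragraph in document order.
    div.tail = elem.tail
    elem.tail = None
    parent.insert(list(parent).index(elem) + 1, div)
    return div
```

ElementTree elements do not know their parent, which is why the parent is passed in explicitly; with lxml, `elem.getparent()` and `elem.addnext(div)` would do the same job.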