swerik-project / riksdagen-motions

detect signature blocks in all motions #10

Open BobBorges opened 1 month ago

BobBorges commented 1 month ago
ninpnin commented 15 hours ago

Could we just use this off-the-shelf model? https://huggingface.co/KB/bert-base-swedish-cased-ner It detects named entities and labels them by type, such as person.

ninpnin commented 14 hours ago
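
from transformers import pipeline

# Assumed setup: load the model linked above as a Hugging Face "ner" pipeline
ner = pipeline("ner", model="KB/bert-base-swedish-cased-ner", tokenizer="KB/bert-base-swedish-cased-ner")

def is_signature_block(elem, max_len=100):
    # Collapse runs of whitespace; elem.text can be None for empty elements
    text = " ".join((elem.text or "").split())
    if not text or len(text) >= max_len:
        return False  # signature blocks are short
    # ner() returns a list of dicts, one per detected entity:
    # [ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'PER' },
    named_entities = ner(text)
    persons = [ne for ne in named_entities if ne.get("entity") == "PER"]
    return len(persons) >= 1
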
BobBorges commented 14 hours ago

A line of the form "Stockholm den <nn> <month> <year>" seems to start the signature block, and nothing else, in every single motion.
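
A minimal sketch of what a rule for that date line could look like. The pattern, the month list, and the name `find_signature_start` are illustrative assumptions, not code from the repository, and the pattern assumes correctly spelled, lowercase Swedish month names.

```python
import re

# Assumption: standard Swedish month names, lowercase and correctly spelled.
MONTHS = (
    "januari|februari|mars|april|maj|juni|juli|"
    "augusti|september|oktober|november|december"
)

# Hypothetical pattern for "Stockholm den <nn> <month> <year>".
SIGNATURE_START = re.compile(
    rf"Stockholm\s+den\s+(?P<day>\d{{1,2}})\s+(?P<month>{MONTHS})\s+(?P<year>\d{{4}})"
)

def find_signature_start(text):
    """Return the character offset where the date line starts, or None."""
    match = SIGNATURE_START.search(text)
    return match.start() if match else None
```

Because the pattern uses `search` rather than `match`, it would also find the date line when it sits in the middle of a paragraph.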

ninpnin commented 13 hours ago
[Screenshot 2024-11-28 at 17.24.30]

Doesn't necessarily come as the first thing in the paragraph.

ninpnin commented 13 hours ago
[Screenshot 2024-11-28 at 17.26.05]

And susceptible to OCR errors.
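
One way a rule could tolerate such errors is approximate matching on the "Stockholm" token. This is a rough sketch using the standard library's `difflib`. The function name and the 0.8 threshold are assumptions and have not been tuned on the motions.

```python
from difflib import SequenceMatcher

def looks_like_stockholm(token, threshold=0.8):
    """Return True if token is close to "stockholm" despite OCR noise.

    The threshold is a guess, not a value tuned on real data.
    """
    cleaned = token.strip(".,:;").lower()
    return SequenceMatcher(None, cleaned, "stockholm").ratio() >= threshold
```

A token such as "Stockho1m" (digit 1 in place of l) would still pass, while unrelated place names would not.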

ninpnin commented 13 hours ago
[Screenshot 2024-11-28 at 17.30.05]

And it's not always the case. However, we can start with either approach if we want, and try to improve on it iteratively.

As you know, I generally prefer ML approaches since they are more robust to small random variations in format, OCR errors, etc. But this seems like an issue where we get 90%+ accuracy pretty easily regardless of the approach.

BobBorges commented 12 hours ago

> Doesn't necessarily come as the first thing in the paragraph..

no, but it should!

I'd like to see the paragraph end before the start of Stockholm, and the rest of it "div"ed off as the signature block.
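
A rough sketch of that split at the string level. The pattern and the name `split_signature` are hypothetical, and how the two parts map onto elements in the corpus XML is not specified in this thread.

```python
import re

# Hypothetical, simplified pattern: accepts any word in the month slot.
DATE_LINE = re.compile(r"Stockholm\s+den\s+\d{1,2}\s+\w+\s+\d{4}")

def split_signature(paragraph):
    """Split a paragraph into (body, signature) at the date line.

    Returns (paragraph, None) when no date line is found.
    """
    match = DATE_LINE.search(paragraph)
    if match is None:
        return paragraph, None
    body = paragraph[:match.start()].rstrip()
    signature = paragraph[match.start():].strip()
    return body, signature
```

The body would stay in the original paragraph and the signature part would go into its own block. A paragraph that starts with the date line yields an empty body.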