stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Correct misspelled entities #1261

Closed: mirix closed this issue 1 year ago

mirix commented 1 year ago

Hello,

I am working with transcribed text. The overall quality of the transcription is excellent, but it contains a large number of misspelled entity names, and those names are crucial for me.

For instance, "Swisscode" or "Swiss Gold" instead of the correct "Swissquote", or "Alex Hormital" instead of the correct "ArcelorMittal".

Before reinventing the wheel, I was wondering whether anyone is aware of an existing NER tool that can correct such mistakes.

Otherwise, any tips on the best approach for implementing such a solution would be greatly appreciated.

Best,

Ed

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AngledLuffa commented 1 year ago

Sorry, but we don't have a tool like that in Stanza. You could possibly retrain the lemmatizer or a similar tool to handle it, but the current lemmatizer doesn't take neighboring context into account, so it would perform poorly on a misspelling that has two plausible corrections.
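
Outside Stanza, one context-free baseline is to fuzzy-match each detected mention against a list of known entity names. The following is only a minimal sketch under stated assumptions, not a Stanza feature: it assumes you already have a gazetteer of canonical names (`KNOWN_ENTITIES` here is hypothetical), it uses Python's standard `difflib` module, and the helper names and the `cutoff` of 0.6 are illustrative choices. Like the lemmatizer, it ignores surrounding context, so it cannot choose between two equally plausible corrections.

```python
import difflib

# Hypothetical gazetteer of canonical entity names (an assumption for
# illustration; in practice this would come from your own domain data).
KNOWN_ENTITIES = ["Swissquote", "ArcelorMittal"]


def normalize(text):
    """Lowercase and drop non-alphanumerics, so "Swiss Gold" becomes "swissgold"."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def correct_entity(mention, known=KNOWN_ENTITIES, cutoff=0.6):
    """Return the closest canonical name, or the mention unchanged if none is close enough."""
    # Map normalized keys back to the canonical spelling.
    by_key = {normalize(name): name for name in known}
    matches = difflib.get_close_matches(
        normalize(mention), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[matches[0]] if matches else mention


if __name__ == "__main__":
    for mention in ["Swisscode", "Swiss Gold", "Alex Hormital", "Google"]:
        print(mention, "->", correct_entity(mention))
```

With this cutoff the three misspellings from the original question map to their intended names, although "Swiss Gold" clears the threshold only narrowly, so the cutoff would need tuning on real data. A phonetic comparison may suit transcription errors better than character similarity, and the mentions themselves could come from an NER pass over the transcript.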