hltcoe / patapsco

Cross language information retrieval pipeline

Character normalization for accented characters #46

Open eugene-yang opened 2 years ago

eugene-yang commented 2 years ago

Accented characters should be normalized for better matching, or there should at least be an option for the user to enable this.

Here is an example from CLEF.

{"id": "xxx", "lang": "eng", "query": "escap rold\u00e1n escap flight spanish civil guard - director lui rold\u00e1n", "text": "escape rold\u00e1n escape flight spanish civil guard-director luis rold\u00e1n", "report": null}

cash commented 2 years ago

Do you mean normalize diacritics or remove them? The current code normalizes them: https://github.com/hltcoe/patapsco/blob/master/patapsco/util/normalize.py#L240
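
To make the distinction concrete, here is a small sketch using only Python's standard `unicodedata` module. It does not reproduce what Patapsco's `normalize.py` does. It only assumes that "normalizing" means mapping equivalent Unicode encodings of a character to one canonical form, which leaves the accent in place.

```python
import unicodedata

composed = "rold\u00e1n"      # 'á' as a single precomposed code point
decomposed = "rolda\u0301n"   # 'a' followed by a combining acute accent

# The two strings render identically but are different code point sequences.
nfc_composed = unicodedata.normalize("NFC", composed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)

# After NFC normalization both encodings are the same string...
same_after_nfc = nfc_composed == nfc_decomposed
# ...but the accent is still present, so it does not match the plain spelling.
still_accented = nfc_composed != "roldan"
```

Under this assumption, normalization alone makes `roldán` match `roldán` regardless of encoding, but a query typed as `roldan` would still miss it.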

dlawrie commented 1 year ago

I think we want to remove diacritics.
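
A minimal sketch of what removal could look like, again using only the standard library. The function name `strip_diacritics` and the `remove` flag are hypothetical and not part of Patapsco; the flag reflects the request earlier in this thread that the behavior be user-selectable.

```python
import unicodedata


def strip_diacritics(text: str, remove: bool = True) -> str:
    """Hypothetical helper: optionally drop combining marks from text."""
    if not remove:
        # Only unify equivalent encodings; accents are kept.
        return unicodedata.normalize("NFC", text)
    # Decompose so each accent becomes a separate combining code point.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", kept)


# The "text" field from the CLEF example quoted in this issue.
text = "escape rold\u00e1n escape flight spanish civil guard-director luis rold\u00e1n"
print(strip_diacritics(text))
```

With removal enabled, `roldán` and `roldan` reduce to the same token. Note that this approach only affects characters with a canonical decomposition; letters such as `ø` or `ß` have none and pass through unchanged.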
