hltcoe / patapsco

Cross language information retrieval pipeline

Add Korean, Indonesian, and Hebrew support #47

Open dlawrie opened 1 year ago

dlawrie commented 1 year ago

Support the above languages in Patapsco

isoboroff commented 1 year ago

I'm getting ready to run some Korean data. Do you have a recommendation for how to go about selecting elements for the process stage? spaCy has a Korean pipeline... perhaps Moses would work, but I'm not sure.

If I had a Korean IR test collection I wouldn't be asking this question ;-)

cash commented 1 year ago

Stop words were merged in from pull request #48. I tested the pipeline on some Korean documents a few months back. I think Patapsco defaulted to the UD tokenization model. I had someone who reads Korean take a look and she thought it was reasonable. UD tokenization stats here: https://explosion.ai/blog/ud-benchmarks-v3-2

So the short answer is that you can set the language code and Patapsco should just work for Korean.

I will note that there is an issue when running Patapsco with multiple processes the first time it tries to download a model. It is basically a race condition: several processes can try to fetch the same model at once. I need to restructure how the models get downloaded automatically in the multiprocessing setting.
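For illustration, one common way to avoid this kind of first-run race is to guard the download with a lock file, so that only one process fetches the model and the others wait for it. The sketch below is not Patapsco's code: `exclusive_lock`, `ensure_model`, the `.complete` marker, and the `download` callback are all hypothetical names, and it assumes every process sees the same filesystem.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_lock(lock_path, poll_seconds=0.1, timeout=600.0):
    """Hold a lock file that is created atomically with O_CREAT | O_EXCL."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Fails with FileExistsError if another process holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_seconds)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def ensure_model(model_dir, download):
    """Run download(model_dir) at most once across concurrent callers.

    `download` is a hypothetical callback that fetches the model files
    into the given directory.
    """
    model_dir = Path(model_dir)
    done_marker = model_dir / ".complete"
    if done_marker.exists():
        return model_dir  # fast path: model already fully downloaded
    model_dir.parent.mkdir(parents=True, exist_ok=True)
    with exclusive_lock(str(model_dir) + ".lock"):
        # Re-check after acquiring the lock: another process may have
        # finished the download while this one was waiting.
        if not done_marker.exists():
            model_dir.mkdir(parents=True, exist_ok=True)
            download(model_dir)
            done_marker.touch()
    return model_dir
```

A process that crashes while holding the lock leaves a stale lock file behind, so a real implementation would need some cleanup strategy. Downloading the models once in the parent process before the workers start is a simpler alternative.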

cash commented 1 year ago

Sorry @isoboroff - forgot to tag you in my response

isoboroff commented 1 year ago

Indeed, with an up-to-date repo Korean indexes fine and passes sanity-check searches.

NTCIR 3-6 has Korean data in their CLIR track. I've reached out to Noriko Kando to get the data, but I bet Paul McNamee has it already.