Closed cramraj8 closed 2 years ago
Try esp as the language code for Spanish.
On Wed, May 18, 2022 at 9:46 PM Ramraj Chandradevan < @.***> wrote:
I am trying to use other CLEF language collections (e.g., Spanish). Patapsco never threw an error for the fas, rus, and zho languages. For Spanish (or other languages), when I set the 'lang' field to either 'es' or 'spa' in the topics file, the docs file, and the config file, I always get the error "Error: Unknown language code: es". Am I missing something in the config format, or does Patapsco only support those 3 languages? How do I do language-independent monolingual retrieval using Patapsco?
--
Dawn J. Lawrie Ph.D. Senior Research Scientist Human Language Technology Center of Excellence Johns Hopkins University 810 Wyman Park Drive Baltimore, MD 21211 @.*** https://hltcoe.jhu.edu/faculty/dawn-lawrie/
We use the ISO 639-3 language codes. I'll make that more explicit in the warning and documentation. If you continue to have a problem, please post back here.
I still get the error Error: Unknown language code: esp
Are those ISO 639-3 language codes pre-listed in the files?
Odd. We use a Python package (pycountry) to handle our language codes. Code here:
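@classmethod
def language(cls, code):
    lang = None
    # Two-letter codes are looked up as ISO 639-1, three-letter codes as ISO 639-3.
    if len(code) == 2:
        lang = pycountry.languages.get(alpha_2=code)
    elif len(code) == 3:
        lang = pycountry.languages.get(alpha_3=code)
    if lang is None:
        raise ConfigError(f"Unknown language code: {code}")
    return lang  # assumption: the pasted snippet ends in a bare "return"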
I'll take a look at this later today and post back what I find.
@cash Any luck reproducing the error? I used both the GitLab and GitHub versions of Patapsco, but I still get the error.
@cramraj8 I've been able to reproduce the issue and am looking into it.
I fixed an issue with the normalizer only supporting the languages that we used in the summer workshop. That change is merged into master. I will test a full Spanish pipeline later. The ISO 639-3 code is spa, not esp. Again, I'll improve the documentation there.
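For readers skimming the thread, the difference between the two-letter and three-letter codes can be sketched as follows. This is only an illustration built on a small hand-written table. It is not Patapsco's or pycountry's actual lookup, and the names `ISO_639_1_TO_3` and `to_iso_639_3` are hypothetical.

```python
# Hypothetical stand-in table: ISO 639-1 (two-letter) -> ISO 639-3 (three-letter).
# Only the languages mentioned in this thread are listed.
ISO_639_1_TO_3 = {
    "es": "spa",  # Spanish
    "fa": "fas",  # Persian
    "ru": "rus",  # Russian
    "zh": "zho",  # Chinese
}


def to_iso_639_3(code: str) -> str:
    """Return the ISO 639-3 code for a two- or three-letter code in the table."""
    code = code.strip().lower()
    if len(code) == 2 and code in ISO_639_1_TO_3:
        return ISO_639_1_TO_3[code]
    if len(code) == 3 and code in ISO_639_1_TO_3.values():
        return code
    raise ValueError(f"Unknown language code: {code}")


if __name__ == "__main__":
    # "es" and "spa" both resolve to Spanish; "esp" is not a valid code.
    for candidate in ("es", "spa", "esp"):
        try:
            print(candidate, "->", to_iso_639_3(candidate))
        except ValueError as err:
            print(candidate, "->", err)
```

In this sketch both 'es' and 'spa' resolve to Spanish, while 'esp' is rejected because it appears in neither column of the table.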
I am still getting an error at document processing:
2022-05-23 10:23:13,856 - patapsco.job - INFO - Stage 1: Starting processing of documents
Traceback (most recent call last):
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/bin/patapsco", line 8, in <module>
    sys.exit(main())
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/bin/main.py", line 19, in main
    runner.run()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/run.py", line 41, in run
    self.job.run(sub_job=sub_job_flag)
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/job.py", line 91, in run
    report = self._run()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/job.py", line 144, in _run
    self.stage1.run()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/pipeline.py", line 166, in run
    self.begin()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/pipeline.py", line 140, in begin
    task.begin()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/pipeline.py", line 100, in begin
    self.task.begin()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/text.py", line 524, in begin
    self.normalizer = NormalizerFactory.create(self.lang, self.processor_config.normalize)
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/util/normalize.py", line 198, in create
    raise ValueError(f"Unknown language: {lang}")
ValueError: Unknown language: spa
@cramraj8 Did you get the latest from master and did you install with the editable flag? If you did not install with that flag, you'll need to do "pip install ." again to get the latest in your environment.
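One way to check which copy of a package an environment actually imports is to ask Python where it would load the module from. The sketch below uses only the standard library; the helper names `module_origin` and `looks_like_site_packages` are hypothetical, and the site-packages test is a rough heuristic rather than a definitive check for an editable install.

```python
import importlib.util
from typing import Optional


def module_origin(package_name: str) -> Optional[str]:
    """Return the file path Python would import the package from, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None
    return spec.origin


def looks_like_site_packages(path: Optional[str]) -> bool:
    """Heuristic: a path under site-packages suggests a regular (non-editable) install."""
    return path is not None and "site-packages" in path


if __name__ == "__main__":
    origin = module_origin("patapsco")
    print("patapsco is imported from:", origin)
    print("regular install (heuristic):", looks_like_site_packages(origin))
```

If the printed path points into site-packages, as in the traceback above, changes pulled into a source checkout are not picked up until the package is reinstalled.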
Thanks, after installing via 'pip install -e .' it works. However, to get past the error about not finding the spa.txt file, I had to create an empty txt file. Is it recommended to use a Spanish stop words file under patapsco/resources/lucene/spa.txt? If so, do you know where I can get such a stop word list?
Looks like I need to add a guide for running Patapsco on a language other than the ones built in! Lucene pulls their Spanish stop words from Snowball (https://github.com/snowballstem/snowball-website/blob/master/algorithms/spanish/stop.txt). I'll go through and add common languages so users are less likely to hit this issue. I'll also put in a better check for that error. Thanks for all the reports.
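Snowball stop lists use vertical bars to introduce comments, so the file needs a little cleaning before it can serve as a plain word list. The sketch below assumes a one-word-per-line list is what is wanted; the exact format Patapsco expects for resources/lucene/spa.txt is not stated in this thread. The function name `parse_snowball_stopwords` and the `SAMPLE` text are illustrative only.

```python
from typing import List


def parse_snowball_stopwords(text: str) -> List[str]:
    """Extract stop words from Snowball-format text.

    Everything after a vertical bar is treated as a comment; the stop
    words are the whitespace-separated tokens that come before it.
    """
    words = []
    for line in text.splitlines():
        content = line.split("|", 1)[0]  # drop the trailing comment, if any
        words.extend(content.split())
    return words


# A tiny illustrative sample in the Snowball layout (not the full list).
SAMPLE = """\
 | A tiny sample in the Snowball stop list layout

de             |  from, of
la             |  the, her
que            |  who, that
"""

if __name__ == "__main__":
    # Print one stop word per line.
    print("\n".join(parse_snowball_stopwords(SAMPLE)))
```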
Great, thanks! Glad that Patapsco is now supporting less common languages.
I added spa.txt just now and will add more this week.