hltcoe / patapsco

Cross language information retrieval pipeline

Error: Unknown language code: es #39

Closed cramraj8 closed 2 years ago

cramraj8 commented 2 years ago

I am trying to use other CLEF language collections (e.g., Spanish). Patapsco never threw an error for the fas, rus, and zho languages. For Spanish (or other languages), when I set the 'lang' field to either 'es' or 'spa' in the topics file, the docs file, and the config file, I always end up with the error: Error: Unknown language code: es. Am I missing something in the config format, or does Patapsco only support those 3 languages? How can I do language-independent monolingual retrieval with Patapsco?

dlawrie commented 2 years ago

Try esp as the language code for Spanish.

cash commented 2 years ago

We use the ISO 639-3 language codes. I'll make that more explicit in the warning and documentation. If you continue to have a problem, please post back here.

cramraj8 commented 2 years ago

I still get the error: Error: Unknown language code: esp. Are the ISO 639-3 language codes pre-listed in the files?

cash commented 2 years ago

Odd. We use a Python package to handle our language codes. Code here:

    @classmethod
    def language(cls, code):
        lang = None
        if len(code) == 2:
            lang = pycountry.languages.get(alpha_2=code)
        elif len(code) == 3:
            lang = pycountry.languages.get(alpha_3=code)
        if lang is None:
            raise ConfigError(f"Unknown language code: {code}")
        return 

I'll take a look at this later today and post back what I find.
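
For readers following along, the two-letter versus three-letter lookup in the snippet above can be approximated without Patapsco or pycountry installed. This is only a sketch: the table ISO_639_1_TO_3 and the function normalize_language_code are hypothetical names, not part of Patapsco, and the table covers just the four languages mentioned in this thread. It illustrates that es and spa both name Spanish, while esp is not the code for it.

```python
# Hypothetical stand-in for the pycountry lookup; it covers only the languages
# named in this thread. Patapsco itself relies on pycountry, not on this table.
ISO_639_1_TO_3 = {"es": "spa", "fa": "fas", "ru": "rus", "zh": "zho"}
ISO_639_3 = set(ISO_639_1_TO_3.values())


def normalize_language_code(code: str) -> str:
    """Return the ISO 639-3 code for a two- or three-letter language code."""
    code = code.strip().lower()
    if len(code) == 2 and code in ISO_639_1_TO_3:
        return ISO_639_1_TO_3[code]
    if len(code) == 3 and code in ISO_639_3:
        return code
    raise ValueError(f"Unknown language code: {code}")


if __name__ == "__main__":
    # Try the three spellings that come up in this thread.
    for candidate in ("es", "spa", "esp"):
        try:
            print(candidate, "->", normalize_language_code(candidate))
        except ValueError as err:
            print(candidate, "->", err)
```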

cramraj8 commented 2 years ago

@cash Any luck reproducing the error? I tried both the GitLab and GitHub versions of Patapsco, but I am still getting the error.

cash commented 2 years ago

@cramraj8 I've been able to reproduce the issue and am looking into it.

cash commented 2 years ago

I fixed an issue with the normalizer only supporting the languages that we used in the summer workshop. That change is merged into master. I will test a full Spanish pipeline later. The ISO 639-3 code is spa, not esp. Again, I'll improve the documentation there.
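
As a rough illustration of this kind of limitation: a factory that only knows a fixed set of languages has to raise for anything else, whereas one with a generic fallback can still proceed. Everything below (GenericNormalizer, ChineseNormalizer, SPECIALIZED, create_normalizer) is hypothetical and is not Patapsco's actual NormalizerFactory; the real fix may look quite different.

```python
import unicodedata


class GenericNormalizer:
    """Language-agnostic fallback: Unicode NFKC normalization plus lowercasing."""

    def normalize(self, text: str) -> str:
        return unicodedata.normalize("NFKC", text).lower()


class ChineseNormalizer(GenericNormalizer):
    """Placeholder for a language-specific normalizer (hypothetical)."""


# Hypothetical registry of languages that have a dedicated normalizer.
SPECIALIZED = {"zho": ChineseNormalizer}


def create_normalizer(lang: str, strict: bool = False) -> GenericNormalizer:
    """Pick a normalizer for an ISO 639-3 code.

    With strict=True an unregistered language raises ValueError;
    otherwise the generic normalizer is used as a fallback.
    """
    cls = SPECIALIZED.get(lang)
    if cls is None:
        if strict:
            raise ValueError(f"Unknown language: {lang}")
        cls = GenericNormalizer
    return cls()
```

With the fallback in place, a language such as spa gets the generic behavior instead of stopping the pipeline; whether that is acceptable depends on how much language-specific normalization the collection needs.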

cramraj8 commented 2 years ago

I am still getting an error at document processing:

2022-05-23 10:23:13,856 - patapsco.job - INFO - Stage 1: Starting processing of documents
Traceback (most recent call last):
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/bin/patapsco", line 8, in <module>
    sys.exit(main())
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/bin/main.py", line 19, in main
    runner.run()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/run.py", line 41, in run
    self.job.run(sub_job=sub_job_flag)
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/job.py", line 91, in run
    report = self._run()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/job.py", line 144, in _run
    self.stage1.run()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/pipeline.py", line 166, in run
    self.begin()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/pipeline.py", line 140, in begin
    task.begin()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/pipeline.py", line 100, in begin
    self.task.begin()
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/text.py", line 524, in begin
    self.normalizer = NormalizerFactory.create(self.lang, self.processor_config.normalize)
  File "/home/hltcoe/rchandradevan/.conda/envs/patapsco-github/lib/python3.9/site-packages/patapsco/util/normalize.py", line 198, in create
    raise ValueError(f"Unknown language: {lang}")
ValueError: Unknown language: spa

cash commented 2 years ago

@cramraj8 Did you get the latest from master and did you install with the editable flag? If you did not install with that flag, you'll need to do "pip install ." again to get the latest in your environment.
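
One way to tell which copy of a package the environment is actually importing is to ask Python where it resolves the package. The traceback above shows files under site-packages, which is what a regular (non-editable) install looks like. The helper names below (installed_location, looks_like_copied_install) are hypothetical, and the site-packages check is a heuristic, not a definitive test for an editable install.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def installed_location(package: str) -> Optional[Path]:
    """Return the directory a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


def looks_like_copied_install(package: str) -> bool:
    """True if the package resolves inside site-packages or dist-packages.

    That usually means a regular install (a copy of the code), so pulling new
    commits into a source checkout will not change what gets imported.
    """
    location = installed_location(package)
    if location is None:
        return False
    return any(part in ("site-packages", "dist-packages") for part in location.parts)


if __name__ == "__main__":
    # "patapsco" is the package from this thread; any top-level name works.
    print(installed_location("patapsco"))
    print(looks_like_copied_install("patapsco"))
```

If the printed location is inside the environment rather than the git checkout, reinstalling (or installing with the editable flag) is needed before new commits take effect.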

cramraj8 commented 2 years ago

Thanks, after installing via 'pip install -e .' it works. However, to get past the error about not finding the spa.txt file, I had to create an empty txt file. Is it recommended to use a Spanish stop words file under patapsco/resources/lucene/spa.txt? If so, do you know where I can get such a stop word list?

cash commented 2 years ago

Looks like I need to add a guide for running Patapsco on a language other than the ones built in! Lucene pulls its Spanish stop words from Snowball: https://github.com/snowballstem/snowball-website/blob/master/algorithms/spanish/stop.txt I'll go through and add common languages so users are less likely to hit this issue. I'll also put in a better check for that error. Thanks for all the reports.
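
The Snowball stop word lists are not plain word-per-line files: text after a '|' character is a comment. A minimal sketch of turning that format into a bare word list follows. The function name parse_snowball_stopwords is hypothetical, the sample text is a made-up excerpt rather than the real file, and one word per line is only an assumed target format for patapsco/resources/lucene/spa.txt, so compare against the stop word files that ship with Patapsco before relying on it.

```python
from typing import Iterable, List


def parse_snowball_stopwords(lines: Iterable[str]) -> List[str]:
    """Extract stop words from Snowball-style text, where '|' starts a comment."""
    words: List[str] = []
    seen = set()
    for line in lines:
        # Keep only the part of the line before any comment marker.
        content = line.split("|", 1)[0]
        for word in content.split():
            if word not in seen:  # drop duplicates, keep first-seen order
                seen.add(word)
                words.append(word)
    return words


# Made-up sample in the Snowball layout (not the real stop.txt).
sample = """
 | A Spanish stop word list (sample only)
de             |  from, of
la             |  the, her
que            |  who, that
"""

stopwords = parse_snowball_stopwords(sample.splitlines())
# One word per line is an ASSUMED output format for spa.txt.
output_text = "\n".join(stopwords) + "\n"
```

To use it on the real list, save the stop.txt file linked above locally, pass its lines to parse_snowball_stopwords, and write output_text to the stop word file.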

cramraj8 commented 2 years ago

Great, thanks! Glad that Patapsco is now supporting less common languages.

cash commented 2 years ago

I added spa.txt just now and will add more this week.