thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

ELRC-portal_oficial_turismo_españa_www.spain.info-1-eng-por doesn't contain eng-por #101

Closed: XapaJIaMnu closed 2 years ago

XapaJIaMnu commented 2 years ago
mtdata get -l pt-en -tr ELRC-portal_oficial_turismo_españa_www.spain.info-1-eng-por -o /newdata/data/pt-en/pt-en-prod/original/corpus/mtdata/ELRC-portal_oficial_turismo_españa_www.spain.info-1-eng-por --compress
2022-01-31 08:47:51 entry.lang_pair:24 INFO:: Suggestion: Use codes por-eng instead of pt-en. Let's make a little space for all languages of our planet 😢.
2022-01-31 08:47:51 main.get_data:30 WARNING:: Args are ignored: {'verbose': False, 'reindex': False, 'task': 'get'}
2022-01-31 08:47:51 __init__.get_instance:48 INFO:: Loading index from cache /.mtdata/mtdata.index.0.3.2.pkl
2022-01-31 08:47:52 cache.__post_init__:34 INFO:: Local cache is at /.mtdata
2022-01-31 08:47:52 cache.download:135 INFO:: Acquiring lock on /.mtdata/elrc-share.eu/390f/df5ff9dd99d0561e1b47503914b5/ELRC_2410.zip._lock
2022-01-31 08:47:52 cache.download:140 INFO:: GET https://elrc-share.eu/repository/download/04dfcaec9ca011e9a7e100155d02670640ce598203e246e8ae4fe838be231213/ → /.mtdata/elrc-share.eu/390f/df5ff9dd99d0561e1b47503914b5/ELRC_2410.zip
Traceback (most recent call last):
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/bin/mtdata", line 8, in <module>
    sys.exit(main())
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/__main__.py", line 9, in main
    main.main()
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/main.py", line 208, in main
    get_data(**vars(args))
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/main.py", line 33, in get_data
    dataset = Dataset.prepare(
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/data.py", line 89, in prepare
    dataset.add_train_entries(train_entries, merge_train=merge_train, compress=compress,
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/data.py", line 105, in add_train_entries
    self.add_parts(self.train_parts_dir, entries, drop_noise=self.drop_train_noise,
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/data.py", line 268, in add_parts
    n_good, n_bad = self.add_part(dir_path=dir_path, entry=ent, drop_noise=drop_noise,
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/data.py", line 302, in add_part
    for rec in parser.read_segs():
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/parser.py", line 101, in read_segs
    for rec in data:
  File "/firefox-translations-training-new/.snakemake/conda/a7d2ecbe04b725965f388f20197153b8/lib/python3.9/site-packages/mtdata/tmx.py", line 73, in read_tmx
    raise Exception(f"Nothing for {langs[0]}-{langs[1]} in TMX {path}")
Exception: Nothing for eng-por in TMX ZipPath(root=PosixPath('/.mtdata/elrc-share.eu/390f/df5ff9dd99d0561e1b47503914b5/ELRC_2410.zip'), name='archive/Spain.info/20190529_OldProvider_en-GB_pt-PT_clean.tmx')
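
The file name in the exception (`en-GB_pt-PT_clean.tmx`) suggests the TMX tags its segments with region-qualified codes rather than plain `eng`/`por`. One way to confirm this is to list the language codes the file actually declares. The following is a minimal sketch, assuming a standard TMX layout in which each `<tuv>` element carries an `xml:lang` (or legacy `lang`) attribute; `tmx_languages` is a hypothetical helper written for this check and is not part of mtdata.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Fully qualified name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_languages(source):
    """Count the language codes declared on <tuv> elements of a TMX document.

    `source` is a file path or a binary file-like object.
    Hypothetical diagnostic helper; not part of mtdata.
    """
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tuv":
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = elem.get(XML_LANG) or elem.get("lang")
            if lang:
                counts[lang] += 1
            elem.clear()  # release the element to keep memory bounded
    return counts


if __name__ == "__main__":
    # Small in-memory TMX standing in for the real file (assumed structure).
    sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
  <tu><tuv xml:lang="en-GB"><seg>Hello</seg></tuv>
      <tuv xml:lang="pt-PT"><seg>Ol\xc3\xa1</seg></tuv></tu>
</body></tmx>"""
    print(tmx_languages(io.BytesIO(sample)))
```

To run it against the cached archive, the TMX member can be opened straight from the zip with `zipfile.ZipFile(path).open(name)` and passed in as `source`.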
kpu commented 2 years ago

This is fixed in the develop branch with the following command:

mtdata get -l pt-en -tr ELRC-portal_oficial_turismo_españa_www.spain.info-1-eng_GB-por_PT -o . --compress
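
The new dataset ID carries region tags (`eng_GB-por_PT`), which line up with the `en-GB`/`pt-PT` codes in the TMX file name. As an illustration of why an exact comparison against `eng`/`por` can come up empty, here is a hypothetical sketch that compares codes on the language part only. The names `ISO_639_1_TO_3`, `primary_language`, and `same_language` are assumptions made for this example; whether mtdata resolves codes this way is not confirmed by this thread.

```python
import re

# Tiny illustrative ISO 639-1 -> ISO 639-3 table (assumption: only the two
# languages from this issue); a real tool would use a full registry.
ISO_639_1_TO_3 = {"en": "eng", "pt": "por"}


def primary_language(code):
    """Return the three-letter language part of a code such as 'en-GB' or 'por_PT'."""
    lang = re.split(r"[-_]", code.strip())[0].lower()
    return ISO_639_1_TO_3.get(lang, lang)


def same_language(requested, declared):
    """Compare two language codes while ignoring any region subtag."""
    return primary_language(requested) == primary_language(declared)


if __name__ == "__main__":
    # An exact string comparison misses the region-tagged code...
    print("eng" == "en-GB")
    # ...while comparing on the language part alone matches it.
    print(same_language("eng", "en-GB"))
    print(same_language("por", "pt-PT"))
```

Ignoring the region is a deliberate simplification: it would also treat `pt-BR` and `pt-PT` as the same language, which may or may not be what a given corpus selection wants.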
thammegowda commented 2 years ago

This issue has been fixed! (Tested on 0.3.6)