thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

ELRC-euipo_law-1-eng-fra hits 403 (forbidden) #97

Closed XapaJIaMnu closed 2 years ago

XapaJIaMnu commented 2 years ago

Output:

mtdata get -l fr-en -tr ELRC-euipo_law-1-eng-fra -o /mnt/nanna0/nbogoych/data/data/fr-en/fr-en-prod/original/corpus/mtdata/ELRC-euipo_law-1-eng-fra --compress
2022-01-28 14:56:07 entry.lang_pair:24 INFO:: Suggestion: Use codes fra-eng instead of fr-en. Let's make a little space for all languages of our planet 😢.
2022-01-28 14:56:07 main.get_data:30 WARNING:: Args are ignored: {'verbose': False, 'reindex': False, 'task': 'get'}
2022-01-28 14:56:07 __init__.get_instance:48 INFO:: Loading index from cache /home/s1031254/.mtdata/mtdata.index.0.3.2.pkl
2022-01-28 14:56:09 cache.__post_init__:34 INFO:: Local cache is at /home/s1031254/.mtdata
2022-01-28 14:56:09 cache.download:135 INFO:: Acquiring lock on /home/s1031254/.mtdata/elrc-share.eu/408a/e3d8cb7068b349a2a76d96f9fda4/ELRC_1075.zip._lock
2022-01-28 14:56:09 cache.download:140 INFO:: GET https://elrc-share.eu/repository/download/aac3b38c03ea11e9b7d400155d026706813f548412e94fc185f1ee9a17c5d5b0/ → /home/s1031254/.mtdata/elrc-share.eu/408a/e3d8cb7068b349a2a76d96f9fda4/ELRC_1075.zip
Traceback (most recent call last):
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/bin/mtdata", line 8, in <module>
    sys.exit(main())
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/__main__.py", line 9, in main
    main.main()
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/main.py", line 208, in main
    get_data(**vars(args))
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/main.py", line 33, in get_data
    dataset = Dataset.prepare(
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/data.py", line 89, in prepare
    dataset.add_train_entries(train_entries, merge_train=merge_train, compress=compress,
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/data.py", line 105, in add_train_entries
    self.add_parts(self.train_parts_dir, entries, drop_noise=self.drop_train_noise,
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/data.py", line 268, in add_parts
    n_good, n_bad = self.add_part(dir_path=dir_path, entry=ent, drop_noise=drop_noise,
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/data.py", line 293, in add_part
    path = self.cache.get_entry(entry)
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/cache.py", line 40, in get_entry
    local = self.get_local_path(entry.url, filename=entry.filename, fix_missing=fix_missing)
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/cache.py", line 96, in get_local_path
    self.download(url, local)
  File "/mnt/nanna0/nbogoych/firefox-translations-training/.snakemake/conda/439ad8e754bf2e3a1e62efafc89947ed/lib/python3.9/site-packages/mtdata/cache.py", line 142, in download
    assert resp.status_code == 200, resp.status_code
AssertionError: 403
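
The traceback ends in a bare `assert resp.status_code == 200`, so the only thing reported is `AssertionError: 403`. Below is a minimal sketch of how a download helper could report the failing URL and a likely cause instead. It is not mtdata's implementation: `DownloadError`, `describe_status`, and `download` are hypothetical names, and the sketch uses the standard library's `urllib` rather than mtdata's own HTTP client.

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def describe_status(url: str, status: int) -> str:
    """Build a readable message for a failed GET (hypothetical helper)."""
    hints = {
        403: "access is forbidden; the resource may have been withdrawn "
             "or its license changed",
        404: "the resource was not found; the URL may be stale",
    }
    hint = hints.get(status, "unexpected HTTP status")
    return f"GET {url} failed with HTTP {status}: {hint}"


class DownloadError(RuntimeError):
    """Hypothetical exception type; not part of mtdata."""

    def __init__(self, url: str, status: int):
        self.url = url
        self.status = status
        super().__init__(describe_status(url, status))


def download(url: str, local: Path) -> Path:
    """Fetch url into local, raising DownloadError on an HTTP error status."""
    try:
        # urlopen raises HTTPError for 4xx/5xx responses before the
        # output file is opened, so no empty file is left behind.
        with urllib.request.urlopen(url) as resp, open(local, "wb") as out:
            shutil.copyfileobj(resp, out)
    except urllib.error.HTTPError as err:
        raise DownloadError(url, err.code) from err
    return local
```

With a helper like this, a caller could catch `DownloadError`, inspect `.status`, and skip or report the dataset instead of stopping on an unexplained assertion.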

kpu commented 2 years ago

This one is weird.

The download has disappeared for https://elrc-share.eu/repository/browse/euipo-ip-case-law-french-english-processed/aac3b38c03ea11e9b7d400155d026706813f548412e94fc185f1ee9a17c5d5b0/

but it is still available for other languages: https://elrc-share.eu/repository/browse/euipo-ip-case-law-german-english-processed/6652685203ee11e9b7d400155d0267062b5a56242402418190a3abbfee156e4c/

They also changed the license from publicDomain to CC-BY-NC-ND-4.0 for the remaining ones.

Asked Victoria what's up.

I have copied 1075.zip to nanna:~heafield if you want it, but I will delete this dataset from the ELRC list.

kpu commented 2 years ago

Closing as fixed because it's removed.