thammegowda opened this issue 2 years ago · Status: Open
Also, note that there is a bit of inconsistency inside the zipping of item (3), the wikipedia zip, as well:

```bash
$ unzip oneindia_20210320_en_ml.zip
Archive:  oneindia_20210320_en_ml.zip
   creating: en-ml/
  inflating: en-ml/oneindia_train.ml
  inflating: en-ml/oneindia_train.en

$ unzip pibarchives_2014_2016_en_ml.zip
Archive:  pibarchives_2014_2016_en_ml.zip
  inflating: en-ml/.DS_Store
  inflating: __MACOSX/en-ml/._.DS_Store
  inflating: en-ml/pib_arch_train.en
  inflating: en-ml/pib_arch_train.ml

$ unzip wikipedia-en-ml-20210201.zip
Archive:  wikipedia-en-ml-20210201.zip
  inflating: en-ml/ml.txt
  inflating: en-ml/en.txt
```

The first two zips name their members after the dataset (`oneindia_train.*`, `pib_arch_train.*`), while the third uses bare `en.txt`/`ml.txt`; the `pibarchives` zip also ships macOS artifacts (`.DS_Store` and a `__MACOSX/` directory). A consistent scheme such as `en-ml/wikipedia_train.{en,ml}` could have made scripts/automation tools simple to write.
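To illustrate the automation point, here is a minimal, hypothetical sketch (the function name and the filtering rules are mine, not part of the datasets or of mtdata) of the uniform handling that a consistent member layout would allow, including skipping the stray macOS files seen in the `pibarchives` zip:

```python
import os
import zipfile

def list_train_files(zip_path):
    """Return the real data members of a corpus zip, in sorted order,
    skipping directory entries and macOS junk (.DS_Store, __MACOSX/)."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(
            m for m in zf.namelist()
            if not m.endswith("/")                        # skip directory entries
            and "__MACOSX" not in m                       # skip macOS resource forks
            and not os.path.basename(m).startswith(".")   # skip .DS_Store etc.
        )
```

With every zip following the same `en-ml/<name>_train.<lang>` convention, this one helper would cover the whole collection with no per-dataset special cases.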
I added these datasets to mtdata v0.3.2:

```bash
pip install -I mtdata==0.3.2
```

P.S. https://github.com/thammegowda/mtdata/blob/master/mtdata/index/anuvaad.py
Hi,

Thanks for your efforts in creating/curating these datasets! They are priceless and greatly advance NLP for Indian languages.

I tried adding them to `mtdata`: https://github.com/thammegowda/mtdata/issues/81

Since the README says your datasets are still growing, I am wondering what the best long-term strategy is for keeping them in sync. For now, I can `grep -i -o 'http[^ ]*zip' README.md`, but the immediate concern is consistency in determining the name, version, and languages of a dataset from its URL. The way the current files are named (the file name acts as the ID for a corpus) is a bit inconsistent. For example, consider these:
1. `oneindia_20210320_en_ml.zip`: we can split on `_` and get `(name, version, lang1, lang2)`, so this is great. We can see `oneindia` is the name, `20210320` is the version, and `en_ml` are the langs.
2. `pibarchives_2014_2016_en_ml.zip`: this has `2014_2016` as the version, though it would have been nice to have `2014to2016v1` as the version, so that splitting on `_` would give exactly `(name, version, lang1, lang2)`, as in item 1.
3. `wikipedia-en-ml-20210201.zip`: this uses `-` as the separator and puts the version last, so splitting does not yield `(name, version, lang1, lang2)`.

There are more datasets matching the item (1) pattern than the item (3) pattern, so I am inclined to call item (3) the abnormal one. Could you please consider having a consistent format for dataset IDs? It would greatly help automated downloaders such as `mtdata`. Otherwise, do you really want your users to manually download 196 zip files via a browser, then extract and merge them? :)
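To make the asymmetry concrete, the parser an automated downloader needs is only a few lines when the item (1)/(2) convention holds. This is a hypothetical sketch (the regex and the function name are mine, not mtdata's API), assuming the `<name>_<version>_<lang1>_<lang2>.zip` layout described above:

```python
import re

# Hypothetical ID parser for the item (1)/(2) layout:
#   <name>_<version>_<lang1>_<lang2>.zip,
# where the version may itself contain one underscore (e.g. 2014_2016).
ID_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z]+)_(?P<version>\d+(?:_\d+)?)_"
    r"(?P<lang1>[a-z]{2,3})_(?P<lang2>[a-z]{2,3})\.zip$"
)

def parse_id(filename):
    """Split a corpus zip name into (name, version, lang1, lang2),
    or return None if the name does not follow the convention."""
    m = ID_PATTERN.match(filename)
    if not m:
        return None  # e.g. the item (3) pattern does not fit
    return m.group("name", "version", "lang1", "lang2")
```

Under this sketch, items (1) and (2) parse cleanly, while item (3) falls through and would need a special case.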
Thanks.
P.S. https://github.com/thammegowda/mtdata#dataset-id