Open · jowagner opened this issue 4 years ago
If you have a .conllu version of these files ready, that would save me a few minutes converting the files. I only need the first two columns populated, as below. No worries if not.
$ head ga-common_crawl-000.conllu
1 Iarscoláire _ _ _ _ _ _ _ _
2 é _ _ _ _ _ _ _ _
3 de _ _ _ _ _ _ _ _
4 chuid _ _ _ _ _ _ _ _
5 na _ _ _ _ _ _ _ _
6 meánscoile _ _ _ _ _ _ _ _
7 Coláiste _ _ _ _ _ _ _ _
8 Eoin _ _ _ _ _ _ _ SpaceAfter=No
9 . _ _ _ _ _ _ _ _
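For reference, a minimal sketch of such a conversion, assuming whitespace-tokenised input with one sentence per line; the filenames are placeholders and SpaceAfter annotations are left out:

# hypothetical sketch: turn tokenised text (one sentence per line) into minimal
# CoNLL-U with only the ID and FORM columns populated; all other columns stay "_"
awk '{ for (i = 1; i <= NF; i++) printf "%d\t%s\t_\t_\t_\t_\t_\t_\t_\t_\n", i, $i; print "" }' \
    ga-common_crawl-000.txt > ga-common_crawl-000.conllu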
Yes, this was the training corpus used for the first multilingual_bert run, when we used this file to do continued pre-training of multilingual_bert. I added the training corpus to Google Drive so it could be accessed later for reproducibility. The pipeline has changed a lot since then, and this file would be considered obsolete; it is always excluded from subsequent runs.
I'm not sure if I can recover the exact configuration that was used to create this file, but it was created before this repo existed, i.e. it was created using the scripts in https://github.com/jbrry/Irish-UD-Parsing and would have used a different version of UDPipe for tokenisation/segmentation. I also think I no longer have this repository on grove due to needing to free up space.
In any case, this file would have been created as follows (I will also add this to a README in the relevant folder on Google Drive).
1. conll17 data is tokenized/segmented using UDPipe.
2. conll17 and data on Google Drive are combined into a single file (with some manual exclusion of Paracrawl and NCI_Cleaned).
3. Continued pre-training of multilingual_bert is run using this training corpus.
Sorry for not having been more clear. As we are highly unlikely to release a model based on this old file, we don't need documentation for it. I meant: "Please update this file to what is the input of our current best BERT model and then add a readme identifying this BERT model." Ideally, this should be repeated each time we have a new best model. Team members may want to use this file (or a tgz archive of multiple files) to look for issues, train their own bert/roberta/xlm/elmo/fasttext/word2vec model, or use the data for semi-supervised training of NLP components, e.g. tri-train a UD parser.
Ok, good idea. Yes, I intend to upload some sort of a corpus snapshot, e.g. the version of gdrive_file_list.csv which shows exactly which files were downloaded from Google Drive.
If you want these files as individual files, I can upload all of the individual .bz2 files which are stored in data/ga/<corpus>/raw/ by scripts/download_handler.py. Then we have access to all of the raw files we used prior to bucketing and the subsequent tokenisation/segmentation/filtering which takes place in wiki-bert-pipeline. In order to be fully deterministic, the filtering config files will also be included in the snapshot.
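As a rough sketch, such a snapshot could be bundled as below, using the paths mentioned above; the configs/ location for the filtering config files is a placeholder, not a path confirmed in this repo:

# hypothetical sketch: collect the file list, the raw .bz2 files and the
# filtering configs into one snapshot archive
tar czf ga_corpus_snapshot.tgz \
    data/ga/gdrive/gdrive_file_list.csv \
    data/ga/*/raw/*.bz2 \
    configs/   # placeholder for the filtering config files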
How about referring to a commit via its hash, such as commit 04d4a12cd76b81553cc05cdf56dfe852c6e71b9f, in the readme and including the options used with text_processor.py? This should cover everything needed, at least if we follow the suggestion in https://github.com/jbrry/Irish-BERT/issues/39#issuecomment-734368519 to stop using intermediate files on gdrive and instead carry out all processing with scripts in this repo, starting from original files that do not change during the project.
BTW: I don't see create_pretraining_data.sh in the scripts folder.
Would the most recent commit in the repo suffice? e.g.:
# take the first line from git log and print the hash
git log | head -n 1 | awk -F " " '{print $2}'
Yes, the arguments supplied to text_processor.py should inform you of the datasets being used, e.g.:
python scripts/text_processor.py --datasets conll17 gdrive NCI oscar --bucket-size 100000000 --input-type raw --output-type processed
It just doesn't give you the list of files which were used from Google Drive; that can be found via:
cat data/ga/gdrive/gdrive_filelist.csv
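As a rough sketch, the commit hash, the processing command and the Google Drive file list could all be recorded together; SNAPSHOT_README.txt is a placeholder name, not an existing file:

# hypothetical sketch: record commit, processing command and Google Drive
# file list in one place for the snapshot
{
  echo "commit: $(git log | head -n 1 | awk '{print $2}')"
  echo "command: python scripts/text_processor.py --datasets conll17 gdrive NCI oscar --bucket-size 100000000 --input-type raw --output-type processed"
  echo "gdrive files:"
  cat data/ga/gdrive/gdrive_filelist.csv
} > SNAPSHOT_README.txt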
BTW: I don't see create_pretraining_data.sh in the scripts folder.
Yes, sorry; given that the initial training corpus train.txt this issue was referring to was uploaded at the start of that year, the script is located in the old repo Irish-UD-Parsing: https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh
Would the most recent commit in the repo suffice? e.g.:
# take the first line from git log and print the hash
git log | head -n 1 | awk -F " " '{print $2}'
https://stackoverflow.com/questions/949314/how-to-retrieve-the-hash-for-the-current-commit-in-git shows simpler ways.
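For example, that thread reduces this to a single command:

# print the full hash of the current commit
git rev-parse HEAD
# or an abbreviated hash
git rev-parse --short HEAD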
It just doesn't give you the list of files which were used from Google Drive; that can be found via:
cat data/ga/gdrive/gdrive_filelist.csv
If this file was in the repo it would be covered by the commit. Is there anything sensitive in there? If you agree that it would be a good idea to move it into the repo let's check with Teresa whether the list of filenames can be published or must stay secret.
BTW: I don't see create_pretraining_data.sh in the scripts folder.
Yes, sorry; given that the initial training corpus train.txt this issue was referring to was uploaded at the start of that year, the script is located in the old repo Irish-UD-Parsing: https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh
Do we still need or use this script or is it obsolete?
I created an issue for gdrive_filelist.csv. Assign it to Teresa if you agree it is a good idea. Otherwise, close issue #43 with the "wont-fix" label.
The file Irish_Data > processed_ga_files_for_BERT_runs > train.txt mentioned in issue #32 is severely out of date and there is no documentation of what settings were used. Please update it and add a readme. Given that BERT requires multiple input files for its next sentence objective, it will also be better for reproducibility to provide these individual files, e.g. as a .tgz.
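For illustration, a minimal sketch of such a bundle; the file locations below are placeholders, not the repo's actual layout:

# hypothetical sketch: bundle the individual pre-training input files and the readme
tar czf processed_ga_files_for_BERT_runs.tgz \
    README.md \
    data/ga/*/processed/*.txt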