jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Provide up-to-date pre-processed text files #35

Open · jowagner opened this issue 3 years ago

jowagner commented 3 years ago

The file Irish_Data > processed_ga_files_for_BERT_runs > train.txt mentioned in issue #32 is severely out of date and there is no documentation of what settings were used. Please update it and add a readme. Given that BERT requires multiple input files for its next-sentence objective, it would also be better for reproducibility to provide these individual files, e.g. as a .tgz.
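
For reference, bundling the individual input files could be as simple as the following sketch (the shard file names are only illustrative):

# pack the individual pre-training input files into one archive
tar -czf bert_pretraining_input.tgz shard-*.txt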

jowagner commented 3 years ago

If you have a .conllu version of these files ready that would save me a few minutes converting the files. I only need the first two columns populated as below. No worries if not.

$ head ga-common_crawl-000.conllu 
1       Iarscoláire     _       _       _       _       _       _       _       _
2       é       _       _       _       _       _       _       _       _
3       de      _       _       _       _       _       _       _       _
4       chuid   _       _       _       _       _       _       _       _
5       na      _       _       _       _       _       _       _       _
6       meánscoile      _       _       _       _       _       _       _       _
7       Coláiste        _       _       _       _       _       _       _       _
8       Eoin    _       _       _       _       _       _       _       SpaceAfter=No
9       .       _       _       _       _       _       _       _       _
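
In case you don't, here is a rough sketch of how I would produce the two-column layout above from whitespace-tokenised, one-sentence-per-line text (the .txt input name is only illustrative, and SpaceAfter=No annotations cannot be recovered this way):

# number the tokens of each sentence and pad the remaining eight CoNLL-U columns with underscores
awk '{ for (i = 1; i <= NF; i++) printf "%d\t%s\t_\t_\t_\t_\t_\t_\t_\t_\n", i, $i; print "" }' ga-common_crawl-000.txt > ga-common_crawl-000.conllu
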
jbrry commented 3 years ago

Yes, this was the training corpus used for the first multilingual_bert run, where we used this file for continued pre-training of multilingual_bert. I added the training corpus to Google Drive so it could be accessed later for reproducibility. The pipeline has changed a lot since then, and this file would now be considered obsolete; it is always excluded from subsequent runs.

I'm not sure if I can recover the exact configuration that was used to create this file, but it was created before this repo existed, i.e. it was created using the scripts in https://github.com/jbrry/Irish-UD-Parsing and would have used a different version of UDPipe for tokenisation/segmentation. I also think I no longer have that repository on grove because I needed to free up space.

In any case, this file would have been created as follows (I will also add this to a README in the relevant folder on Google Drive).

  1. The conll17 data is manually downloaded.
  2. The conll17 data is tokenised/segmented using UDPipe.
  3. The conll17 data and the data on Google Drive are combined into a single file (with some manual exclusions, e.g. Paracrawl and NCI_Cleaned).
  4. Run https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/text_processor.py on the input file from step 3.
  5. Run https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh to break the input file into shards.
  6. Run https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/run_pretraining.sh to do continued training of multilingual_bert using this training corpus.
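
Roughly, reproducing steps 4-6 would start from the old repo as sketched below; the options passed to these scripts in the original run are not recoverable, so only the script locations are shown:

# sketch only: the original arguments to these scripts are unknown
git clone https://github.com/jbrry/Irish-UD-Parsing.git
cd Irish-UD-Parsing
# step 4: clean/filter the combined file with scripts/text_processor.py
# step 5: split the cleaned text into shards with scripts/create_pretraining_data.sh
# step 6: continued pre-training of multilingual_bert with scripts/run_pretraining.sh
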
jowagner commented 3 years ago

Sorry for not being clearer. As we are highly unlikely to release a model based on this old file, we don't need documentation for it. I meant "Please update this file to the input of our current best BERT model and then add a readme identifying this BERT model." Ideally, this should be repeated each time we have a new best model. Team members may want to use this file (or a tgz archive of multiple files) to look for issues, train their own bert/roberta/xlm/elmo/fasttext/word2vec model, or use the data for semi-supervised training of NLP components, e.g. to tri-train a UD parser.

jbrry commented 3 years ago

Ok, good idea. Yes, I intend to upload some sort of corpus snapshot, e.g. the version of gdrive_file_list.csv which shows exactly which files were downloaded from Google Drive.

If you want these files as individual files, I can upload all of the individual .bz2 files which are stored in data/ga/<corpus>/raw/ by scripts/download_handler.py. Then we have access to all of the raw files we used prior to bucketing and the subsequent tokenisation/segmentation/filtering which takes place in the wiki-bert-pipeline. To make this fully deterministic, the filtering config files will also be included in the snapshot.
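
Assembling that snapshot could then look something like the sketch below (the location of the filtering config files is a guess on my part):

# collect the raw downloads, the Google Drive file list and the filtering configs into one dated archive
tar -czf corpus_snapshot_$(date +%Y%m%d).tgz \
    data/ga/*/raw/*.bz2 \
    data/ga/gdrive/gdrive_filelist.csv \
    path/to/filtering_configs/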

jowagner commented 3 years ago

How about referring to a commit via its hash, such as commit 04d4a12cd76b81553cc05cdf56dfe852c6e71b9f, in the readme and including the options used with text_processor.py? This should cover everything needed, at least if we follow the suggestion in https://github.com/jbrry/Irish-BERT/issues/39#issuecomment-734368519 to stop using intermediate files on gdrive and instead carry out all processing with scripts in this repo, starting from original files that do not change during the project.
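
As an illustration, the relevant readme entry could be as small as the two lines below (a sketch only; the exact text_processor.py options would be filled in from the actual run):

Built with https://github.com/jbrry/Irish-BERT at commit 04d4a12cd76b81553cc05cdf56dfe852c6e71b9f
Command: python scripts/text_processor.py <options used for this run>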

BTW: I don't see create_pretraining_data.sh in the scripts folder.

jbrry commented 3 years ago

Would the most recent commit in the repo suffice? e.g.:

# take the first line from git log and print the hash
git log | head -n 1 | awk -F " " '{print $2}'

Yes, the arguments supplied to text_processor.py should inform you of the datasets being used, e.g.:

python scripts/text_processor.py --datasets conll17 gdrive NCI oscar --bucket-size 100000000 --input-type raw --output-type processed

It just doesn't give you the list of files which were used from Google Drive; that can be found via:

cat data/ga/gdrive/gdrive_filelist.csv

> BTW: I don't see create_pretraining_data.sh in the scripts folder.

Yes, sorry: given that the initial training corpus train.txt this issue was referring to was uploaded at the start of that year, the script is located in the old repo Irish-UD-Parsing: https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh

jowagner commented 3 years ago

> Would the most recent commit in the repo suffice? e.g.:
>
> # take the first line from git log and print the hash
> git log | head -n 1 | awk -F " " '{print $2}'

https://stackoverflow.com/questions/949314/how-to-retrieve-the-hash-for-the-current-commit-in-git shows simpler ways.
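
For example, a single command does it:

# print the hash of the current commit
git rev-parse HEAD
# or the abbreviated form
git rev-parse --short HEAD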

> It just doesn't give you the list of files which were used from Google Drive; that can be found via:
>
> cat data/ga/gdrive/gdrive_filelist.csv

If this file were in the repo, it would be covered by the commit. Is there anything sensitive in there? If you agree that it would be a good idea to move it into the repo, let's check with Teresa whether the list of filenames can be published or must stay secret.
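
If it can be published, tracking it would be a one-off (assuming the path used above):

# put the file list under version control so the commit hash covers it
git add data/ga/gdrive/gdrive_filelist.csv
git commit -m "Track the gdrive file list for reproducibility"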

> BTW: I don't see create_pretraining_data.sh in the scripts folder.
>
> Yes, sorry: given that the initial training corpus train.txt this issue was referring to was uploaded at the start of that year, the script is located in the old repo Irish-UD-Parsing: https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh

Do we still need or use this script or is it obsolete?

jowagner commented 3 years ago

I created an issue for gdrive_filelist.csv. Assign it to Teresa if you agree it is a good idea. Otherwise, close issue #43 with the "wont-fix" label.