Closed: Kvit closed this issue 3 years ago
Here is output from Colab Cell
! python3 /content/simple_elmo_training/bilm/train_elmo.py --train_prefix /content/cloud/train/ --size $SIZE --vocab_file $VOCAB --save_dir $OUT
Note: I have added some code to train_elmo.py that prints the training directory and a sample of its file list; you can see this test output in the two lines after the library-loading message below.
2021-04-02 17:07:53.587102: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
added: THIS IS PREFIX FOR TRAINING DATA /content/cloud/train/
>> os.listdir(prefix)[0:10]: ['train1336486.txt', 'train687356.txt', 'train6216776.txt', 'train777706.txt', 'train1681206.txt', 'train576156.txt', 'train1687461.txt', 'train1402511.txt', 'train5291036.txt', 'train4666231.txt']
Found 0 shards at /content/cloud/train/
Traceback (most recent call last):
File "/content/simple_elmo_training/bilm/train_elmo.py", line 86, in <module>
main(arguments)
File "/content/simple_elmo_training/bilm/train_elmo.py", line 62, in main
data = BidirectionalLMDataset(prefix, vocab, test=False, shuffle_on_load=True)
File "/content/simple_elmo_training/bilm/data.py", line 449, in __init__
shuffle_on_load=shuffle_on_load)
File "/content/simple_elmo_training/bilm/data.py", line 349, in __init__
self._ids = self._load_random_shard()
File "/content/simple_elmo_training/bilm/data.py", line 370, in _load_random_shard
shard_name = self._choose_random_shard()
File "/content/simple_elmo_training/bilm/data.py", line 355, in _choose_random_shard
shard_name = self._shards_to_choose.pop()
IndexError: pop from empty list
Is the problem that the script expects only gzipped files? https://github.com/ltgoslo/simple_elmo_training/blob/afbdeefe85c81f0ce69b5c0da33cdcdcf8c0a7f9/bilm/data.py#L340
I can confirm that changing the line linked above to self._all_shards = glob.glob(filepattern + '/*.*') solves the problem of files not being found, but it creates another, type-related problem further down in the code.
We usually train on compressed text files (this is faster and more convenient), which is why the code looks for *.gz files only.
But it probably makes sense to also look for *.txt in addition to this. Can you create a pull request with this change? It could be something like self._all_shards = glob.glob(filepattern + '*'). What other problems have you encountered with this fix?
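For illustration, the discussed change could look something like the sketch below. `find_shards` is a hypothetical standalone helper, not the actual code in bilm/data.py, and it deliberately matches both *.gz and *.txt shards so that either file format is picked up:

```python
import glob


def find_shards(filepattern):
    """Collect shard files for a training prefix.

    A sketch of the discussed change (find_shards is a hypothetical
    helper, not part of bilm/data.py): match both gzipped and plain
    text shards instead of '*.gz' only.
    """
    shards = glob.glob(filepattern + '*.gz') + glob.glob(filepattern + '*.txt')
    if not shards:
        # Mirrors the "Found 0 shards" situation from the log above.
        raise FileNotFoundError('Found 0 shards at %s' % filepattern)
    return sorted(shards)
```

Matching the two extensions explicitly (rather than a bare `'*'`) avoids picking up stray non-text files in the training directory.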
I have changed my data to gzip, encountered some reading errors that I could not figure out, and ended up using the original library to train the model on .txt files.
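One common source of reading errors with gzipped shards is opening them in binary mode and then mixing bytes with str. A hedged sketch of opening a shard transparently, whether gzipped or plain (`open_shard` is a hypothetical helper for illustration, not part of the library):

```python
import gzip
import io


def open_shard(path, encoding='utf-8'):
    """Open a shard as a text stream, gzipped or plain.

    open_shard is a hypothetical helper. gzip.open in 'rt' mode
    decompresses and decodes in one step, so downstream code always
    receives str lines regardless of the on-disk format.
    """
    if path.endswith('.gz'):
        return gzip.open(path, mode='rt', encoding=encoding)
    return io.open(path, mode='r', encoding=encoding)
```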
> We usually train on compressed text files (this is faster and more convenient), which is why the code looks for *.gz files only.
Hi Andrey,
Do you have a benchmark somewhere on the speed difference between plain txt and gzipped? I can't seem to reproduce training speeds like the ones you report (despite using A100s), I'm currently wondering if the input format can have that big of an impact. Thanks!
The training speed is influenced by many factors, including batch size, the number of GPUs used, vocabulary size, LSTM layer dimensionality, etc. Whether the input files are compressed or not, is probably the least important of them. In our experience, one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs, batch size 192, vocabulary size around 100 000 and the LSTM dimensionality of 2048. Is what you are observing significantly different?
Thanks for replying so quickly! I had misread slide 17 here (https://www.uio.no/studier/emner/matnat/ifi/IN5550/v20/slides/11_contextualized_print.pdf) and did not see that the 24h run time you mentioned was for 1 epoch, not the default 3.
> Is what you are observing significantly different?
I'm not sure, but I'll pull statistics once the models are done training. I'm using the same parameters as you, except double the batch size because I am using 2 A100 with 40GB each.
Hi @akutuzov !
Here are the stats. Let me know if you want me to put them somewhere else/better format them. Perhaps we could have some "expected training time" in the README?
For the record, the models are trained on Alvis, Phase 1c.
Hardware:
Software:
Parameters:
python bilm/train_elmo.py --train_prefix corpus/ --size 1015635151 --vocab_file data/vocab.txt
where:
- vocab.txt has 10,003 words
- corpus/ contains 134 files, each containing 500k sentences
Run times and performance:
Logs are attached. It is now obvious I can increase the batch size considerably, given the very low memory usage -- it was hard to estimate because nvidia-smi almost always reports 100% utilization.
If you have other ideas I'm interested!
Hi @faustusdotbe
I actually have some questions and comments about that, but could you start a new issue or pull request?
This (closed) issue about file formats is not the best place for such discussions :)
I'm trying to run the code in Google Colab. I keep getting the error
Found 0 shards at /content/cloud/train/
(/content/cloud/train/ is my training data folder). I have a large number of text files in that directory, named like train001.txt.
How should I pass the prefix argument correctly: /content/cloud/train/*, /content/cloud/train/, or /content/cloud/train?
In the original implementation from https://github.com/allenai/bilm-tf the format '/content/cloud/train/*' works, but here the same format gives the error: unrecognized arguments: /content/cloud/train/train1000801.txt ..