ltgoslo / simple_elmo_training

Minimal code to train ELMo models in recent versions of TensorFlow
Apache License 2.0

Found 0 shards at --prefix #3

Closed by Kvit 3 years ago

Kvit commented 3 years ago

I'm trying to run the code in Google Colab and I keep getting the error Found 0 shards at /content/cloud/train/ (/content/cloud/train/ is my training data folder). The directory contains a large number of text files named like train001.txt.

How should I pass the --prefix argument correctly: /content/cloud/train/*, /content/cloud/train/, or /content/cloud/train?

In the original implementation from https://github.com/allenai/bilm-tf the format '/content/cloud/train/*' works, but here the same format gives the error unrecognized arguments: /content/cloud/train/train1000801.txt ..
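
For reference, one quick way to debug this from a Colab cell is to run the same kind of glob call the data loader relies on and see what each candidate value of --prefix actually matches (a hypothetical check using only the standard library, not code from the repository):

```python
# Hypothetical debugging snippet: compare what the different prefix values match.
# glob is the mechanism bilm/data.py uses to collect shards, so an empty result
# here is consistent with the "Found 0 shards" message.
import glob

for candidate in ("/content/cloud/train/*",
                  "/content/cloud/train/",
                  "/content/cloud/train"):
    print(candidate, "->", len(glob.glob(candidate)), "matches")
```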

Kvit commented 3 years ago

Here is output from Colab Cell

! python3 /content/simple_elmo_training/bilm/train_elmo.py --train_prefix /content/cloud/train/ --size $SIZE --vocab_file $VOCAB --save_dir $OUT

Note: I have added some code to train_elmo.py to print the prefix directory and a sample of the file list; you can see this debug output in lines 2 and 3 below.

2021-04-02 17:07:53.587102: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
added: THIS IS PREFIX FOR TRAINING DATA  /content/cloud/train/
>> os.listdir(prefix)[0:10]: ['train1336486.txt', 'train687356.txt', 'train6216776.txt', 'train777706.txt', 'train1681206.txt', 'train576156.txt', 'train1687461.txt', 'train1402511.txt', 'train5291036.txt', 'train4666231.txt']
Found 0 shards at /content/cloud/train/
Traceback (most recent call last):
  File "/content/simple_elmo_training/bilm/train_elmo.py", line 86, in <module>
    main(arguments)
  File "/content/simple_elmo_training/bilm/train_elmo.py", line 62, in main
    data = BidirectionalLMDataset(prefix, vocab, test=False, shuffle_on_load=True)
  File "/content/simple_elmo_training/bilm/data.py", line 449, in __init__
    shuffle_on_load=shuffle_on_load)
  File "/content/simple_elmo_training/bilm/data.py", line 349, in __init__
    self._ids = self._load_random_shard()
  File "/content/simple_elmo_training/bilm/data.py", line 370, in _load_random_shard
    shard_name = self._choose_random_shard()
  File "/content/simple_elmo_training/bilm/data.py", line 355, in _choose_random_shard
    shard_name = self._shards_to_choose.pop()
IndexError: pop from empty list
Kvit commented 3 years ago

Is the problem that the script expects only gzipped files? https://github.com/ltgoslo/simple_elmo_training/blob/afbdeefe85c81f0ce69b5c0da33cdcdcf8c0a7f9/bilm/data.py#L340

Kvit commented 3 years ago

I can confirm that changing line 340 above to self._all_shards = glob.glob(filepattern + '/*.*') solves the problem of not finding files, but it creates another type-related problem further down in the code.

akutuzov commented 3 years ago

We usually train on compressed text files (this is faster and more convenient), which is why the code looks for *.gz files only. But it probably makes sense to also look for *.txt files in addition. Can you create a pull request with this change?

akutuzov commented 3 years ago

It can be something like self._all_shards = glob.glob(filepattern + '*'). What other problems have you encountered with this fix?
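
For illustration, a change along those lines could look roughly like the sketch below, with the reading side adjusted as well so that gzipped and plain-text shards both come back as unicode strings. The function names are assumptions based on the traceback above, not the repository's exact code:

```python
import glob
import gzip
import io

def _find_shards(filepattern):
    # Match any shard under the prefix (e.g. *.gz and *.txt) instead of
    # gzipped files only; essentially the change suggested above.
    return glob.glob(filepattern + '*')

def _read_shard(shard_name):
    # Pick the opener by extension so both kinds of shard decode to str,
    # which should avoid the type problem mentioned earlier in the thread.
    if shard_name.endswith('.gz'):
        f = gzip.open(shard_name, 'rt', encoding='utf-8')
    else:
        f = io.open(shard_name, 'r', encoding='utf-8')
    with f:
        return f.readlines()
```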

Kvit commented 3 years ago

I have changed my data to gzip, encountered some reading errors that I could not figure out, and ended up using the original library to train the model on .txt files.
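
For anyone else trying the gzip route, one straightforward way to compress the shards beforehand is sketched below (hypothetical paths, not part of the repository); writing in text mode with an explicit UTF-8 encoding is one way to reduce the chance of bytes-versus-str reading errors later on:

```python
import glob
import gzip

# Compress each plain-text shard into a UTF-8 text-mode gzip file;
# "wt" mode keeps the content as unicode strings end to end.
for txt_path in glob.glob("/content/cloud/train/*.txt"):
    with open(txt_path, "r", encoding="utf-8") as src, \
         gzip.open(txt_path + ".gz", "wt", encoding="utf-8") as dst:
        dst.write(src.read())
```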

drvenabili commented 3 years ago

> We usually train on compressed text files (this is faster and more convenient), which is why the code looks for *.gz files only.

Hi Andrey,

Do you have a benchmark somewhere on the speed difference between plain txt and gzipped input? I can't seem to reproduce training speeds like the ones you report (despite using A100s), and I'm currently wondering whether the input format can have that big an impact. Thanks!

akutuzov commented 3 years ago

The training speed is influenced by many factors, including batch size, the number of GPUs used, vocabulary size, LSTM layer dimensionality, etc. Whether the input files are compressed or not is probably the least important of them. In our experience, one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs, a batch size of 192, a vocabulary size of around 100 000, and an LSTM dimensionality of 2048. Is what you are observing significantly different?
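
For orientation, these knobs correspond roughly to the following fields of a bilm-tf-style options dictionary. The key names follow allenai/bilm-tf's train_elmo.py, and the values simply echo the setup described above; treat this as an illustrative sketch rather than this repository's defaults:

```python
# Illustrative hyperparameter block in the style of bilm-tf's train_elmo.py;
# values mirror the setup described in the comment above.
options = {
    "bidirectional": True,
    "dropout": 0.1,
    "lstm": {
        "dim": 2048,            # LSTM dimensionality mentioned above
        "projection_dim": 512,  # illustrative value
        "n_layers": 2,
        "use_skip_connections": True,
    },
    "batch_size": 192,          # batch size mentioned above
    "n_tokens_vocab": 100000,   # vocabulary size of around 100 000
    "n_epochs": 1,              # illustrative number of passes over the data
    "unroll_steps": 20,
    "n_negative_samples_batch": 8192,
}
```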

drvenabili commented 3 years ago

Thanks for replying so quickly! I had misread slide 17 here (https://www.uio.no/studier/emner/matnat/ifi/IN5550/v20/slides/11_contextualized_print.pdf) and did not see that the 24h run time you mentioned was for 1 epoch, not for the default 3 epochs.

> Is what you are observing significantly different?

I'm not sure, but I'll pull the statistics once the models are done training. I'm using the same parameters as you, except for double the batch size, because I am using 2 A100s with 40 GB each.

drvenabili commented 3 years ago

Hi @akutuzov !

Here are the stats. Let me know if you want me to put them somewhere else or format them better. Perhaps we could add some "expected training time" numbers to the README?

For the record, the models were trained on Alvis, Phase 1c.

Hardware:

Software:

Parameters:

  1. python bilm/train_elmo.py --train_prefix corpus/ --size 1015635151 --vocab_file data/vocab.txt, where
    1. vocab.txt has 10,003 words
    2. corpus/ contains 134 files, each containing 500k sentences
  2. batch size = 384
  3. dim = 2048
  4. 3 epochs

Run times and performance:

Logs are attached. It is now obvious that I can increase the batch size considerably, given the very low memory usage; it was hard to estimate beforehand since nvidia-smi almost always reports 100% utilization.

If you have other ideas, I'm interested!

dcgm-gpu-stats-alvis2-21-jobid-67121.txt slurm-67121.txt

akutuzov commented 3 years ago

Hi @faustusdotbe

I actually have some questions and comments about this, but could you start a new issue or pull request?

This (closed) issue about file formats is not the best place for such discussions :)