nedap / deidentify

A Python library to de-identify medical records with state-of-the-art NLP methods.
MIT License

Question about hardware required for model training #14

Closed · schelv closed this issue 4 years ago

schelv commented 4 years ago

Hi,

Nice paper and nice code!

I want to try the code out on another dataset, but I'm running into some RAM-related issues (i.e., not enough). Can you provide some information about the system that you used to train the models?

Information such as: the type of GPU, the amount of RAM, and how long the training took.

jantrienes commented 4 years ago

Hi @schelv,

Good questions! Could you please provide a few more details about your dataset and setup?

There were some CUDA out-of-memory issues starting from flair >= 0.4.4. This issue provides some good suggestions: https://github.com/flairNLP/flair/issues/1270.

I provide the details of our system below. Please let me know should you need more info.


This is the hardware we used to train the BiLSTM-CRF models:

GPU: GeForce RTX 2080 Ti
GPU Memory: 10989 MiB
System Memory: 377 GB
CPU: Intel Xeon Gold 6126 (48) @ 3.7GHz

And software/driver versions (more in environment.yml):

CUDA: 10.0.130
PyTorch: 1.3.0
Flair: 0.4.3
Python: 3.7.2
cudatoolkit: 10.0.130

As for the resource consumption during training: I don't have any numbers on the system's RAM usage. The 10 GB of GPU RAM were sufficient to train model_bilstmcrf_ons_fast-v0.1.0 and model_bilstmcrf_ons_large-v0.1.0.

Training duration:

model_bilstmcrf_ons_fast-v0.1.0: 3 hours, 47 minutes
model_bilstmcrf_ons_large-v0.1.0: 7 hours, 51 minutes

You can download the model release files here: https://github.com/nedap/deidentify/releases. The model archives contain *.log files with timings and information about the model architectures.

Here's the architecture for the "large" model:

SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('nl')
    (list_embedding_1): PooledFlairEmbeddings(
      (context_embeddings): FlairEmbeddings(
        (lm): LanguageModel(
          (drop): Dropout(p=0.1)
          (encoder): Embedding(7632, 100)
          (rnn): LSTM(100, 2048)
          (decoder): Linear(in_features=2048, out_features=7632, bias=True)
        )
      )
    )
    (list_embedding_2): PooledFlairEmbeddings(
      (context_embeddings): FlairEmbeddings(
        (lm): LanguageModel(
          (drop): Dropout(p=0.1)
          (encoder): Embedding(7632, 100)
          (rnn): LSTM(100, 2048)
          (decoder): Linear(in_features=2048, out_features=7632, bias=True)
        )
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=8492, out_features=8492, bias=True)
  (rnn): LSTM(8492, 256, bidirectional=True)
  (linear): Linear(in_features=512, out_features=33, bias=True)
)

Parameters as reported in the paper (a sketch of the corresponding trainer call follows the list):

learning_rate: "0.1"
mini_batch_size: "32"
patience: "3"
anneal_factor: "0.5"
max_epochs: "150"
shuffle: "True"
train_with_dev: "True"
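
For reference, here is a minimal sketch of how these hyperparameters map onto a Flair 0.4.3 training run. This is not the project's actual training script (that is deidentify/methods/bilstmcrf/run_bilstmcrf.py); the corpus path, column format, and output directory are placeholder assumptions, and the embedding stack mirrors the architecture dump above.

from flair.datasets import ColumnCorpus
from flair.embeddings import PooledFlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical corpus in CoNLL-style column format; adjust to your data.
corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'})
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# Same stack as in the printed architecture: Dutch word embeddings plus
# pooled forward/backward contextual string embeddings.
embeddings = StackedEmbeddings([
    WordEmbeddings('nl'),
    PooledFlairEmbeddings('nl-forward'),
    PooledFlairEmbeddings('nl-backward'),
])

tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type='ner',
                        use_crf=True)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    'output/',            # placeholder output directory
    learning_rate=0.1,
    mini_batch_size=32,
    patience=3,
    anneal_factor=0.5,
    max_epochs=150,
    shuffle=True,
    train_with_dev=True,
)
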
schelv commented 4 years ago

Hi, thanks for the clear and fast response.

The size of my dataset looks pretty similar to the one that you used.

2020-02-28 14:11:34.889 | INFO     | __main__:main:103 - Loaded corpus: Corpus. Number of Documents (train/dev/test): 516/172/173
2020-02-28 14:11:34.889 | INFO     | __main__:main:107 - Get sentences...
2020-02-28 14:14:19.705 | INFO     | __main__:main:118 - FilteredCorpus(): train = 35323, dev = 12093, test = 10352. Ignored train/dev/test = 0/0/0

I installed using conda and the environment.yml file, so most of the installed packages are the same, except for PyTorch 1.3.1 instead of 1.3.0.

I'm also using a Dutch dataset, so I've made small modifications to the code to use Dutch word embeddings.

I tried to follow the training instructions as closely as possible and trained the model with the --pooled_contextual_embeddings option. I did not use the --train_with_dev option. This produces almost the same model architecture as the "large" model.

Only these lines are different: (drop): Dropout(p=0.1, inplace=False) and (linear): Linear(in_features=512, out_features=16, bias=True). I'm guessing the last one depends on the tag dictionary size?

The error is produced when evaluating the dev set. This happens because I do not use the --train_with_dev option.

I would like to look at the performance of the model on the dev set, so I do not want to use it for training at the moment.

The "fast" model looks a lot less heavy on the memory usage. So that's what I'm gonna try next. How is this model trained differently from the "large" model? (what arguments are used?) Is omitting the --pooled_contextual_embeddings option enough? I cannot find this in the paper or the github documentation?

Thanks for the help!

This is the stack trace of the error:

2020-02-28 18:19:49,253 EPOCH 1 done: loss 1.8945 - lr 0.1000
Traceback (most recent call last):
  File "deidentify/methods/bilstmcrf/run_bilstmcrf.py", line 176, in <module>
    main(ARGS)
  File "deidentify/methods/bilstmcrf/run_bilstmcrf.py", line 132, in main
    train_with_dev=args.train_with_dev)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/trainers/trainer.py", line 344, in train
    embeddings_storage_mode=embeddings_storage_mode,
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 268, in evaluate
    features = self.forward(batch)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 405, in forward
    self.embeddings.embed(sentences)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/embeddings.py", line 149, in embed
    embedding.embed(sentences)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/embeddings.py", line 81, in embed
    self._add_embeddings_internal(sentences)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/embeddings.py", line 1917, in _add_embeddings_internal
    self.context_embeddings.embed(sentences)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/embeddings.py", line 81, in embed
    self._add_embeddings_internal(sentences)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/embeddings.py", line 1822, in _add_embeddings_internal
    sentences_padded, self.chars_per_chunk
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/models/language_model.py", line 131, in get_representation
    prediction, rnn_output, hidden = self.forward(batch, hidden)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/flair/models/language_model.py", line 78, in forward
    output, hidden = self.rnn(emb, hidden)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 564, in forward
    return self.forward_tensor(input, hx)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 543, in forward_tensor
    output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
  File "/home/user/miniconda3/envs/deidentify/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 526, in forward_impl
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 125304832 bytes. Error code 12 (Cannot allocate memory)
jantrienes commented 4 years ago

I'm also using a dutch dataset, so I've made small modifications in the code to use dutch word embeddings.

We could refactor the BiLSTM-CRF script to be more flexible. Currently, the embeddings are picked by corpus name, with Dutch embeddings only being used for the ons corpus. See:

https://github.com/nedap/deidentify/blob/e48de73c26f32d6d69dc2244530b44a84058500e/deidentify/methods/bilstmcrf/run_bilstmcrf.py#L43-L49
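
In the meantime, if you want the Dutch stack for a corpus under a different name, you could build it yourself. This is a hypothetical sketch (not the actual code at the permalink above); pooled mirrors the --pooled_contextual_embeddings behaviour:

from flair.embeddings import (FlairEmbeddings, PooledFlairEmbeddings,
                              StackedEmbeddings, WordEmbeddings)

def dutch_embeddings(pooled: bool = True) -> StackedEmbeddings:
    # Dutch word embeddings plus forward/backward contextual string
    # embeddings, with or without pooling.
    contextual = PooledFlairEmbeddings if pooled else FlairEmbeddings
    return StackedEmbeddings([
        WordEmbeddings('nl'),
        contextual('nl-forward'),
        contextual('nl-backward'),
    ])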

Only these lines are different: (drop): Dropout(p=0.1, inplace=False) and (linear): Linear(in_features=512, out_features=16, bias=True). I'm guessing the last one is depending on the tag dictionary size?

Indeed, the out_features=16 refers to the size of your tag set (with B/I/O labels).
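
A quick way to verify this against your own data (a sketch that assumes a loaded Flair Corpus object named corpus):

tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(len(tag_dictionary))         # equals the linear layer's out_features
print(tag_dictionary.get_items())  # 'O', the B-/I- tags, and <START>/<STOP> markers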

The error is produced when evaluation the dev set. This happens since I do not use the --train_with_dev option.

I trained the models before with evaluation on the dev set, and it worked fine. So I doubt this step causes any issues.

The "fast" model looks a lot less heavy on the memory usage. So that's what I'm gonna try next. How is this model trained differently from the "large" model? (what arguments are used?) Is omitting the --pooled_contextual_embeddings option enough? I cannot find this in the paper or the github documentation?

For the "fast" model, we trained our own FlairEmbeddings, not the ones included in Flair. It uses a language model with a hidden layer size of 1024 instead of 2048 (default in Flair) and was trained on a Dutch corpus with 690m tokens (Wikipedia + OPUS). This significantly reduced the model size. Some notes I had on the benchmark:

Model                                        Parameters    F1      Precision  Recall
Pooled Flair Embeddings                      158,088,510   0.8999  0.9259     0.8754
Flair Embeddings (no pooling)                 96,906,558   0.8918  0.9133     0.8713
Flair Embeddings (no pooling, dutch-fast)     20,713,500   0.8878  0.9083     0.8683
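
As promised above, here is a minimal sketch of training such a smaller character language model with Flair's LanguageModelTrainer. The corpus folder, output path, and epoch count are placeholder assumptions; only the hidden_size=1024 choice comes from this thread.

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# Flair's default character dictionary; 'corpus/' is a hypothetical folder
# with train/ splits plus valid.txt and test.txt.
dictionary = Dictionary.load('chars')
corpus = TextCorpus('corpus/', dictionary, True, character_level=True)  # forward LM

# hidden_size=1024 instead of Flair's default 2048 shrinks the model considerably.
language_model = LanguageModel(dictionary, is_forward_lm=True,
                               hidden_size=1024, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('lm-forward/', sequence_length=250, mini_batch_size=100,
              max_epochs=10)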

The "Flair Embeddings (no pooling)" model was trained by omitting the --pooled_contextual_embeddings flag. To train the "Flair Embeddings (no pooling, dutch-fast)" model, we used the --contextual_backward_path and --contextual_forward_path parameters to pass in the custom embeddings.

If you are interested, I can share the embeddings with you. We might also contribute them to Flair at some point.

This is the stack trace of the error:

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 125304832 bytes. Error code 12 (Cannot allocate memory)

The error suggests that you are out of system RAM (not GPU memory). Could you check that? It happens during forward propagation in the second batch, in particular during the embedding lookup. So not using pooled embeddings (i.e., omitting --pooled_contextual_embeddings) might very well help!

Another option could be the embeddings_storage_mode parameter of the flair.trainers.ModelTrainer: https://github.com/flairNLP/flair/pull/891
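
As a sketch, reusing the ModelTrainer from the example earlier in this thread: 'none' recomputes embeddings on every pass instead of caching them in memory, trading training speed for a much smaller RAM footprint.

trainer.train(
    'output/',
    learning_rate=0.1,
    mini_batch_size=32,
    embeddings_storage_mode='none',  # alternatives: 'cpu' or 'gpu'
)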

Curious to hear what your findings are.

schelv commented 4 years ago

I'm definitely interested in the embeddings!

Reading back my initial questions, it kind of seems like I'm training on a GPU. The error is definitely due to running out of RAM, because I'm training on the CPU.

The GPU question was related to the training time, not the error. I could have made that clearer. =]

I will check whether not using the pooled embeddings helps avoid running out of RAM, and will let you know.

jantrienes commented 4 years ago

I'm definitely interested in the embeddings!

If you could send me an email (see readme), I will share the embeddings 👍

schelv commented 4 years ago

Not using --pooled_contextual_embeddings solved the problem. The trained network works great! Thanks for everything.