meiwang-ut opened this issue 6 years ago
This is correct: the entire dataset is moved to the GPU.
I think this is reasonable because recommender datasets are normally quite small: as you say yourself, the largest Movielens dataset occupies only ~1 gigabyte of GPU memory. In return, we get better performance by avoiding the need to move data to the GPU on every minibatch.
If your data is very large, you can split it into smaller parts and feed them to the fit method one at a time, in effect doing the batching yourself outside of the main model.
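A minimal sketch of that chunking approach, assuming Spotlight's sequence API (get_movielens_dataset, Interactions, to_sequence, ImplicitSequenceModel) and assuming that repeated fit calls continue training the same parameters rather than re-initialising the model; the chunk count and hyperparameters are illustrative only:

```python
import numpy as np

from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.interactions import Interactions
from spotlight.sequence.implicit import ImplicitSequenceModel

dataset = get_movielens_dataset(variant='20M')

model = ImplicitSequenceModel(embedding_dim=32,
                              batch_size=16,
                              n_iter=1,
                              use_cuda=True)

# Split the users into chunks so that only one chunk's sequences need to be
# on the GPU at any one time; 10 chunks is an arbitrary illustrative choice.
n_chunks = 10
user_chunks = np.array_split(np.unique(dataset.user_ids), n_chunks)

for epoch in range(5):
    for chunk in user_chunks:
        mask = np.isin(dataset.user_ids, chunk)
        chunk_interactions = Interactions(
            dataset.user_ids[mask],
            dataset.item_ids[mask],
            timestamps=dataset.timestamps[mask],
            num_users=dataset.num_users,
            num_items=dataset.num_items,
        )
        # Assumes calling fit again keeps training the existing parameters
        # instead of re-initialising the model.
        model.fit(chunk_interactions.to_sequence(max_sequence_length=200))
```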
I found that the GPU memory usage is much larger than the theoretical embedding and model size. It looks as if the entire sequence dataset produced by the Interactions.to_sequence method is somehow being loaded into GPU memory, which does not seem reasonable. Has anyone else noticed this problem?
For example, run the demo script movielens_sequence.py with the largest dataset ('20M') to see the problem more clearly, with parameters embedding_dim = 32, batch_size = 16, max_sequence_length = 200, item_num = 26,745.
Actual memory usage on a single GPU:
Epoch 1: 789 MB
Epoch 2: 1021 MB
Every later epoch: 1021 MB
So, theoretically, the memory usage of the embeddings should be: item_num × embedding_dim × 4 B = 26,745 × 32 × 4 B = 3,423,360 B ≈ 3.4 MB. The memory usage of the model layers (LSTM/GRU) is hard to calculate precisely, but it should not be that large. After all, one batch of training data is only: batch_size × embedding_dim × max_sequence_length × 4 B = 16 × 32 × 200 × 4 B = 409,600 B ≈ 409.6 KB. However, the size of the training sequence data is: item_num × embedding_dim × max_sequence_length × 4 B = 26,745 × 32 × 200 × 4 B = 684,672,000 B ≈ 684 MB.
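For reference, the same back-of-the-envelope arithmetic written out as a quick sanity check (assuming 4 bytes per float32 element, numbers taken directly from the parameters above):

```python
# Reproducing the rough memory estimates from the text (4 bytes per float32).
item_num = 26_745
embedding_dim = 32
batch_size = 16
max_sequence_length = 200

embedding_table = item_num * embedding_dim * 4                          # ~3.4 MB
one_batch = batch_size * embedding_dim * max_sequence_length * 4        # ~410 KB
all_sequences_embedded = item_num * embedding_dim * max_sequence_length * 4  # ~685 MB

print(embedding_table, one_batch, all_sequences_embedded)
# 3423360 409600 684672000
```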
Let's look at the problem with another set of parameters: embedding_dim = 256, batch_size = 256, max_sequence_length = 200, item_num = 26,745.
Actual memory usage on a single GPU:
Epoch 1: 869 MB
Epoch 2: 1101 MB
Every later epoch: 1101 MB
We can see that even though embedding_dim increases 8× and batch_size increases 16×, the total memory usage only grows by about 80 MB. This suggests that the dataset itself was probably loaded into GPU memory.
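One way to verify where the memory is going, assuming a CUDA build of PyTorch, is to snapshot the allocator statistics around the fit call (the model and sequences names below stand for whatever Spotlight model and sequence data you are fitting; the exact byte counts reported depend on the PyTorch version):

```python
import torch

def report(tag):
    # memory_allocated counts bytes in tensors currently held by PyTorch;
    # max_memory_allocated is the peak allocation observed so far.
    print('{}: allocated {:.1f} MB, peak {:.1f} MB'.format(
        tag,
        torch.cuda.memory_allocated() / 1024 ** 2,
        torch.cuda.max_memory_allocated() / 1024 ** 2))

report('before fit')
model.fit(sequences)  # `model` / `sequences`: your Spotlight model and sequence data
report('after fit')
```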
The second issue is that memory usage increases considerably after the first training epoch, as shown above.
Because of these two problems, my own data cannot be trained on the GPU: training fails with an 'OUT OF MEMORY' error. Do you have any idea what the problem is and how to fix it?
Thanks a lot.