pmixer / SASRec.pytorch

PyTorch (1.6+) implementation of https://github.com/kang205/SASRec
Apache License 2.0

Possible memory leak during iteration for a large number of users (10+ million)? #25

Open · lightsailpro opened this issue 2 years ago

lightsailpro commented 2 years ago

I am testing with a large dataset with 10+ million users. I have 64GB of RAM, and the dataset fits in RAM initially. But as training progresses, e.g. during the epoch 1 iterations, RAM consumption keeps increasing until eventually all of it is consumed. Is this expected behavior for a large dataset, or is there possibly a memory leak somewhere in the iteration steps? I tried a smaller max sequence length (20) and a smaller batch size (64), but the observation is the same: RAM consumption keeps increasing during training. Thanks in advance.

pmixer commented 2 years ago

hi @lightsailpro, thx for the feedback. That's quite a large dataset; sorry, we did not focus on CPU-side memory usage before. Based on your description, the queue-based sampler in https://github.com/pmixer/SASRec.pytorch/blob/master/utils.py very likely consumes RAM incrementally, as we never looked closely into the RAM usage of its sub-processes. Personally, I'd recommend adjusting some of the sampler's parameters to try to get the experiment to finish. Also, if you are interested, please feel free to improve the sampler's RAM usage; it would benefit users of this repo and of the original official work https://github.com/kang205/SASRec
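
As a rough illustration (the parameter and variable names follow the WarpSampler in utils.py and main.py at the time of writing, so treat them as assumptions), capping the number of worker processes and shrinking the per-sample buffers bounds how much host RAM the prefetch queue and its sub-processes can hold:

```python
# Sketch: bound the sampler's host-RAM footprint. Assumes the WarpSampler
# signature in utils.py (batch_size, maxlen, n_workers); adjust if it changed.
from utils import WarpSampler

sampler = WarpSampler(
    user_train, usernum, itemnum,  # training split loaded earlier in main.py
    batch_size=64,   # smaller batches -> smaller numpy arrays sitting in the queue
    maxlen=20,       # shorter sequences -> smaller per-sample buffers
    n_workers=1,     # each worker process holds its own copy of user_train,
)                    # so with 10M+ users every extra worker multiplies baseline RAM
```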

lightsailpro commented 2 years ago

@pmixer Thanks for your response. The training is actually done on a V100 GPU with 16GB of memory, and GPU RAM is fine even with a larger batch size. The issue is that CPU (host) RAM consumption keeps increasing as training progresses from epoch to epoch.

pmixer commented 2 years ago

@lightsailpro Maybe try the original repo https://github.com/kang205/SASRec; TensorFlow may have better support than PyTorch when the CPU is used for training and inference.

NicholasLea commented 2 years ago

@lightsailpro @pmixer I think the reason is that "[train, valid, test, usernum, itemnum] = copy.deepcopy(dataset)" causes more and more RAM use. I have checked it closely and think we can remove the copy.deepcopy. It is there to avoid modifying train, valid, and test, but the existing processing, including reversed(train[u]), does not change them. I also tested removing it and it works. If you have more findings, please comment. Thanks.
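
For reference, here is a minimal sketch of the suggested change inside evaluate (the surrounding code is paraphrased, not copied from utils.py):

```python
def evaluate(model, dataset, args):
    # original line: allocates a full copy of the 10M+ user dicts on every call
    # [train, valid, test, usernum, itemnum] = copy.deepcopy(dataset)

    # suggested change: unpack directly; the evaluation loop only reads the
    # splits (e.g. via reversed(train[u])) and never mutates them in place
    [train, valid, test, usernum, itemnum] = dataset

    # ... rest of the evaluation loop stays unchanged ...
```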

lightsailpro commented 2 years ago

@NicholasLea: Sorry for the delayed response, and thanks for the help! I assume the copy.deepcopy you mentioned is in the evaluate code. To clarify, the host RAM leak / growth is observed during the epoch 1 iterations, before the evaluate code is ever called (it only runs when epoch % 20 == 0). In my case, at epoch 1 iteration 0, right after training started, main.py consumed about 36GB of host RAM; by iteration 7750 the consumption had already jumped to 50GB. So I was not even able to finish epoch 1 before the host ran out of RAM. GPU memory usage on the V100 is very stable, though (around 4GB out of 16GB). Any further help will be appreciated.

average sequence length: 34.88
loss in epoch 1 iteration 0 / 47471: 1.38626229763031 (host RAM consumption: 36GB of 64GB)
.....
loss in epoch 1 iteration 7750 / 47471: 0.3159805238246918 (host RAM consumption: 50GB of 64GB, GPU RAM: 4GB of 16GB)

pmixer commented 2 years ago

@lightsailpro Sorry about that; it can be frustrating to try training on a larger dataset and run out of memory. Properly, this kind of issue requires digging into the details with a profiler, see https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/ . I suggested trying the original TF version of SASRec before; if that is not preferred, please try deleting, after each iteration, some of the variables I created, like the ones around https://github.com/pmixer/SASRec.pytorch/blob/4297d0950c6a6bffba92ecd4d1bf7204d07eb8c7/main.py#L96 . Just del whatever you think is no longer needed after each training iteration, see https://stackoverflow.com/questions/26545051/is-there-a-way-to-delete-created-variables-functions-etc-from-the-memory-of-th
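
As a sketch of what that del suggestion looks like in the training loop (the variable names loosely follow main.py, and the loss/optimizer are passed in by the caller here, so treat the details as assumptions):

```python
import gc

def train_one_epoch(model, sampler, optimizer, criterion, num_batch, collect_every=1000):
    """One epoch of the training loop with explicit cleanup of per-iteration buffers."""
    for step in range(num_batch):
        u, seq, pos, neg = sampler.next_batch()           # numpy arrays off the prefetch queue
        pos_logits, neg_logits = model(u, seq, pos, neg)  # forward pass (runs on the GPU)

        loss = criterion(pos_logits, neg_logits)          # BCE-style loss supplied by the caller
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # drop references so the host-side arrays/tensors can be reclaimed right away
        del u, seq, pos, neg, pos_logits, neg_logits, loss
        if step % collect_every == 0:
            gc.collect()  # occasionally force a collection to break reference cycles
```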

alan-ai-learner commented 1 year ago

Any luck here guys?

pmixer commented 1 year ago

I'm afraid not. All the sampling etc. code should be the same whether the CPU or the GPU is used, so the difference that causes a memory leak on CPU but not on GPU might be rooted in PyTorch's own CPU vs. GPU implementation differences. For that I recommend trying the PyTorch profiler, and switching to the TF version of SASRec if you keep using the CPU for training. If it were a memory leak on GPU, I might have more experience or expertise for debugging it.
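
For anyone who wants to chase this further, here is a minimal sketch of using the PyTorch profiler to attribute host-memory growth to individual ops; run_one_step is a hypothetical callable wrapping one training iteration (sampling + forward + backward) from main.py:

```python
from torch.profiler import profile, ProfilerActivity

def profile_a_few_steps(run_one_step, n_steps=20):
    # profile_memory=True tracks tensor allocations/frees, not just op runtimes
    with profile(
        activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA to profile the GPU side too
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        for _ in range(n_steps):
            run_one_step()
    # ops that keep allocating host memory float to the top of this table
    print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=20))
```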