[Open] sacdallago opened this issue 4 years ago
I've looked into the code and it's not apparent to me how that happens. Could you give me pipeline config with which that happened?
/mnt/nfs2/projects/bio_embeddings_extract_unsupervised/config.yml
In case you re-run this, please run this in a different directory (I'm actively working in that dir!). Notice that embedding will take a few hours, after that: you'll get stuck with the memory footprint of SeqVec on the GPU while the plot is being created.
It seems that this isn't really about the embedder, but about something that pytorch's own gc does.
With the embedder loaded, nvidia-smi reports 901 MB and torch.cuda.memory_allocated() reports 374404608 bytes. After dropping the model, torch.cuda.memory_allocated() reports 0, and torch.cuda.empty_cache() reduces the memory shown by nvidia-smi from 901 MB to 537 MB; however, active.all.allocated says 1061.
This looks like a memory leak in pytorch to me :/
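For reference, those counters can all be read from torch's allocator statistics; note that active.all.allocated in torch.cuda.memory_stats() counts allocation events (blocks), not bytes, which may explain the otherwise odd "1061". A minimal sketch (the helper name is mine, not part of any API):

```python
import torch

def report_cuda_memory(tag: str) -> dict:
    """Collect the allocator counters discussed above (hypothetical helper)."""
    if not torch.cuda.is_available():
        # No GPU: return zeroed counters so the sketch stays runnable on CPU
        return {"allocated": 0, "reserved": 0, "active.all.allocated": 0}
    stats = torch.cuda.memory_stats()
    counters = {
        "allocated": torch.cuda.memory_allocated(),   # bytes held by live tensors
        "reserved": torch.cuda.memory_reserved(),     # bytes held by the caching allocator
        "active.all.allocated": stats["active.all.allocated"],  # count of allocations, not bytes
    }
    print(tag, counters)
    return counters
```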
Oh wow...
There's a bunch of memory leak issues opened, some sound very close to what might be happening here: https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3Aopen+memory+leak
But with 4.8k issues opened... I wonder if it makes sense to add to the noise...
What really confused me is that this happens both with seqvec and with bert and I can't find any similar report in either of the repos.
Maybe something useful to debug that will come out of https://github.com/pytorch/pytorch/issues/42815.
If that becomes a real problem for users, I'd go the radical way and run the embed step in a subprocess.
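The subprocess idea could look like this: run the embed step in a child interpreter, so that all GPU memory is returned to the driver when the child exits, regardless of what the caching allocator does. This is only an illustration, not the pipeline's actual code; a real embed step would write its results to disk for the parent to read back:

```python
import subprocess
import sys

def run_step_in_subprocess(script: str) -> str:
    """Run a Python snippet in a fresh interpreter and return its stdout.
    When the child exits, any GPU memory it held is released."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Illustrative placeholder for an embed step; the real snippet would load the
# embedder inside the child (e.g. from bio_embeddings.embed) and save to a file.
EMBED_SNIPPET = 'print("embedding done")'
```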
Now this is starting to become terribly annoying.
Since I'm transferring the reduced embeddings to GPU RAM when doing the annotation transfer, this happens:
2020-09-01 16:51:23,022 INFO Created the file localization_transfer/bert_embeddings/ouput_parameters_file.yml
2020-09-01 16:51:23,038 INFO Created the stage directory localization_transfer/annotations_from_bert
2020-09-01 16:51:23,040 INFO Created the file localization_transfer/annotations_from_bert/input_parameters_file.yml
2020-09-01 16:51:23,048 INFO Created the file localization_transfer/annotations_from_bert/transferred_annotations_file.csv
2020-09-01 16:51:23,166 INFO Created the file localization_transfer/annotations_from_bert/input_reference_annotations_file.csv
2020-09-01 16:51:23,262 INFO Created the file localization_transfer/annotations_from_bert/input_reference_embeddings_file.h5
Traceback (most recent call last):
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/bin/bio_embeddings", line 8, in <module>
sys.exit(main())
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
parse_config_file_and_execute_run(arguments.config_path[0], overwrite=arguments.overwrite)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 202, in parse_config_file_and_execute_run
execute_pipeline_from_config(config, **kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 169, in execute_pipeline_from_config
stage_output_parameters = stage_runnable(**stage_parameters)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 347, in run
return PROTOCOLS[kwargs["protocol"]](**kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 150, in unsupervised
pairwise_distances = _pairwise_distance_matrix(
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 46, in _pairwise_distance_matrix
distances_squared = norms - 2 * sample_1.mm(sample_2.t())
RuntimeError: CUDA out of memory. Tried to allocate 24.64 GiB (GPU 0; 47.46 GiB total capacity; 25.51 GiB already allocated; 2.63 GiB free; 43.98 GiB reserved in total by PyTorch)
The problem here may be either the "25.51 GiB already allocated" or the "43.98 GiB reserved in total by PyTorch".
Workaround for now: extract, when this is optionable via the different open PRs on GitLab...

EDIT: It might actually be that I'm simply creating a new tensor (the product of another 20 GB tensor) which then doesn't fit in memory 🤔 not sure... Will do some testing.
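One way around the huge intermediate in _pairwise_distance_matrix would be to compute the distance matrix in row chunks, so the full n×m product never has to live on the GPU at once. A sketch using torch.cdist, not the pipeline's actual code; chunk_size would need tuning to the available memory:

```python
import torch

def pairwise_distances_chunked(sample_1, sample_2, chunk_size=1024):
    """Euclidean distance matrix computed chunk by chunk; peak memory is
    roughly chunk_size * sample_2.shape[0] floats instead of the full matrix."""
    blocks = []
    for start in range(0, sample_1.shape[0], chunk_size):
        block = sample_1[start:start + chunk_size]
        blocks.append(torch.cdist(block, sample_2))
    return torch.cat(blocks, dim=0)
```

If even the final matrix doesn't fit on the GPU, each chunk could additionally be moved to the CPU before concatenation.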
[UPDATE to https://github.com/sacdallago/bio_embeddings/issues/40#issuecomment-685429949]
Running just the transfer part (config with only that stage, passing the embeddings from the previous run) still throws an OOM error:
2020-09-02 10:54:02,015 INFO Created the file localization_transfer/input_parameters_file.yml
2020-09-02 10:54:05,033 INFO Created the file localization_transfer/sequences_file.fasta
2020-09-02 10:54:07,230 INFO Created the file localization_transfer/mapping_file.csv
2020-09-02 10:54:07,259 INFO Created the file localization_transfer/remapped_sequences_file.fasta
2020-09-02 10:54:09,394 INFO Stage directory localization_transfer/annotations_from_bert already exists.
2020-09-02 10:54:09,411 INFO Created the file localization_transfer/annotations_from_bert/input_parameters_file.yml
2020-09-02 10:54:09,422 INFO Created the file localization_transfer/annotations_from_bert/transferred_annotations_file.csv
2020-09-02 10:54:09,542 INFO Created the file localization_transfer/annotations_from_bert/input_reference_annotations_file.csv
2020-09-02 10:54:09,697 INFO Created the file localization_transfer/annotations_from_bert/input_reference_embeddings_file.h5
Traceback (most recent call last):
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/bin/bio_embeddings", line 8, in <module>
sys.exit(main())
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
parse_config_file_and_execute_run(arguments.config_path[0], overwrite=arguments.overwrite)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 202, in parse_config_file_and_execute_run
execute_pipeline_from_config(config, **kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 169, in execute_pipeline_from_config
stage_output_parameters = stage_runnable(**stage_parameters)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 347, in run
return PROTOCOLS[kwargs["protocol"]](**kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 150, in unsupervised
pairwise_distances = _pairwise_distance_matrix(
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 46, in _pairwise_distance_matrix
distances_squared = norms - 2 * sample_1.mm(sample_2.t())
RuntimeError: CUDA out of memory. Tried to allocate 24.64 GiB (GPU 0; 47.46 GiB total capacity; 25.51 GiB already allocated; 21.10 GiB free; 25.51 GiB reserved in total by PyTorch)
But clearly, "25.51 GiB reserved in total by PyTorch" is different from "43.98 GiB reserved in total by PyTorch".
New statistics: a round trip with each embedder (instantiate, embed PROTEIN and SEQWENCE, del, gc.collect(), torch.cuda.empty_cache()) leaks 0.8 GB with torch 1.5. Notably, without empty_cache it's 4.8 GB, so some of the GPU memory might look blocked for the next stages when it's actually only held by torch's cache.
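That round trip can be scripted; the following is my paraphrase of the measurement, with hypothetical make_embedder/run callables standing in for the real embedders:

```python
import gc
import torch

def leaked_bytes(make_embedder, run):
    """Instantiate, run, delete, collect, empty the cache, and report how many
    bytes of GPU memory did not come back (the round trip described above)."""
    if not torch.cuda.is_available():
        return 0  # nothing to measure on CPU
    before = torch.cuda.memory_allocated()
    embedder = make_embedder()
    run(embedder)
    del embedder
    gc.collect()
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated() - before
```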
I've tried to get the size of the DeepBLAST model, but I keep getting contradictory results. The memory used according to nvidia-smi varies with the GPU and the torch version, from 740 MiB (Titan X, torch 1.5) to 1284 MiB (RTX 8000, torch 1.7).
After I delete the model, gc.collect(), empty the CUDA cache and check that the gc doesn't track any tensors anymore, I still see about 1 GB of memory being used. However, doing the load-model-delete-clear cycle multiple times shows the exact same numbers, so it's not a memory leak.
I feel that I must be missing something obvious, but all Google results only point to what I've already checked :confused:
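For completeness, the "gc doesn't track any tensors anymore" check can be done by walking gc.get_objects(); if this comes back empty and nvidia-smi still shows ~1 GB, the remainder is plausibly the CUDA context and kernel images rather than tensors. A sketch:

```python
import gc
import torch

def live_cuda_tensors():
    """Return (type, shape) for every CUDA tensor the garbage collector still
    tracks; an empty list means no Python-side tensor is pinning GPU memory."""
    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                found.append((type(obj).__name__, tuple(obj.shape)))
        except Exception:
            # some objects raise on attribute access during inspection
            continue
    return found
```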
@sacdallago What would be the recommended way to do embedding in a subprocess? I'm having this issue with UniRep (I'm using v0.2.0), where in Colab it runs out of memory after ~200 small peptide sequences, on both GPU and CPU.
I've even moved things over to a server (CPU only) with ~1 TB of RAM, and it has climbed its way up to using 70 GB now, so I'll probably stop it soon, though I'm not sure how far along it is.
I'm only having this issue with UniRep though by the way!
@tijeco Could you please open a new issue with details on how to reproduce this (ideally with the fasta file, if you can publish it)?
@konstin Will do!
I once had a situation in which the pipeline was in a visualize stage, but the GPU was still occupied by the embedder (SeqVec).
I had assumed that the embedder is destroyed after the embed stage (the stages are written in a way which should make Python's automatic garbage collection easy). But apparently I was wrong.
Maybe it makes sense to explicitly del embedder at the end of the embed stage.

It's worth looking into this. The visualize stage is sometimes slow (it can take up to 2 days for big plots)... Occupying GPU resources for no good reason is a waste in those cases. In the future (e.g. with extract), GPU RAM will be needed.
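The suggested explicit cleanup could look like this in a stage function; the names (embed_stage, make_embedder) are illustrative, not the pipeline's actual API:

```python
import gc
import torch

def embed_stage(sequences, make_embedder):
    """Sketch of the suggested fix: drop the embedder explicitly before the
    next (possibly long-running) stage starts."""
    embedder = make_embedder()
    embeddings = [embedder.embed(sequence) for sequence in sequences]
    del embedder                  # drop the last reference to the model
    gc.collect()                  # collect any reference cycles keeping it alive
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
    return embeddings
```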