[Open] sacdallago opened this issue 4 years ago
I've looked into the code and it's not apparent to me how that happens. Could you give me pipeline config with which that happened?
/mnt/nfs2/projects/bio_embeddings_extract_unsupervised/config.yml
In case you re-run this, please run this in a different directory (I'm actively working in that dir!). Notice that embedding will take a few hours, after that: you'll get stuck with the memory footprint of SeqVec on the GPU while the plot is being created.
It seems that this isn't really about the embedder, but about something that pytorch's own gc does.
With the embedder loaded, nvidia-smi reports 901 MB and torch.cuda.memory_allocated() reports 374404608 bytes. After dropping the model, torch.cuda.memory_allocated() reports 0, and torch.cuda.empty_cache() reduces the memory shown by nvidia-smi from 901 MB to 537 MB; however, active.all.allocated says 1061.
This looks like a memory leak in pytorch to me :/
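For reference, those counters can all be read from torch's allocator statistics; note that active.all.allocated in torch.cuda.memory_stats() counts allocation events (blocks), not bytes, which may explain the otherwise odd "1061". A minimal sketch (the helper name is mine, not part of any API):

```python
import torch

def report_cuda_memory(tag: str) -> dict:
    """Collect the allocator counters discussed above (hypothetical helper)."""
    if not torch.cuda.is_available():
        # No GPU: return zeroed counters so the sketch stays runnable on CPU
        return {"allocated": 0, "reserved": 0, "active.all.allocated": 0}
    stats = torch.cuda.memory_stats()
    counters = {
        "allocated": torch.cuda.memory_allocated(),   # bytes held by live tensors
        "reserved": torch.cuda.memory_reserved(),     # bytes held by the caching allocator
        "active.all.allocated": stats["active.all.allocated"],  # count of allocations, not bytes
    }
    print(tag, counters)
    return counters
```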
Oh wow...
There's a bunch of memory leak issues opened, some sound very close to what might be happening here: https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3Aopen+memory+leak
But with 4.8k issues opened... I wonder if it makes sense to add to the noise...
What really confused me is that this happens both with seqvec and with bert and I can't find any similar report in either of the repos.
Maybe something useful to debug that will come out of https://github.com/pytorch/pytorch/issues/42815.
If that becomes a real problem for users, I'd go the radical way and run the embed step in a subprocess.
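The subprocess idea could look like this: run the embed step in a child interpreter, so that all GPU memory is returned to the driver when the child exits, regardless of what the caching allocator does. This is only an illustration, not the pipeline's actual code; a real embed step would write its results to disk for the parent to read back:

```python
import subprocess
import sys

def run_step_in_subprocess(script: str) -> str:
    """Run a Python snippet in a fresh interpreter and return its stdout.
    When the child exits, any GPU memory it held is released."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Illustrative placeholder for an embed step; the real snippet would load the
# embedder inside the child (e.g. from bio_embeddings.embed) and save to a file.
EMBED_SNIPPET = 'print("embedding done")'
```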
Now this is starting to become terribly annoying.
Since I'm transferring the reduced embeddings to GPU RAM when doing the annotation transfer, this happens:
2020-09-01 16:51:23,022 INFO Created the file localization_transfer/bert_embeddings/ouput_parameters_file.yml
2020-09-01 16:51:23,038 INFO Created the stage directory localization_transfer/annotations_from_bert
2020-09-01 16:51:23,040 INFO Created the file localization_transfer/annotations_from_bert/input_parameters_file.yml
2020-09-01 16:51:23,048 INFO Created the file localization_transfer/annotations_from_bert/transferred_annotations_file.csv
2020-09-01 16:51:23,166 INFO Created the file localization_transfer/annotations_from_bert/input_reference_annotations_file.csv
2020-09-01 16:51:23,262 INFO Created the file localization_transfer/annotations_from_bert/input_reference_embeddings_file.h5
Traceback (most recent call last):
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/bin/bio_embeddings", line 8, in <module>
sys.exit(main())
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
parse_config_file_and_execute_run(arguments.config_path[0], overwrite=arguments.overwrite)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 202, in parse_config_file_and_execute_run
execute_pipeline_from_config(config, **kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 169, in execute_pipeline_from_config
stage_output_parameters = stage_runnable(**stage_parameters)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 347, in run
return PROTOCOLS[kwargs["protocol"]](**kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 150, in unsupervised
pairwise_distances = _pairwise_distance_matrix(
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 46, in _pairwise_distance_matrix
distances_squared = norms - 2 * sample_1.mm(sample_2.t())
RuntimeError: CUDA out of memory. Tried to allocate 24.64 GiB (GPU 0; 47.46 GiB total capacity; 25.51 GiB already allocated; 2.63 GiB free; 43.98 GiB reserved in total by PyTorch)
The problem here may be either the "25.51 GiB already allocated" or the "43.98 GiB reserved in total by PyTorch".
Workaround for now: extract, when this is optionable via the different open PRs on GitLab...

EDIT: It might actually be that I'm simply creating a new tensor (the product of another 20 GB tensor) which then doesn't fit in memory 🤔 not sure... Will do some testing.
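One way around the huge intermediate in _pairwise_distance_matrix would be to compute the distance matrix in row chunks, so the full n×m product never has to live on the GPU at once. A sketch using torch.cdist, not the pipeline's actual code; chunk_size would need tuning to the available memory:

```python
import torch

def pairwise_distances_chunked(sample_1, sample_2, chunk_size=1024):
    """Euclidean distance matrix computed chunk by chunk; peak memory is
    roughly chunk_size * sample_2.shape[0] floats instead of the full matrix."""
    blocks = []
    for start in range(0, sample_1.shape[0], chunk_size):
        block = sample_1[start:start + chunk_size]
        blocks.append(torch.cdist(block, sample_2))
    return torch.cat(blocks, dim=0)
```

If even the final matrix doesn't fit on the GPU, each chunk could additionally be moved to the CPU before concatenation.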
[UPDATE to https://github.com/sacdallago/bio_embeddings/issues/40#issuecomment-685429949]
Running just the transfer part (config with only that stage, passing the embeddings from the previous run) still throws an OOM error:
2020-09-02 10:54:02,015 INFO Created the file localization_transfer/input_parameters_file.yml
2020-09-02 10:54:05,033 INFO Created the file localization_transfer/sequences_file.fasta
2020-09-02 10:54:07,230 INFO Created the file localization_transfer/mapping_file.csv
2020-09-02 10:54:07,259 INFO Created the file localization_transfer/remapped_sequences_file.fasta
2020-09-02 10:54:09,394 INFO Stage directory localization_transfer/annotations_from_bert already exists.
2020-09-02 10:54:09,411 INFO Created the file localization_transfer/annotations_from_bert/input_parameters_file.yml
2020-09-02 10:54:09,422 INFO Created the file localization_transfer/annotations_from_bert/transferred_annotations_file.csv
2020-09-02 10:54:09,542 INFO Created the file localization_transfer/annotations_from_bert/input_reference_annotations_file.csv
2020-09-02 10:54:09,697 INFO Created the file localization_transfer/annotations_from_bert/input_reference_embeddings_file.h5
Traceback (most recent call last):
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/bin/bio_embeddings", line 8, in <module>
sys.exit(main())
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/cli.py", line 22, in main
parse_config_file_and_execute_run(arguments.config_path[0], overwrite=arguments.overwrite)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 202, in parse_config_file_and_execute_run
execute_pipeline_from_config(config, **kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 169, in execute_pipeline_from_config
stage_output_parameters = stage_runnable(**stage_parameters)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 347, in run
return PROTOCOLS[kwargs["protocol"]](**kwargs)
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 150, in unsupervised
pairwise_distances = _pairwise_distance_matrix(
File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/chris_experiments/lib/python3.8/site-packages/bio_embeddings/extract/pipeline.py", line 46, in _pairwise_distance_matrix
distances_squared = norms - 2 * sample_1.mm(sample_2.t())
RuntimeError: CUDA out of memory. Tried to allocate 24.64 GiB (GPU 0; 47.46 GiB total capacity; 25.51 GiB already allocated; 21.10 GiB free; 25.51 GiB reserved in total by PyTorch)
But clearly, "25.51 GiB reserved in total by PyTorch" is different from "43.98 GiB reserved in total by PyTorch".
New statistics: a round trip with each embedder (instantiate, embed PROTEIN and SEQWENCE, del, gc.collect(), torch.cuda.empty_cache()) leaks 0.8 GB with torch 1.5. Notably, without empty_cache it's 4.8 GB, so some of the GPU memory might look blocked for the next stages when it's actually only held by torch's cache.
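That round trip can be scripted; the following is my paraphrase of the measurement, with hypothetical make_embedder/run callables standing in for the real embedders:

```python
import gc
import torch

def leaked_bytes(make_embedder, run):
    """Instantiate, run, delete, collect, empty the cache, and report how many
    bytes of GPU memory did not come back (the round trip described above)."""
    if not torch.cuda.is_available():
        return 0  # nothing to measure on CPU
    before = torch.cuda.memory_allocated()
    embedder = make_embedder()
    run(embedder)
    del embedder
    gc.collect()
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated() - before
```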
I've tried to get the size of the DeepBLAST model, but I keep getting contradictory results. The memory used according to nvidia-smi varies with the GPU and the torch version, from 740 MiB (Titan X, torch 1.5) to 1284 MiB (RTX 8000, torch 1.7).
After I delete the model, gc.collect(), empty the CUDA cache and check that the gc doesn't track any tensors anymore, I still see about 1 GB of memory being used. However, doing the load-model-delete-clear cycle multiple times shows the exact same numbers, so it's not a memory leak.
I feel that I must be missing something obvious, but all Google results only point to what I've already checked :confused:
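For completeness, the "gc doesn't track any tensors anymore" check can be done by walking gc.get_objects(); if this comes back empty and nvidia-smi still shows ~1 GB, the remainder is plausibly the CUDA context and kernel images rather than tensors. A sketch:

```python
import gc
import torch

def live_cuda_tensors():
    """Return (type, shape) for every CUDA tensor the garbage collector still
    tracks; an empty list means no Python-side tensor is pinning GPU memory."""
    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                found.append((type(obj).__name__, tuple(obj.shape)))
        except Exception:
            # some objects raise on attribute access during inspection
            continue
    return found
```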
@sacdallago What would be the recommended way to do embedding in a subprocess? I'm having this issue with UniRep (I'm using v0.2.0), where in Colab it runs out of memory after ~200 small peptide sequences, on both GPU and CPU.
I've even moved things over to a server (CPU only) with ~1 TB of RAM, and it has climbed its way up to using 70 GB now, so I'll probably stop it soon, though I'm not sure how far along it is.
I'm only having this issue with UniRep though by the way!
@tijeco Could you please open a new issue with details on how to reproduce this (ideally with the fasta file, if you can publish it)?
@konstin Will do!
I once had a situation in which the pipeline was in a visualize stage, but the GPU was still occupied by the embedder (SeqVec).
I had assumed that the embedder is destroyed after the embed stage (the stages are written in a way which should make Python's automatic garbage collection easy). But apparently I was wrong.
Maybe it makes sense to explicitly del embedder at the end of the embed stage.

It's worth looking into this. The visualize stage is sometimes slow (it can take up to 2 days for big plots)... Occupying GPU resources for no good reason is a waste in those cases. In the future (e.g. with extract), GPU RAM will be needed.
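The suggested explicit cleanup could look like this in a stage function; the names (embed_stage, make_embedder) are illustrative, not the pipeline's actual API:

```python
import gc
import torch

def embed_stage(sequences, make_embedder):
    """Sketch of the suggested fix: drop the embedder explicitly before the
    next (possibly long-running) stage starts."""
    embedder = make_embedder()
    embeddings = [embedder.embed(sequence) for sequence in sequences]
    del embedder                  # drop the last reference to the model
    gc.collect()                  # collect any reference cycles keeping it alive
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
    return embeddings
```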