kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0

High Storage Requirements for Running vep_embeddings.py in EQTL VEP Analysis #28

Closed: yangzhao1230 closed this issue 1 month ago

yangzhao1230 commented 2 months ago

Hi,

Thanks for your great work and well-documented code.

I'm currently facing an issue related to the storage requirements for running the vep_embeddings.py script as part of the EQTL VEP analysis. It seems that the script requires an unexpectedly high amount of storage space.

To date, running this script has already consumed over 800GB of local storage, and it continues to require more. This level of storage demand is quite substantial and seems unusual for typical usage. I would like to understand if this is expected behavior or if there might be a potential issue with how the script handles data.

yair-schiff commented 2 months ago

This definitely sounds like a memory leak of some sort, unfortunately. I did not observe such high memory usage.

yangzhao1230 commented 2 months ago
[Screenshots: storage usage and run configuration]

Today I restarted the vep_embeddings.py script from your Caduceus project on a machine with 8 GPUs. I set the environment variable HF_HOME to a newly created huggingface directory under the project directory, so all of the storage consumed comes solely from this script. The storage used and the configuration can be seen in my screenshots above. The script has also been running for almost a day now and still hasn't finished. I would like to know whether such high storage and time requirements are normal. My configuration follows the Enformer setup you provided exactly, with the same number of GPUs; the only change I made was to increase the number of workers. Thank you.
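
For context, a minimal sketch (not taken from vep_embeddings.py) of the kind of cache redirection described above, assuming the standard Hugging Face environment-variable behavior:

```python
# Illustrative sketch: point the Hugging Face cache at a project-local
# directory so that any disk usage from this run is easy to attribute.
# The "huggingface" subdirectory name matches the setup described above.
import os

os.environ["HF_HOME"] = os.path.join(os.getcwd(), "huggingface")

# HF_HOME must be set before transformers/datasets are imported,
# otherwise their default cache paths have already been resolved.
import datasets      # noqa: E402
import transformers  # noqa: E402
```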

yangzhao1230 commented 2 months ago

The screenshot shows Enformer, but I've experienced similarly high storage and time requirements with Caduceus as well. I still haven't finished the dataset preparation step.

yangzhao1230 commented 2 months ago

This definitely sounds like a memory leak of some sort, unfortunately. I did not observe such high memory usage.

Thank you for your attention to this issue. To clarify: my "800GB of local storage" refers to the disk space used during the "prepare dataset" stage of the analysis, not memory (RAM). There seems to have been a misunderstanding about which resource is being heavily used.

I'd also like to ask whether it is normal for the "prepare dataset" stage to consume roughly 1TB of disk space and about a day of processing time per method, or whether this points to something unusual in my setup or in how the data is handled.

yair-schiff commented 2 months ago

What sequence length are you downloading for this dataset? The on-disk size for the 131k version is about 350GB for me.

yangzhao1230 commented 1 month ago

What sequence length are you downloading for this dataset? The on-disk size for the 131k version is about 350GB for me.

The same as you. Thanks!

yangzhao1230 commented 1 month ago

The previous high storage usage turned out to be due to many cache files (see the sketch below). By the way, I have a question about this task: in the paper, are the results obtained by embedding the longest theoretical input length of each model and then using the center 1536 bp to train the SVM?
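
For anyone hitting the same thing, a rough sketch (not part of the repo) of how the accumulated Hugging Face caches can be inspected and cleaned up, assuming the standard huggingface_hub and datasets APIs:

```python
# Illustrative sketch: inspect what is filling HF_HOME and remove the
# intermediate Arrow cache files that datasets.Dataset.map() writes
# during dataset preparation.
from huggingface_hub import scan_cache_dir

# Summarize downloads (models/datasets) stored in the hub cache.
cache_info = scan_cache_dir()
print(f"Hub cache size on disk: {cache_info.size_on_disk / 1e9:.1f} GB")

# For a loaded `datasets.Dataset` object named `ds` (hypothetical name),
# intermediate map() caches can be removed once processing is finished:
# n_removed = ds.cleanup_cache_files()
# print(f"Removed {n_removed} cache file(s)")
```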

yair-schiff commented 1 month ago

The length is model-specific, yes. Please see here for what we use for each model: https://github.com/kuleshov-group/caduceus/blob/hf_finetune/slurm_scripts/dump_vep_embeddings.sh

Yes, we take the center 1536 bp, but please note that this translates to a different number of tokens depending on the model. For example, for Caduceus, since we use bp tokenization, this also corresponds to 1536 tokens, but for Nucleotide Transformer, which uses 6-mer tokenization, it corresponds to 1536 / 6 = 256 tokens.
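
To make the arithmetic concrete, here is an illustrative Python snippet (the sequence handling and token granularities are assumptions for the example, not the repo's code):

```python
# Illustrative only: a fixed 1536 bp center window maps to a
# model-specific number of tokens depending on tokenization granularity.
def center_window(seq: str, window_bp: int = 1536) -> str:
    """Return the center `window_bp` base pairs of a sequence."""
    start = (len(seq) - window_bp) // 2
    return seq[start:start + window_bp]

WINDOW_BP = 1536
for model, bp_per_token in [("Caduceus (bp tokenization)", 1),
                            ("Nucleotide Transformer (6-mer)", 6)]:
    print(f"{model}: {WINDOW_BP // bp_per_token} tokens")
# Caduceus (bp tokenization): 1536 tokens
# Nucleotide Transformer (6-mer): 256 tokens
```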

yangzhao1230 commented 1 month ago
[Screenshot: per-model input lengths from dump_vep_embeddings.sh]

@yair-schiff Thanks for your help. Could you explain why Caduceus-PS uses an input length of 16k?

yair-schiff commented 1 month ago

That is a typo. Thanks for catching it! I'll push a correction later today.

yair-schiff commented 1 month ago

@yangzhao1230 , this should be fixed now.