This definitely sounds like a memory leak of some sort, unfortunately. I did not observe such high memory usage.
Today, I restarted on a machine with 8 GPUs to run the vep_embeddings.py script from your Caduceus project. I set the environment variable HF_HOME to a newly created directory named huggingface under the project directory, meaning that all of the memory usage reported comes solely from this script. The memory usage and the configuration I used can be seen in my figure. Additionally, this script has been running for almost a day now and still hasn't finished processing. I would like to know whether such high memory and time requirements are normal. Thank you. My configuration follows the Enformer setup you provided exactly, with the same number of GPUs; the only change I made was increasing the number of workers.
The figure shows Enformer, but I've experienced similar high memory and time requirements using Caduceus as well. Currently, I have not yet finished the step of preparing the dataset.
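For reference, a minimal sketch of how the cache redirection described above might look in code (the `huggingface` directory name matches the setup described in this thread; everything else is illustrative, not taken from the Caduceus repo):

```python
# Illustrative sketch: point the Hugging Face cache at a project-local
# directory so that all files written by the script are easy to find and
# measure. HF_HOME should be set before importing transformers/datasets.
import os

hf_home = os.path.join(os.getcwd(), "huggingface")  # project-local cache dir
os.makedirs(hf_home, exist_ok=True)
os.environ["HF_HOME"] = hf_home
```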
> This definitely sounds like a memory leak of some sort, unfortunately. I did not observe such high memory usage.
Thank you for your attention to this issue. I wanted to clarify that my reference to "800GB of local storage" specifically pertains to the disk storage space used during the "prepare dataset" stage of our analysis, not memory (RAM). It seems there was a misunderstanding regarding the type of resource being heavily utilized.
Additionally, I'd like to inquire if it's normal for the "prepare dataset" stage to consume approximately 1TB of disk storage and about a day's worth of processing time per method. I'm trying to gauge if the storage and time demands I'm experiencing during this particular stage are typical or if there might be something unusual with my setup or the way the data is handled.
What sequence length are you downloading for this dataset? The on-disk size for 131k for me is about 350G.
> What sequence length are you downloading for this dataset? The on-disk size for 131k for me is about 350G.
The same as you. Thanks!
The previous high storage usage turned out to be due to many cache files. By the way, I want to ask about this task: in the paper, are the results obtained by taking the longest theoretical input length of each model for embedding, then using the center 1536 bp to train the SVM?
The length is model-specific, yes. Please see here for what we use for each model: https://github.com/kuleshov-group/caduceus/blob/hf_finetune/slurm_scripts/dump_vep_embeddings.sh
Yes, we take the center 1536 bp, but please note that this translates to a different number of tokens depending on the model. For example, for Caduceus, since we use bp-level tokenization, this also corresponds to 1536 tokens, but for Nucleotide Transformer, which uses 6-mer tokenization, it corresponds to 1536 / 6 = 256 tokens.
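The bp-to-token bookkeeping above can be sketched as simple arithmetic (the function name and constant here are illustrative only):

```python
# Illustrative arithmetic: the fixed 1536 bp center window maps to a
# model-dependent token count based on tokenization granularity.
CENTER_BP = 1536

def center_token_count(kmer_size: int) -> int:
    # The window must divide evenly into k-mers for this to be exact.
    assert CENTER_BP % kmer_size == 0
    return CENTER_BP // kmer_size

print(center_token_count(1))  # bp tokenization (e.g. Caduceus) -> 1536
print(center_token_count(6))  # 6-mer tokenization (e.g. Nucleotide Transformer) -> 256
```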
@yair-schiff Thanks for your help. But I'd like to know: why does Caduceus-PS use an input length of 16k?
That is a typo. Thanks for catching that! I'll push a correction later today.
@yangzhao1230 , this should be fixed now.
Hi,
Thanks for your great work and well-documented code.
I'm currently facing an issue related to the storage requirements for running the vep_embeddings.py script as part of the EQTL VEP analysis. It seems that the script requires an unexpectedly high amount of storage space.
To date, running this script has already consumed over 800GB of local storage, and it continues to require more. This level of storage demand is quite substantial and seems unusual for typical usage. I would like to understand if this is expected behavior or if there might be a potential issue with how the script handles data.