qqwang-berkeley / JUM

A tool for annotation-free differential analysis of tissue-specific pre-mRNA alternative splicing patterns
MIT License

Huge amount of space required!!! #31

Closed by Yago-91 3 years ago

Yago-91 commented 4 years ago

Hi! I would like to ask about the temporary intermediate files generated during the JUM_B.sh step. Among them are files named *temp_long_intron_retention_junction_coordinate_with_read_num_pvalue_1.txt. Is it normal for these text files to reach 27-30 GB when the BAM files average 5 GB? I am analyzing a 3 vs 3 experiment, and so far the folder holding all the data required by this script is about 366 GB, so I suppose a 10 vs 10 experiment would simply be infeasible, not to mention larger ones. Please comment on this. Thanks in advance. Ivó
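To make numbers like these easy to reproduce, here is a minimal Python sketch that totals the size of the files matching a name pattern under a folder. The function names `total_size_bytes` and `human` are made up for this illustration and are not part of JUM; the folder you point it at is whatever directory your JUM run writes to.

```python
from pathlib import Path


def total_size_bytes(root, pattern="*"):
    """Sum the sizes of all regular files under `root` whose name matches `pattern`.

    `root` is searched recursively; directories themselves are not counted.
    """
    return sum(p.stat().st_size for p in Path(root).rglob(pattern) if p.is_file())


def human(nbytes):
    """Format a byte count with decimal units (1 GB = 1000**3 bytes)."""
    size = float(nbytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"
```

For example, `human(total_size_bytes(run_dir, "*pvalue_1.txt"))` reports the combined size of the intermediate files named above, where `run_dir` is a placeholder for your own working directory.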

qqwang-berkeley commented 4 years ago

Dear Ivó,

For BAM files averaging 7 GB, text files such as temp_long_intron_retention_junction_coordinate_with_read_num_pvalue_1.txt should only be about 2.8 MB.

Yes, JUM does generate relatively large intermediate files during each step, but the total is manageable. Here is an example: for a three-way comparison of 6 vs 6 vs 8 samples (20 in total), with each BAM file at 6-10 GB, the JUM_A step took ~396 GB in total, the JUM_B step ~174 GB, and the JUM_C step 163 MB. Please also keep in mind that JUM puts these temporary files into a temp folder, which users can delete once each step has finished without error. So the storage burden at any given time is ~396 GB for 20 samples with decent sequencing depth.
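The arithmetic behind that figure can be sketched in a few lines of Python. The per-step sizes are the ones quoted above (reading "163M" as 163 MB); the linear per-sample scaling in `estimate_peak_gb` is only an assumption made for this illustration, since actual usage depends on sequencing depth and has not been measured for other sample counts.

```python
# Intermediate storage per step for the 20-sample example above, in GB.
step_gb = {"JUM_A": 396.0, "JUM_B": 174.0, "JUM_C": 0.163}
n_samples = 20

# If the temp folder is deleted after each step, only the largest step counts.
peak_gb = max(step_gb.values())

# If nothing is ever deleted, the steps add up.
total_gb = sum(step_gb.values())


def estimate_peak_gb(n, per_sample_gb=peak_gb / n_samples):
    """Rough peak estimate for `n` samples.

    ASSUMPTION: storage grows linearly with sample count at the per-sample
    rate of the 20-sample example (about 19.8 GB per sample).
    """
    return n * per_sample_gb
```

Under that assumption, a 10 vs 10 experiment has the same 20 samples as the example, so its peak would land near the same ~396 GB, provided the temp folder is cleared between steps.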

That said, I have been working on more efficient storage formats or tools for these intermediate files, and on deleting them automatically once they are no longer needed, so that running JUM will be more space-friendly. In the current version I leave them in place for sanity checks and in case users need to dig deeper. The space savings should arrive in upcoming updates.
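Until then, the manual cleanup can be wrapped in a small script. The Python sketch below is an illustration only, not JUM's actual or planned implementation: `run_then_clean` is a hypothetical helper, and both the command and the temp folder path are placeholders to fill in for your own run. It removes the folder only when the step exits with status 0, so the files stay available for inspection after a failure.

```python
import shutil
import subprocess
from pathlib import Path


def run_then_clean(cmd, temp_dir):
    """Run `cmd` (a list of arguments) and return its exit status.

    `temp_dir` is removed only if the command exits with status 0;
    on any other status it is left untouched for debugging.
    """
    result = subprocess.run(cmd)
    if result.returncode == 0:
        shutil.rmtree(Path(temp_dir), ignore_errors=True)
    return result.returncode
```

Keeping the deletion conditional on a clean exit mirrors the advice above to delete the temp folder only after a step has finished without error.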

Let me know if you have any other questions.

Qingqing
