uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0
5 stars 1 forks source link

`summarizeFasta` takes a lot of time? #795

Closed lydiayliu closed 11 months ago

lydiayliu commented 11 months ago

I just noticed this, but I've run summarizeFasta by itself on fastas of a variety of different sizes and it seems like taking ~40 minutes to run is the standard.

split/CPCG0269/CPCG0269_split4_CodonReassign.fasta                                                                                                                                      
[ 2023-08-11 00:58:30 ] moPepGen summarizeFasta started                                                                                                                                 
[ 2023-08-11 00:58:32 ] Reference indices loaded.                                                                                                                                       
[ 2023-08-11 01:38:41 ] rss=281.7 MiB, vms=863.4 MiB, wallclock=0:38:59.150000, system=0:01:09.060000, cpu_usage=99.8%                                                                  
split/CPCG0269/CPCG0269_split4_Noncoding.fasta                                                                                                                                          
[ 2023-08-11 01:38:41 ] moPepGen summarizeFasta started                                                                                                                                 
[ 2023-08-11 01:38:43 ] Reference indices loaded.                                                                                                                                       
[ 2023-08-11 02:21:10 ] rss=429.4 MiB, vms=1011 MiB, wallclock=0:41:14.670000, system=0:01:14.810000, cpu_usage=100.0%                                                                  
split/CPCG0269/CPCG0269_split4_NotCirc-Noncoding.fasta                                                                                                                                  
[ 2023-08-11 02:21:11 ] moPepGen summarizeFasta started                                                                                                                                 
[ 2023-08-11 02:21:12 ] Reference indices loaded.                                                                                                                                       
[ 2023-08-11 03:00:29 ] rss=272.2 MiB, vms=853.9 MiB, wallclock=0:38:00.870000, system=0:01:17.080000, cpu_usage=99.9%                                                                  
split/CPCG0269/CPCG0269_split4_NotCirc-circRNA-Noncoding.fasta                                                                                                                          
[ 2023-08-11 03:00:30 ] moPepGen summarizeFasta started                                                                                                                                 
[ 2023-08-11 03:00:31 ] Reference indices loaded.                                                                                                                                       
[ 2023-08-11 03:37:57 ] rss=269.8 MiB, vms=851.6 MiB, wallclock=0:36:16.760000, system=0:01:11.270000, cpu_usage=100.0%                                                                 
split/CPCG0269/CPCG0269_split4_NotCirc-circRNA.fasta                                                                                                                                    
[ 2023-08-11 03:37:57 ] moPepGen summarizeFasta started                                     
[ 2023-08-11 03:37:59 ] Reference indices loaded.                                           
[ 2023-08-11 04:15:30 ] rss=269.3 MiB, vms=850.8 MiB, wallclock=0:36:23.170000, system=0:01:10.480000, cpu_usage=100.0%                                                                 
split/CPCG0269/CPCG0269_split4_NotCirc.fasta                                                
[ 2023-08-11 04:15:31 ] moPepGen summarizeFasta started                                                                                                                                 
[ 2023-08-11 04:15:32 ] Reference indices loaded.                                                                                                                                       
[ 2023-08-11 04:57:34 ] rss=275.5 MiB, vms=857.1 MiB, wallclock=0:40:45.220000, system=0:01:15.110000, cpu_usage=99.8%                                                                  
split/CPCG0269/CPCG0269_split4_SECT-CodonReassign.fasta                                                                                                                                 
[ 2023-08-11 04:57:35 ] moPepGen summarizeFasta started                                     
[ 2023-08-11 04:57:37 ] Reference indices loaded.                                           
[ 2023-08-11 05:35:34 ] rss=268.2 MiB, vms=849.9 MiB, wallclock=0:36:47.090000, system=0:01:12.870000, cpu_usage=100.0%                                                                 
split/CPCG0269/CPCG0269_split4_SECT.fasta     
[ 2023-08-11 05:35:35 ] moPepGen summarizeFasta started                                     
[ 2023-08-11 05:35:36 ] Reference indices loaded.                                           
[ 2023-08-11 06:14:12 ] rss=270.7 MiB, vms=852.4 MiB, wallclock=0:37:23.130000, system=0:01:13.930000, cpu_usage=99.9% 

This is the commend I used

moPepGen summarizeFasta \
    --variant-peptides ${b} \
    --noncoding-peptides /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/ref/GRCh38-EBI-GENCODE34/gencode34_default_noncoding_peptides.fa \
    --gvf ${c} \
    --cleavage-rule trypsin \
    --index-dir /hot/project/disease/ProstateTumor/PRAD-000051-MIAPEP/Index/GRCh38-EBI-GENCODE34/ \
    --output-path ${a}/call-NonCanonicalPeptide-1.0.0/${a}/moPepGen-0.2.0-fdb97d0/output/${a}-variantPeptides_summary.txt

I know CPCG has a lot of GVFs but that doesn't seem to be the reason. It's possible that it is due to loading the noncoding-peptides file but summarizeFasta has always loaded the file. Could it be because now it needs to read the reference a lot from disk?

I also looked into CCLE, and realized that in the call-noncanonical peptide pipeline, now summarize_fasta is the road block?

1       69/de93a3       18537   call_parsers:parse_STARFusion (1)       COMPLETED       0       2023-08-10 10:33:51.614 43.6s   14.4s   69.2%   3.2 GB  5.2 GB  3.1 GB  117.3 KB
2       c5/5cd4d9       18432   call_parsers:parse_VEP (1)      COMPLETED       0       2023-08-10 10:33:51.540 49.4s   20s     41.1%   3.3 GB  5.4 GB  3.2 GB  489.4 KB
3       ef/9881d8       34004   call_variant    COMPLETED       0       2023-08-10 10:34:41.021 2m 17s  2m 17s  266.9%  5.9 GB  27.9 GB 4 GB    597.6 MB
5       7c/62b652       40894   process_database_merge:merge_fasta      COMPLETED       0       2023-08-10 10:36:58.392 55.8s   54.5s   106.8%  3.7 GB  5.8 GB  349.2 MB        332.9 MB
4       08/f3f4f3       40902   summarize_fasta COMPLETED       0       2023-08-10 10:36:58.398 33m 39s 33m 39s 100.0%  3.9 GB  6 GB    266.8 GB        25.3 KB
6       62/28a65f       60425   process_database_merge:summarize_fasta_merge    COMPLETED       0       2023-08-10 10:37:54.267 33m 3s  33m 3s  100.4%  3.9 GB  5.9 GB  266.8 GB        25.3 KB

You can see the trace files in this working directory here: /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CCLE/processed/noncanonical-database/call-nonCanonicalPeptide/work/4b/d60d67c5c7dc71b0cc63ae55ecb983/pipeline-meta-call-NonCanonicalPeptide-0.0.1/

Btw didn't realize that call_variant is this fast XD

As an example with a CCLE sample run on May 18 (before the GTF memory change):

yiyangliu@ip-0A125224:/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CCLE/processed/noncanonical-database/call-nonCanonicalPeptide/work/9a/b340798242ced9470681bd587e3e55/pipeline-meta-call-NonCanonicalPeptide-0.0.1$ cat */*/*/log*/nextflow-log/trace.txt
task_id hash    native_id       name    status  exit    submit  duration        realtime        %cpu    peak_rss        peak_vmem       rchar   wchar
2       60/e23e36       11665   call_parsers:parse_VEP (1)      COMPLETED       0       2023-05-19 17:28:05.695 2m 41s  2m 31s  74.2%   9.1 GB  10.2 GB 3.1 GB  384.6 KB
4       3d/bbefab       11652   call_parsers:parse_STARFusion (1)       COMPLETED       0       2023-05-19 17:28:05.679 2m 42s  2m 31s  74.4%   9.1 GB  10.2 GB 3.1 GB  1.2 MB
1       c6/f6437d       31619   process_database_split:filter_fasta_alt_translation     COMPLETED       0       2023-05-19 17:30:47.123 43.4s   42.5s   93.9%   653.6 MB        1.7 GB 69.7 MB  25.9 MB
3       1b/8d6863       31811   process_database_split:filter_fasta_noncoding   COMPLETED       0       2023-05-19 17:30:47.490 3m 18s  3m 17s  96.2%   3.5 GB  4.6 GB  327.8 MB       80 MB
5       34/24cddb       18497   call_variant    COMPLETED       0       2023-05-19 17:34:05.055 3m 47s  3m 46s  196.9%  11.3 GB 31.4 GB 3.9 GB  541.1 MB
7       f8/b6b219       22537   process_database_plain:filter_fasta     COMPLETED       0       2023-05-19 17:37:52.346 3s      2.1s    92.0%   84.6 MB 1.3 GB  30.8 MB 912.2 KB
8       33/02cfd5       22545   process_database_split:filter_fasta_variant     COMPLETED       0       2023-05-19 17:37:52.350 3.2s    2.3s    86.8%   141.2 MB        1.3 GB  30.8 MB912.2 KB
10      d1/ae58b1       22535   process_database_merge:merge_fasta      COMPLETED       0       2023-05-19 17:37:52.343 1m 2s   1m 1s   97.1%   3.7 GB  4.8 GB  355.7 MB        338.6 MB
11      ef/7dc1ab       24071   process_database_plain:summarize_fasta  COMPLETED       0       2023-05-19 17:37:55.327 1m 55s  1m 54s  98.0%   6.1 GB  7.2 GB  35.6 MB 25.1 KB
12      4e/22b660       24332   process_database_split:split_fasta_filtered     COMPLETED       0       2023-05-19 17:37:55.578 2m 54s  2m 53s  97.0%   7.9 GB  9 GB    203.1 MB       106.7 MB
13      16/6926aa       24325   process_database_split:summarize_fasta  COMPLETED       0       2023-05-19 17:37:55.574 2m 58s  2m 57s  97.4%   7.6 GB  8.7 GB  141.5 MB        25.7 KB
9       f5/d21b02       24042   process_database_split:split_fasta_unfiltered   COMPLETED       0       2023-05-19 17:37:55.302 4m 22s  4m 21s  97.5%   10.1 GB 11.2 GB 435 MB  338.6 MB
6       38/516270       22556   summarize_fasta COMPLETED       0       2023-05-19 17:37:52.354 4m 33s  4m 32s  98.2%   9.7 GB  10.8 GB 373.3 MB        25.7 KB
14      a6/7c68eb       1325    process_database_merge:summarize_fasta_merge    COMPLETED       0       2023-05-19 17:38:55.503 4m 19s  4m 18s  97.9%   9.7 GB  10.8 GB 373.3 MB       25.7 KB
15      33/8b5209       12463   process_database_merge:filter_fasta_merged      COMPLETED       0       2023-05-19 17:40:49.105 4m      4m      98.1%   4 GB    5.1 GB  368.5 MB       106.7 MB
16      0b/64f1cc       25675   process_database_merge:summarize_fasta_merge_filtered   COMPLETED       0       2023-05-19 17:44:49.519 2m 55s  2m 55s  98.1%   7.6 GB  8.7 GB  141.5 MB25.7 KB
lydiayliu commented 11 months ago

Update: it seems like the difference is mainly due to loading Noncoding and AltTranslation? But how come the dramatic time difference from before??

split/CPCG0269/CPCG0269_split4_CodonReassign.fasta
[ 2023-08-13 01:20:49 ] moPepGen summarizeFasta started
[ 2023-08-13 01:20:51 ] Reference indices loaded.
[ 2023-08-13 01:42:15 ] rss=187.8 MiB, vms=769.5 MiB, wallclock=0:21:07.730000, system=0:00:16.280000, cpu_usage=99.7%
split/CPCG0269/CPCG0269_split4_Noncoding.fasta
[ 2023-08-13 01:42:16 ] moPepGen summarizeFasta started
[ 2023-08-13 01:42:17 ] Reference indices loaded.
[ 2023-08-13 01:57:31 ] rss=231.5 MiB, vms=813.1 MiB, wallclock=0:14:23.560000, system=0:00:53.840000, cpu_usage=100.0%
split/CPCG0269/CPCG0269_split4_NotCirc-Noncoding.fasta
[ 2023-08-13 01:57:32 ] moPepGen summarizeFasta started
[ 2023-08-13 01:57:33 ] Reference indices loaded.
[ 2023-08-13 01:59:43 ] rss=186.6 MiB, vms=766.3 MiB, wallclock=0:02:04.970000, system=0:00:07.150000, cpu_usage=99.7%
split/CPCG0269/CPCG0269_split4_NotCirc-circRNA-Noncoding.fasta
[ 2023-08-13 01:59:43 ] moPepGen summarizeFasta started
[ 2023-08-13 01:59:45 ] Reference indices loaded.
[ 2023-08-13 02:00:12 ] rss=169.6 MiB, vms=750.6 MiB, wallclock=0:00:28.610000, system=0:00:01.680000, cpu_usage=99.9%
split/CPCG0269/CPCG0269_split4_NotCirc-circRNA.fasta
[ 2023-08-13 02:00:13 ] moPepGen summarizeFasta started
[ 2023-08-13 02:00:14 ] Reference indices loaded.
[ 2023-08-13 02:00:44 ] rss=173.2 MiB, vms=755 MiB, wallclock=0:00:31.350000, system=0:00:01.690000, cpu_usage=99.8%
split/CPCG0269/CPCG0269_split4_NotCirc.fasta
[ 2023-08-13 02:00:45 ] moPepGen summarizeFasta started
[ 2023-08-13 02:00:46 ] Reference indices loaded.
[ 2023-08-13 02:05:33 ] rss=186 MiB, vms=767.7 MiB, wallclock=0:04:44.770000, system=0:00:04.840000, cpu_usage=100.0%
split/CPCG0269/CPCG0269_split4_SECT-CodonReassign.fasta
[ 2023-08-13 02:05:33 ] moPepGen summarizeFasta started
[ 2023-08-13 02:05:35 ] Reference indices loaded.
[ 2023-08-13 02:06:02 ] rss=169.7 MiB, vms=751.3 MiB, wallclock=0:00:28.430000, system=0:00:01.770000, cpu_usage=99.9%
split/CPCG0269/CPCG0269_split4_SECT.fasta
[ 2023-08-13 02:06:02 ] moPepGen summarizeFasta started
[ 2023-08-13 02:06:04 ] Reference indices loaded.
[ 2023-08-13 02:06:31 ] rss=170 MiB, vms=751.6 MiB, wallclock=0:00:28.360000, system=0:00:01.630000, cpu_usage=99.9%
split/CPCG0269/CPCG0269_split4_circRNA-Noncoding.fasta
[ 2023-08-13 02:06:31 ] moPepGen summarizeFasta started
[ 2023-08-13 02:06:33 ] Reference indices loaded.
[ 2023-08-13 02:07:00 ] rss=170.9 MiB, vms=752.5 MiB, wallclock=0:00:29.020000, system=0:00:01.800000, cpu_usage=99.8%
split/CPCG0269/CPCG0269_split4_circRNA.fasta
[ 2023-08-13 02:07:01 ] moPepGen summarizeFasta started
[ 2023-08-13 02:07:02 ] Reference indices loaded.
[ 2023-08-13 02:10:42 ] rss=183.5 MiB, vms=765.3 MiB, wallclock=0:03:38.200000, system=0:00:04.590000, cpu_usage=100.0%
zhuchcn commented 11 months ago

Turns out the cause is the memory mapped GenomicAnnotationOnDisk, that transcript information are retreived from disk over and over again. So I get ride of that and only keep the useful information in memory. This should improve it.

zhuchcn commented 11 months ago

Yup super fast now!