uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0

circ + variants: is parallelization working? #347

Closed: lydiayliu closed 2 years ago

lydiayliu commented 2 years ago

I'm not sure parallelization is working for circRNAs: I've been running one sample for the better part of today and it hasn't budged. This is on a single node with a single process; I tried both 16 and 32 threads.

# derive the GVF basename and the sample prefix (e.g. CPCG0183)
a=/data/Parser/VEP/gencode/gsnp/CPCG0183.gencode.tsv.s.gvf
b=$(basename -- "$a"); echo "${b}"
c="${b%%.*}"; echo "${c}"
moPepGen callVariant \
    --input-variant /hot/users/yiyangliu/MoPepGen/Parser/CIRCexplorer3/TOPHAT/${c}_IP_quant.txt.1.s.gvf \
        /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/gsnp/${b} \
        /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/gindel/${b} \
        /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/somaticsniper/${b} \
        /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/pindel/${b} \
    --index-dir /hot/users/yiyangliu/MoPepGen/Index/GRCh38-EBI-GENCODE34/ \
    --verbose-level 1 \
    --threads 16 \
    --noncanonical-transcripts \
    --output-fasta /hot/users/yiyangliu/MoPepGen/Variant/CIRCexplorer3/TOPHAT/circ_ssm/${c}.fasta > /hot/users/yiyangliu/MoPepGen/Variant/CIRCexplorer3/TOPHAT/circ_ssm/${c}.log

Killing the process gives the traceback below; CPU usage in the Docker container also just hovers around 100%:

^CTraceback (most recent call last):
  File "/usr/local/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 79, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 280, in call_variant_peptide
    results = process_pool.map(wrapper, dispatches)
  File "/usr/local/lib/python3.8/site-packages/pathos/parallel.py", line 237, in map
    return list(self.imap(f, *args))
  File "/usr/local/lib/python3.8/site-packages/pathos/parallel.py", line 250, in <genexpr>
    return (subproc() for subproc in list(builtins.map(submit, *args)))
  File "/usr/local/lib/python3.8/site-packages/ppft/_pp.py", line 124, in __call__
    self.wait()
  File "/usr/local/lib/python3.8/site-packages/ppft/_pp.py", line 137, in wait
    self.lock.acquire()
KeyboardInterrupt
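
For context on where that interrupt lands: pathos's `map` is a blocking call that only returns once every dispatched job has finished, so a single stuck transcript leaves the parent process parked in ppft's `lock.acquire()`. A minimal sketch of that failure mode, assuming pathos's `ParallelPool` (the pool in the traceback above); `work` and the timings are hypothetical stand-ins, not moPepGen code:

from pathos.pools import ParallelPool

def work(seconds):
    import time  # ppft ships the function source to workers, so import inside
    time.sleep(seconds)  # stand-in for peptide calling on one transcript
    return seconds

pool = ParallelPool(nodes=4)
# map() blocks until *all* jobs return; if one job runs for hours, the
# parent sits in ppft's wait()/lock.acquire() the whole time, which is
# exactly the frame the KeyboardInterrupt above was caught in.
results = pool.map(work, [1, 1, 1, 3600])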

The entire log just looks like this (this sample used to run in under 20 minutes, before GVF indexing and multi-processing), and it's been quite a few hours:

yiyangliu@ip-0A125212:/hot/users/yiyangliu/MoPepGen/Variant/CIRCexplorer3/TOPHAT/circ_ssm$ tail CPCG0183.log
[ 2022-01-16 18:19:52 ] moPepGen callVariant started
[ 2022-01-16 18:21:11 ] Reference indices loaded.

However, I did get CPCG0100 to run through, and it only took a little longer than before, so I'm not sure...

yiyangliu@ip-0A125212:/hot/users/yiyangliu/MoPepGen/Variant/CIRCexplorer3/TOPHAT/circ_ssm$ tail CPCG0100.log
[ 2022-01-16 17:49:40 ] moPepGen callVariant started
[ 2022-01-16 17:51:00 ] Reference indices loaded.
[ 2022-01-16 18:13:28 ] Variant peptide FASTA file written to disk.

CPCG0100's old log, from before GVF indexing and multi-processing:

[ 2021-12-28 18:08:55 ] moPepGen callVariant started
[ 2021-12-28 18:10:27 ] Variant file /data/Parser/CIRCexplorer3/TOPHAT/CPCG0100_IP_quant.txt.1.3ff.gvf loaded.
[ 2021-12-28 18:16:39 ] Variant file /data/Parser/VEP/gencode/gsnp/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-28 18:17:36 ] Variant file /data/Parser/VEP/gencode/gindel/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-28 18:17:36 ] Variant file /data/Parser/VEP/gencode/somaticsniper/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-28 18:17:36 ] Variant file /data/Parser/VEP/gencode/pindel/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-28 18:18:00 ] Variant records sorted.
[ 2021-12-28 18:20:16 ] circRNA processed.
[ 2021-12-28 18:20:16 ] Variant peptide FASTA file written to disk.

I'm going to try with 12 threads and verbose level 2 now.

lydiayliu commented 2 years ago

Honestly, I see the same thing with fusions for no obvious reason. CPU usage just hovers around 200% (I gave it 16 threads). I don't think the runs are slower than before, but they're just as slow...

Before: /hot/users/yiyangliu/MoPepGen/Variant/Fusion/fusioncatcher-1.33/variant.ensembl.winu.3f.log

Now (still running the last few): /hot/users/yiyangliu/MoPepGen/Variant/Fusion/fusioncatcher-1.33/variant.ensembl.s.nc.log

Actually, never mind, it is slow but still an improvement over before! The 200% CPU is probably because a few fusions are much more complex than the others, and all the other threads end up waiting for them.
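
That straggler pattern is what a blocking map gives you: wall time is bounded by the slowest dispatch while finished workers sit idle. One common mitigation is to consume results in completion order; here is a sketch using pathos's unordered `uimap`, purely as an illustration of the scheduling idea rather than what callVariant actually does:

from pathos.pools import ProcessPool

def work(item):
    return item ** 2  # stand-in for the per-transcript wrapper

pool = ProcessPool(nodes=16)
# uimap() yields results as they complete, so fast transcripts stream
# back immediately instead of queueing behind the few complex ones.
results = list(pool.uimap(work, range(1000)))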

zhuchcn commented 2 years ago

I also ran the circRNA case; it seems to have gotten stuck at ENST00000484888.5. I agree that the traceback isn't as user-friendly as before, but there's a limit to what I can do about it.

As for parallelization: since the output order doesn't matter any more, maybe we can sort the transcripts by complexity so that the long-running ones are all processed together.
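
A rough sketch of that idea, assuming each dispatch exposes its transcript's variant records so their count can serve as a cheap complexity proxy (the `variants` attribute is hypothetical): submitting in descending order is longest-job-first scheduling, so the expensive transcripts start right away instead of trailing at the end on a single busy worker.

def sort_dispatches_by_complexity(dispatches):
    # More variants usually means a bigger variant graph, so use the
    # variant count as a cheap proxy for processing time. The
    # 'variants' attribute is hypothetical, for illustration only.
    return sorted(dispatches, key=lambda d: len(d.variants), reverse=True)

# dispatches = sort_dispatches_by_complexity(dispatches)
# results = process_pool.map(wrapper, dispatches)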