single job has been running for 11 hours?

lydiayliu commented 2 years ago

I almost didn't notice this. For the meta pipeline run of 22 hours, I have 373 samples that have databases produced. There seems to be one that is hanging for 11 hours?? That accounts for the last 3 samples that need to run

yiyangliu@ip-0A125212:~$ squeue -u yiyangliu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                87        F2 CCLE-mpg yiyangli  R   22:34:27      1 CZOHHPCSLURMPOC01-F2-10
               200       F72 nf-call_ yiyangli  R   11:14:09      1 CZOHHPCSLURMPOC01-F72-5

don't really know how to troubleshoot this... Here's the log /hot/project/algorithm/moPepGen/CCLE/processed/noncanonical-database/call-nonCanonicalPeptide/GRCh38-EBI-GENCODE34/log/CCLE.log

workdir is here /hot/project/algorithm/moPepGen/CCLE/processed/noncanonical-database/call-nonCanonicalPeptide/GRCh38-EBI-GENCODE34/work/
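
For spotting a job like this earlier, here is a minimal sketch that parses the `squeue` TIME column and flags long-running jobs. It is not part of the pipeline: the function names and the 6-hour threshold are made up for illustration, and it assumes job names contain no spaces.

```python
def slurm_time_to_seconds(text):
    """Convert a Slurm elapsed time ([D-][HH:]MM:SS) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    # Pad on the left so that MM:SS becomes [0, MM, SS].
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running_jobs(squeue_output, threshold_hours=6):
    """Return (job_id, name, elapsed) for jobs running longer than the threshold."""
    flagged = []
    for line in squeue_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        job_id, _partition, name, _user, _state, elapsed = fields[:6]
        if slurm_time_to_seconds(elapsed) > threshold_hours * 3600:
            flagged.append((job_id, name, elapsed))
    return flagged


# The squeue output pasted above, used as sample input.
SQUEUE_OUTPUT = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                87        F2 CCLE-mpg yiyangli  R   22:34:27      1 CZOHHPCSLURMPOC01-F2-10
               200       F72 nf-call_ yiyangli  R   11:14:09      1 CZOHHPCSLURMPOC01-F72-5
"""

for job in long_running_jobs(SQUEUE_OUTPUT):
    print(job)
```

With the default threshold both jobs are flagged. The parent Nextflow job (87) is expected to run long, so in practice the `nf-call_` task is the one to look at.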

zhuchcn commented 2 years ago

My first guess is that Nextflow failed to publish the output files to publishDir. I'm not 100% sure, but it could be because I'm using "move" here. I had a similar issue before with "move". Maybe changing it to "copy" would resolve it.

https://github.com/uclahs-cds/pipeline-meta-call-NonCanonicalPeptide/blob/7827510dfd87d885f4fad3567f29487398ddb821/main.nf#L111

lydiayliu commented 2 years ago

soo i'll just run the 3 samples again separately? sad that it was so close to being perfect XD

I still need to run the fasta entry runs, can you fix this before i do that?

zhuchcn commented 2 years ago

I think we can just change it to "copy"

lydiayliu commented 2 years ago

I tried again with these 3 samples, running just the 3 before and after we changed move to copy. Both are still going...

yiyangliu@ip-0A125212:~$ squeue -u yiyangliu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               284        F2 CCLE-mpg yiyangli  R    1:31:47      1 CZOHHPCSLURMPOC01-F2-10
               287        F2 CCLE-mpg yiyangli  R      34:54      1 CZOHHPCSLURMPOC01-F2-15
               285       F72 nf-call_ yiyangli  R    1:31:33      1 CZOHHPCSLURMPOC01-F72-5
               288       F72 nf-call_ yiyangli  R      30:24      1 CZOHHPCSLURMPOC01-F72-3

From the log it seems like one of the samples is not being run: /hot/project/algorithm/moPepGen/CCLE/processed/noncanonical-database/call-nonCanonicalPeptide/GRCh38-EBI-GENCODE34/work/96/f58e5126055ad80aa01132d25483e8/.command.log

ACH-000670 is the problematic sample

lydiayliu commented 2 years ago

I'm running MPG callVariant directly on the GVF of this sample

moPepGen callVariant \
    --index-dir /hot/users/yiyangliu/MoPepGen/Index/GRCh38-EBI-GENCODE34/ \
    -i /hot/users/yiyangliu/MoPepGen/Random/ACH-000670_starfusion.gvf /hot/users/yiyangliu/MoPepGen/Random/ACH-000670_vep_gencode.gvf \
    --verbose-level 2 \
    --threads 16 \
    -o /hot/users/yiyangliu/MoPepGen/Random/ACH-000670_variant_peptides.fasta

It's been 60+ minutes on one of these transcripts, not the most promising XD

[ 2022-05-28 01:28:56 ] ['ENST00000519984.1', 'ENST00000519529.1', 'ENST00000519503.5', 'ENST00000256412.8', 'ENST00000522298.1', 'ENST00000520193.1', 'ENST00000519301.6', 'ENST00000652698.1', 'ENST00000651149.1', 'ENST00000650866.1', 'ENST00000650856.1', 'ENST00000520407.5', 'ENST00000523534.5', 'ENST00000651335.1', 'ENST00000631040.2', 'ENST00000523079.5']

I cancelled my other jobs cuz I think we know where the problem is

zhuchcn commented 2 years ago

Seems like there are a lot of fusions in those transcripts. Those transcripts are probably from the same gene. I'll let it keep running overnight to see how it goes.

lydiayliu commented 2 years ago

Still going! Though it used to be using 9 CPUs yesterday and we are down to 8

9a2883c85f7c   sweet_shockley   800.13%   15.21GiB / 62.76GiB   24.23

lydiayliu commented 2 years ago

It finished! Took 16 hours lol. Is it something that is worth investigating?

[ 2022-05-28 01:28:56 ] ['ENST00000519984.1', 'ENST00000519529.1', 'ENST00000519503.5', 'ENST00000256412.8', 'ENST00000522298.1', 'ENST00000520193.1', 'ENST00000519301.6', 'ENST00000652698.1', 'ENST00000651149.1', 'ENST00000650866.1', 'ENST00000650856.1', 'ENST00000520407.5', 'ENST00000523534.5', 'ENST00000651335.1', 'ENST00000631040.2', 'ENST00000523079.5']
[ 2022-05-28 17:58:06 ] ['ENST00000650919.1', 'ENST00000356819.7', 'ENST00000651807.1', 'ENST00000650967.1', 'ENST00000652588.1', 'ENST00000521670.5', 'ENST00000287842.7', 'ENST00000650980.1', 'ENST00000405005.7', 'ENST00000651175.1', 'ENST00000650964.1', 'ENST00000520073.5', 'ENST00000523358.5', 'ENST00000523187.5', 'ENST00000518036.5', 'ENST00000328195.8']
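
As a quick sanity check on the "16 hours" above, the gap between the two log timestamps can be computed directly. This is only a small sketch; `elapsed_between` and `LOG_TIME_FORMAT` are names made up here, not part of moPepGen.

```python
from datetime import datetime

# Timestamp layout as it appears inside the "[ ... ]" prefix of each log line.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_between(start, end):
    """Return the time between two log timestamps as a timedelta."""
    return datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)


elapsed = elapsed_between("2022-05-28 01:28:56", "2022-05-28 17:58:06")
print(elapsed)  # 16:29:10
```

So that one batch of 16 transcripts held the run for roughly 16.5 hours.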

zhuchcn commented 2 years ago

The problem is in how a PVGNode is cleaved: the first cleavage site is found, the node is cleaved there, and then the search for the next site starts over. So the function find_first_cleave_or_stop_site here is called multiple times. Although it's a generator, the node sequence changes after every cleavage, so it is still very inefficient. I'm opening a PR right now.

https://github.com/uclahs-cds/private-moPepGen/blob/607f263085477eedf317cdc1177937edfedd7c48/moPepGen/svgraph/PeptideVariantGraph.py#L107-L112
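
To illustrate one way the pattern described above gets expensive on a giant node, here is a minimal sketch. It is not the moPepGen code: `is_cleavage_site`, `find_first_site`, `cleave_by_rescanning`, and `cleave_in_one_pass` are hypothetical names, and the rule is a simplified trypsin-like one (cut after K or R unless the next residue is P) that ignores stop sites and variants. The first version restarts the search on a re-sliced sequence after every cut; the second collects all sites in a single scan.

```python
def is_cleavage_site(seq, i):
    """Simplified trypsin-like rule: cut after K/R at index i unless followed by P."""
    return seq[i] in "KR" and i + 1 < len(seq) and seq[i + 1] != "P"


def find_first_site(seq):
    """Return the position right after the first cleavage site, or -1 if none."""
    for i in range(len(seq)):
        if is_cleavage_site(seq, i):
            return i + 1
    return -1


def cleave_by_rescanning(seq):
    """Cut at the first site, then search again on the re-sliced remainder."""
    peptides = []
    while True:
        cut = find_first_site(seq)
        if cut == -1:
            peptides.append(seq)
            return peptides
        peptides.append(seq[:cut])
        seq = seq[cut:]  # copies the whole remainder on every cut


def cleave_in_one_pass(seq):
    """Collect every cut position in a single scan, then slice once per peptide."""
    cuts = [i + 1 for i in range(len(seq)) if is_cleavage_site(seq, i)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


example = "MKTAYIAKQRPQISFVKSHFSRQ"
print(cleave_by_rescanning(example))
print(cleave_in_one_pass(example))
```

Both functions return the same peptides. The difference is that the rescanning version copies the remaining sequence after each cut, so the work grows with sequence length times the number of sites, while the one-pass version touches each residue a fixed number of times.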

lydiayliu commented 2 years ago

I'm surprised that the update to 0.5.0 seemed to reveal this, was it because the "pure" fusion was just not considered before?

zhuchcn commented 2 years ago

It's just because the transcript is so big. Most of the time was spent on cleaving the giant node.

uclahs-cds / package-moPepGen

single job has been running for 11 hours? #464