Closed lydiayliu closed 2 years ago
CPCG0241:
moPepGen callVariant \
--input-path \
/hot/user/yiyangliu/MoPepGen/Parser/REDItools/CPCG0241_candidates.rmsk.GRCh38_annotated.txt.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/CIRCexplorer2/CPCG0241_circularRNA_known.txt.1.s.gvf \ /hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/gindel/CPCG0241.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/gsnp/CPCG0241.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/pindel/CPCG0241.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/somaticsniper/CPCG0241.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/RMATS/CPCG0241_ijc5_sjc5.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/Fusion/star-fusion-1.9.1/CPCG0241.s.gvf \
--output-path CPCG0241_variant_peptides.fasta \
--index-dir GRCh38-EBI-GENCODE34 \
--min-length 7 \
--cleavage-rule trypsin \
--max-variants-per-node 7 \
--min-mw 500.0 \
--miscleavage 2 \
--max-length 25 \
--threads 16
One of the transcripts below is causing the problem.
[ 2022-06-06 20:10:35 ] ['ENST00000580133.2', 'ENST00000390284.2', 'ENST00000442890.1', 'ENST00000390285.4', 'ENST00000520107.1', 'ENST00000432623.1', 'ENST00000302273.2', 'ENST00000403807.4', 'ENST00000624350.1', 'ENST00000610778.1', 'ENST00000412149.1', 'ENST00000652112.1', 'ENST00000413293.1', 'ENST00000390290.3', 'ENST00000436538.1', 'ENST00000390294.2']
CPCG0262:
moPepGen callVariant \
--input-path \
/hot/user/yiyangliu/MoPepGen/Parser/REDItools/CPCG0262_candidates.rmsk.GRCh38_annotated.txt.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/CIRCexplorer2/CPCG0262_circularRNA_known.txt.1.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/gindel/CPCG0262.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/gsnp/CPCG0262.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/pindel/CPCG0262.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/somaticsniper/CPCG0262.gencode.tsv.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/RMATS/CPCG0262_ijc5_sjc5.s.gvf \
/hot/user/yiyangliu/MoPepGen/Parser/Fusion/star-fusion-1.9.1/CPCG0262.s.gvf \
--output-path CPCG0262_variant_peptides.fasta \
--index-dir GRCh38-EBI-GENCODE34 \
--min-length 7 \
--cleavage-rule trypsin \
--max-variants-per-node 7 \
--min-mw 500.0 \
--miscleavage 2 \
--max-length 25 \
--threads 16
[ 2022-06-06 20:10:35 ] ['ENST00000580133.2', 'ENST00000390284.2', 'ENST00000442890.1', 'ENST00000390285.4', 'ENST00000520107.1', 'ENST00000432623.1', 'ENST00000302273.2', 'ENST00000403807.4', 'ENST00000624350.1', 'ENST00000610778.1', 'ENST00000412149.1', 'ENST00000652112.1', 'ENST00000413293.1', 'ENST00000390290.3', 'ENST00000436538.1', 'ENST00000390294.2']
On the third node: ip-0A125225
There's no log for CPCG0250:
/scratch/aa/d8d9642363adeb9ed5ba7f56f57dbf/.command.log
Also no callVariant log for COCG0248:
/scratch/34/e4b340f7f4fd910c037307b625b05b/.command.log
The cause for CPCG0241 is that the transcript isn't large (117 amino acids), but it has a lot of RNA editing sites (38). The --max-variants-per-node
limits the number of variants per node. But if three consecutive cleavage regions are all hypermutated, the number of combination is still huge. For example, if a node has 7 variants, its downstream cleavage region has 7, and the downstream of the downstream also has 7 variants, the total combinations of variants would be 2 ^ 21 = 2,097,152. I guess we could add a variable like --additional-variants-per-misc
to specify the number of additional variants to consider per miscleavage. For example, if --max-variants-per-node
= 7 and --additional-variants-per-misc
= 2, then the the combinations we will considier would go down to 352,716. I'm trying to see how reasonable time it takes.
Btw, are the reditools result filtered?
I tried to fix this in #477. By setting --additional-variants-per-misc
to 2, CPCG0241 ENST00000390308.2 finished in 7 minutes, and CPCG0262 ENST00000390285.4 finished in 3 minutes. It's not perfect, but the only workaround that I can think of to let the program finish.
Just found that location is missing in our RES labels. They currently like 'RES-C-G' but they should be 'RES-100-C-G'. Will have to fix it..
sorry what does RES
stand for?
RES -> RNA editing sites
Quoting Julie here for the REDITOOLS filtering:
If I remember correctly the sites have been filtered so that the RNA frequency > 0.1 (so 10% of the reads have the event), the event had 0 frequency in the DNA and the position has at least 10 reads coverage in RNA and DNA. The sites have also been annotated with repeatMasker and RefSeq gene ids.
We had some extensive discussions with filtering in https://github.com/uclahs-cds/private-moPepGen/issues/288 and https://github.com/uclahs-cds/private-moPepGen/issues/294
I used all defult params in moPepGen parseREDItools
, which are these
optional arguments:
--min-coverage-alt <number>
Minimal read coverage of alterations to be parsed. (default: 3)
--min-frequency-alt <value>
Minimal frequency of alteration to be parsed. (default: 0.1)
--min-coverage-rna <value>
Minimal read coverage at the alteration site of RNAseq data of reference and all alterations. (default: 10)
--min-coverage-dna <number>
Minimal read coverage at the alteration site of WGS. Set it to -1 to skip checking this. (default: 10)
which gets a ~70k line GVF from a ~290k line input TXT file.
Though it seems like I ran the test with only REDItools using --max-variants-per-node 5
. But if this is too much for mpg we can always change the min coverage
to 30 for rna and dna, that should bring it down substantially?
Yeah right, our default parameters are set based on the recommendation of their paper. Would it be useful to take a look at the distribution of RES number per transcript? Maybe some transcripts indeed have a lot of RES?
/hot/user/yiyangliu/MoPepGen/Parser/VEP/gencode/gindel/CPCG0262.gencode.tsv.s.gvf
You mean a previous test run? That explains why we didn't see this before.
Hmm that GVF you are quoting is from VEP... I don't think I ever ran all of the CPCG REDItools through callVariant (maybe I should), but all the GVFs are here:
/hot/users/yiyangliu/MoPepGen/Parser/REDItools/
They are in the 20k-100k lines range.
I ran a single filtered sample was CPCG0196. That does explain why we havn't seen this before.
I'll try to pull some stats in a bit
Oh sorry that's funny. I was trying to quote your last sentence but maybe I got confused between ctrl+shift+c and ctrl+c.
Though it seems like I ran the test with only REDItools using --max-variants-per-node 5.
I ran a single filtered sample was CPCG0196. That does explain why we havn't seen this before.
That makes sense!
I tallied the number of GVF entries per transcript and printed the top 5 from each sample in this file
/hot/user/yiyangliu/MoPepGen/Parser/REDItools/gvf_entries_max_5_transcripts.txt
You can see that the usual culprits are pretty clear (ENST00000356079.9 I'm talking about you!!), going to 300+ variants
OMG it's actually the protein coding transcript of CYP20A1, is it known that it is so heavily RNA-edited???
Try this sample for an upper bound
CPCG0342_candidates.rmsk.GRCh38_annotated.txt.s.gvf
114 ENST00000416600.6
114 ENST00000428216.4
116 ENST00000360845.3
203 ENST00000233190.11
362 ENST00000356079.9
Just found that location is missing in our RES labels. They currently like 'RES-C-G' but they should be 'RES-100-C-G'. Will have to fix it..
LMAO I also just noticed. Well it seems like we need to reparse XD I can increase the stringency of the filtering at the same time, do you think we should?
The last thing is that we can just do --max-variants-per-node 5
for all of CPCG!
Oh I guess what is more useful is the distribution of the coverage of each rna editing site. But it does seem like there are transcripts been highly RNA edited. Might be good to look at the number of records when setting min coverage to 30?
The last thing is that we can just do --max-variants-per-node 5 for all of CPCG!
Sure!
I'll try 30 along with fixing the IDs!
sooooooo the interesting thing is, after going for --min-coverage-rna 30
and --min-coverage-dna 30
, (--min-coverage-dna 30
alone didn't do too much), the top transcript still has 300+ variants in some samples
316 ENST00000356079.9
318 ENST00000356079.9
328 ENST00000356079.9
332 ENST00000356079.9
335 ENST00000356079.9
There are fewer GVF entries now of course, but I've decided to run things through with just REDItools to see if everybody can finish in a reasonable time (also since i cant get any F72 node to run for me so the metapipeline is shot)
update: it seems like the transcript with 300+ variants is not the one causing trouble
All samples up to CPCG0248 has run in like 2-3 minutes (wallclock at the end of callVariant), for CPCG0248 these transcripts have been running for 20+ minutes. So a crube tally of variants per transcript is not reflective of run time XD
[ 2022-06-09 15:41:05 ] ['ENST00000432883.5', 'ENST00000447208.6', 'ENST00000420290.6', 'ENST00000434570.6', 'ENST00000398042.6', 'ENST00000399807.7', 'ENST00000327374.9', 'ENST00000215730.12', 'ENST00000610778.1', 'ENST00000652112.1', 'ENST00000390290.3', 'ENST00000390299.2', 'ENST00000390306.2', 'ENST00000390308.2', 'ENST00000390312.2', 'ENST00000620395.2']
It probably has to do with how many variants are super close together, like all of these are within 300 nucleotides
The culprit is probably one of the below (I see only 1 CPU running)
yiyangliu@ip-0A125212:/hot/users/yiyangliu/MoPepGen/Parser/REDItools$ grep ENST00000390299.2 CPCG0248_candidates.rmsk.GRCh38_annotated.txt.mincd30.gvf
ENSG00000211653.2 321 RES-321-G-A G A . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410086;STRAND=1
ENSG00000211653.2 323 RES-323-A-T A T . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410088;STRAND=1
ENSG00000211653.2 372 RES-372-G-A G A . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410137;STRAND=1
ENSG00000211653.2 373 RES-373-G-C G C . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410138;STRAND=1
ENSG00000211653.2 376 RES-376-A-C A C . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410141;STRAND=1
ENSG00000211653.2 378 RES-378-A-T A T . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410143;STRAND=1
ENSG00000211653.2 386 RES-386-G-A G A . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410151;STRAND=1
ENSG00000211653.2 395 RES-395-G-C G C . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410160;STRAND=1
ENSG00000211653.2 415 RES-415-G-C G C . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410180;STRAND=1
ENSG00000211653.2 431 RES-431-C-G C G . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410196;STRAND=1
ENSG00000211653.2 450 RES-450-A-T A T . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410215;STRAND=1
ENSG00000211653.2 461 RES-461-G-C G C . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410226;STRAND=1
ENSG00000211653.2 479 RES-479-T-A T A . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410244;STRAND=1
ENSG00000211653.2 484 RES-484-A-T A T . . TRANSCRIPT_ID=ENST00000390299.2;GENOMIC_POSITION=chr22:22410249;STRAND=1
yiyangliu@ip-0A125212:/hot/users/yiyangliu/MoPepGen/Parser/REDItools$ grep ENST00000390306.2 CPCG0248_candidates.rmsk.GRCh38_annotated.txt.mincd30.gvf
ENSG00000211660.3 339 RES-339-G-C G C . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698127;STRAND=1
ENSG00000211660.3 421 RES-421-G-A G A . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698209;STRAND=1
ENSG00000211660.3 427 RES-427-A-G A G . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698215;STRAND=1
ENSG00000211660.3 428 RES-428-G-A G A . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698216;STRAND=1
ENSG00000211660.3 435 RES-435-C-T C T . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698223;STRAND=1
ENSG00000211660.3 453 RES-453-A-G A G . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698241;STRAND=1
ENSG00000211660.3 456 RES-456-G-A G A . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698244;STRAND=1
ENSG00000211660.3 458 RES-458-A-C A C . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698246;STRAND=1
ENSG00000211660.3 478 RES-478-C-G C G . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698266;STRAND=1
ENSG00000211660.3 483 RES-483-G-T G T . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698271;STRAND=1
ENSG00000211660.3 488 RES-488-A-T A T . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698276;STRAND=1
ENSG00000211660.3 493 RES-493-G-A G A . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698281;STRAND=1
ENSG00000211660.3 498 RES-498-T-C T C . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698286;STRAND=1
ENSG00000211660.3 501 RES-501-G-A G A . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698289;STRAND=1
ENSG00000211660.3 519 RES-519-T-C T C . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698307;STRAND=1
ENSG00000211660.3 520 RES-520-A-G A G . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698308;STRAND=1
ENSG00000211660.3 531 RES-531-T-C T C . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698319;STRAND=1
ENSG00000211660.3 539 RES-539-A-G A G . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698327;STRAND=1
ENSG00000211660.3 549 RES-549-C-A C A . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698337;STRAND=1
ENSG00000211660.3 567 RES-567-C-T C T . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698355;STRAND=1
ENSG00000211660.3 571 RES-571-G-A G A . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698359;STRAND=1
ENSG00000211660.3 572 RES-572-G-T G T . . TRANSCRIPT_ID=ENST00000390306.2;GENOMIC_POSITION=chr22:22698360;STRAND=1
yiyangliu@ip-0A125212:/hot/users/yiyangliu/MoPepGen/Parser/REDItools$ grep ENST00000390312.2 CPCG0248_candidates.rmsk.GRCh38_annotated.txt.mincd30.gvf
ENSG00000211666.2 254 RES-254-T-C T C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22758953;STRAND=1
ENSG00000211666.2 267 RES-267-A-G A G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22758966;STRAND=1
ENSG00000211666.2 275 RES-275-C-T C T . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22758974;STRAND=1
ENSG00000211666.2 289 RES-289-C-G C G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22758988;STRAND=1
ENSG00000211666.2 296 RES-296-T-C T C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22758995;STRAND=1
ENSG00000211666.2 300 RES-300-G-A G A . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22758999;STRAND=1
ENSG00000211666.2 305 RES-305-T-C T C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759004;STRAND=1
ENSG00000211666.2 306 RES-306-G-A G A . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759005;STRAND=1
ENSG00000211666.2 307 RES-307-G-C G C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759006;STRAND=1
ENSG00000211666.2 309 RES-309-T-C T C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759008;STRAND=1
ENSG00000211666.2 316 RES-316-A-C A C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759015;STRAND=1
ENSG00000211666.2 337 RES-337-A-G A G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759036;STRAND=1
ENSG00000211666.2 345 RES-345-A-G A G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759044;STRAND=1
ENSG00000211666.2 347 RES-347-A-G A G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759046;STRAND=1
ENSG00000211666.2 355 RES-355-A-C A C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759054;STRAND=1
ENSG00000211666.2 357 RES-357-C-G C G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759056;STRAND=1
ENSG00000211666.2 360 RES-360-A-C A C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759059;STRAND=1
ENSG00000211666.2 367 RES-367-A-T A T . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759066;STRAND=1
ENSG00000211666.2 369 RES-369-G-C G C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759068;STRAND=1
ENSG00000211666.2 370 RES-370-A-G A G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759069;STRAND=1
ENSG00000211666.2 375 RES-375-A-C A C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759074;STRAND=1
ENSG00000211666.2 395 RES-395-T-C T C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759094;STRAND=1
ENSG00000211666.2 399 RES-399-A-G A G . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759098;STRAND=1
ENSG00000211666.2 405 RES-405-T-C T C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759104;STRAND=1
ENSG00000211666.2 459 RES-459-G-C G C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759158;STRAND=1
ENSG00000211666.2 482 RES-482-C-T C T . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759181;STRAND=1
ENSG00000211666.2 488 RES-488-C-T C T . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759187;STRAND=1
ENSG00000211666.2 500 RES-500-C-T C T . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759199;STRAND=1
ENSG00000211666.2 503 RES-503-C-T C T . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759202;STRAND=1
ENSG00000211666.2 504 RES-504-A-C A C . . TRANSCRIPT_ID=ENST00000390312.2;GENOMIC_POSITION=chr22:22759203;STRAND=1
Currently known:
CPCG0241 CPCG0262
GVFs in here:
/hot/project/algorithm/moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/GRCh38-EBI-GENCODE34/input_csvs/CPCG.csv
If it's really just a run time issue we can always exclude RNA-editing?