Closed lydiayliu closed 11 months ago
The one peptide recognized as 'circRNA' by summarizeFasta is actually a circRNA from a noncoding transcript. So this was a summarizeFasta issue, and seems like is already fixed. I rerun splitFasta and summarizeFasta and the results look fine. The output is here:
/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/2023-08-21_raw/pipeline-meta-call-NonCanonicalPeptide-0.0.1/split/CPCG0241/split5/
The peptide is found in many transcripts. I'm simplifying it for the purpose of illustration. Let's say we have a peptide of this below:
> ENST00000477571.5|SE_24151-25129-25220-34752|SNV-25146-T-G|1 CIRC-ENST00000391878.6-E3-E4-E5|38
LARMVSIS
The first label ENST00000477571.5|SE_24151-25129-25220-34752|SNV-25146-T-G|1
is {'altSplic', 'gSNP'}
, and the second one CIRC-ENST00000391878.6-E3-E4-E5|38
is {'circRNA'}
. Based on our rules, the second label has higher priority because it has 1 source. But if we specify source groups, both altSplice
and gSNP
will become NotCirc
, and then the source of the first label becomes {'NotCirc'}
, and it wins over {'circRNA'}
this time. So this is why this peptide is put into the NotCirc tier by splitFasta
when using --group-source
, but counted as circRNA
by summarizeFasta
when --group-source
is not used.
Interesting! As per discussion, we decided to leave summarizeFasta
as is and deal with this problem in msptools annotate
For the record, I also switched the group ordering in splitFast
group_source = 'ALT:SECT,CodonReassign NotCirc:gSNP,gIndel,sSNV,sIndel,Fusion,altSplice,RNAEdit'
order_source = 'NotCirc,ALT,Noncoding,circRNA'
max_source_groups = 1
additional_split = 'Noncoding circRNA NotCirc'
And now when I summarizeFasta
on the split but don't use grouping, there are still a few discrepancies.
--order-source gSNP,gIndel,sSNV,sIndel,RNAEdit,Fusion,altSplice,CodonReassign,SECT,Noncoding,circRNA
altSplice-CodonReassign
in ALT
, this I can't really understand
CPCG0100_split_ALT.fasta.txt
sources n_total n_0_misc n_1_misc n_2_misc
CodonReassign 424659 106430 168170 150059
SECT 69 18 30 21
altSplice-CodonReassign 2 0 1 1
CodonReassign-SECT 30 5 12 13
circRNA in Noncoding-additional
CPCG0100_split_Noncoding-additional.fasta.txt
sources n_total n_0_misc n_1_misc n_2_misc
gSNP-Noncoding 197674 56489 78024 63161
gSNP-circRNA 64 12 33 19
gIndel-Noncoding 36260 11290 13876 11094
gIndel-circRNA 23 4 10 9
I think I understand these though, because if the peptides had NotCirc + NotCirc + Noncoding
it would be in this group, but is now assigned as variant + circRNA
because of the combination of 2
The Noncoding
that shows up in NotCirc
should be the same reason
CPCG0100_split_NotCirc-additional.fasta.txt
sources n_total n_0_misc n_1_misc n_2_misc
gSNP-CodonReassign 7510 1702 2997 2811
gIndel-CodonReassign 2145 434 885 826
sSNV-CodonReassign 8 1 3 4
RNAEdit-CodonReassign 13507 3031 5359 5117
RNAEdit-SECT 3 1 1 1
Fusion-CodonReassign 5 1 2 2
altSplice-CodonReassign 66819 19907 24847 22065
CodonReassign-Noncoding 7 5 2 0
CPCG0100_split_NotCirc.fasta.txt
sources n_total n_0_misc n_1_misc n_2_misc
gSNP 26582 6488 10559 9535
gIndel 3913 1010 1560 1343
sSNV 36 10 12 14
sIndel 6 1 2 3
RNAEdit 30055 6926 11853 11276
Fusion 63 15 24 24
altSplice 144901 38670 57436 48795
Noncoding 12 8 3 1
Yeah exactly. This is complicated.
altSplice-CodonReassign
inALT
, this I can't really understandCPCG0100_split_ALT.fasta.txt sources n_total n_0_misc n_1_misc n_2_misc CodonReassign 424659 106430 168170 150059 SECT 69 18 30 21 altSplice-CodonReassign 2 0 1 1 CodonReassign-SECT 30 5 12 13
This is when the peptide has a label of {'SECT', 'CodonReassign'}
and another label of {'altSplice', 'CodonReassign'}
, and the latter has a higher priority without grouping, and they become {'ALT'}
and {'NotCirc', 'ALT'}
after grouping.
Ahh got it, so they are all for the same (complicated) reasons! I will close this issue and create and new one in msptools
since we are not planning to fix summarizeFasta
So this is how we are splitting CPCG now
All of the group level
summarizeFasta
results look great, and I think I'm going to move forward with this!One bit of curiosity though, there is 1
circRNA
and severalNoncoding
annotation inNotCirc
which is not there in the group level summary
There's probably even more edge cases for
summarizeFasta
but I won't worry about it nowThe results are here:
/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/2023-08-21_raw/pipeline-meta-call-NonCanonicalPeptide-0.0.1/split/CPCG0241/split1/
Note:
/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/2023-08-21_raw/pipeline-meta-call-NonCanonicalPeptide-0.0.1/split/CPCG0241/split4/
has a few additional anomaliesOriginally posted by @lydiayliu in https://github.com/uclahs-cds/project-MissingPeptides-Method/issues/240#issuecomment-1694564886