uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0
6 stars 1 forks source link

Some more peculiarities with `summarizeFasta` #811

Closed lydiayliu closed 11 months ago

lydiayliu commented 1 year ago

So this is how we are splitting CPCG now

moPepGen splitFasta \
    --variant-peptides ${b} \
    --gvf ${c} \
    --output-prefix split/${a}/${a}_split1 \
    --noncoding-peptides /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/ref/GRCh38-EBI-GENCODE34/noncoding/min.w2f.fa \
    --alt-translation-peptides /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/ref/GRCh38-EBI-GENCODE34/alt_translation/sect_w2f.fasta \
    --index-dir /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/ref/GRCh38-EBI-GENCODE34/index/ \
    --group-source ALT:SECT,CodonReassign NotCirc:gSNP,gIndel,sSNV,sIndel,Fusion,altSplice,RNAEdit \
    --order-source ALT,NotCirc,circRNA,Noncoding \
    --max-source-groups 1 \
    --additional-split Noncoding circRNA NotCirc

All of the group level summarizeFasta results look great, and I think I'm going to move forward with this!

One bit of curiosity though, there is 1 circRNA and several Noncoding annotation in NotCirc

CPCG0241_split1_NotCirc.fasta.txt
sources n_total n_0_misc        n_1_misc        n_2_misc
gSNP    28079   6800    11093   10186
gIndel  3821    987     1527    1307
sIndel  18      4       7       7
altSplice       114486  30397   45417   38672
Fusion  102     23      41      38
RNAEdit 30498   7099    13333   10066
circRNA 1       0       1       0
Noncoding       9       5       3       1
gSNP-gIndel     644     190     253     201
gSNP-altSplice  2468    610     967     891
gSNP-RNAEdit    1280    135     496     649
gSNP-Noncoding  1       1       0       0
gIndel-altSplice        271     78      107     86
gIndel-RNAEdit  256     53      84      119
gIndel-Noncoding        5       0       2       3
altSplice-RNAEdit       1580    277     722     581
gSNP-gIndel-altSplice   12      5       4       3
gSNP-gIndel-RNAEdit     37      2       16      19

which is not there in the group level summary

CPCG0241_split1_NotCirc.fasta.group.txt
sources n_total n_0_misc        n_1_misc        n_2_misc
NotCirc 183568  46666   74073   62829

There's probably even more edge cases for summarizeFasta but I won't worry about it now

The results are here: /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/2023-08-21_raw/pipeline-meta-call-NonCanonicalPeptide-0.0.1/split/CPCG0241/split1/

Note: /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/2023-08-21_raw/pipeline-meta-call-NonCanonicalPeptide-0.0.1/split/CPCG0241/split4/ has a few additional anomalies

Originally posted by @lydiayliu in https://github.com/uclahs-cds/project-MissingPeptides-Method/issues/240#issuecomment-1694564886

zhuchcn commented 11 months ago

The one peptide recognized as 'circRNA' by summarizeFasta is actually a circRNA from a noncoding transcript. So this was a summarizeFasta issue, and seems like is already fixed. I rerun splitFasta and summarizeFasta and the results look fine. The output is here:

/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CPCGENE/processed/noncanonical-database/call-nonCanonicalPeptide/2023-08-21_raw/pipeline-meta-call-NonCanonicalPeptide-0.0.1/split/CPCG0241/split5/

zhuchcn commented 11 months ago

The peptide is found in many transcripts. I'm simplifying it for the purpose of illustration. Let's say we have a peptide of this below:

> ENST00000477571.5|SE_24151-25129-25220-34752|SNV-25146-T-G|1 CIRC-ENST00000391878.6-E3-E4-E5|38
LARMVSIS

The first label ENST00000477571.5|SE_24151-25129-25220-34752|SNV-25146-T-G|1 is {'altSplic', 'gSNP'}, and the second one CIRC-ENST00000391878.6-E3-E4-E5|38 is {'circRNA'}. Based on our rules, the second label has higher priority because it has 1 source. But if we specify source groups, both altSplice and gSNP will become NotCirc, and then the source of the first label becomes {'NotCirc'}, and it wins over {'circRNA'} this time. So this is why this peptide is put into the NotCirc tier by splitFasta when using --group-source, but counted as circRNA by summarizeFasta when --group-source is not used.

lydiayliu commented 11 months ago

Interesting! As per discussion, we decided to leave summarizeFasta as is and deal with this problem in msptools annotate

For the record, I also switched the group ordering in splitFast

            group_source = 'ALT:SECT,CodonReassign NotCirc:gSNP,gIndel,sSNV,sIndel,Fusion,altSplice,RNAEdit'
            order_source = 'NotCirc,ALT,Noncoding,circRNA'
            max_source_groups = 1
            additional_split = 'Noncoding circRNA NotCirc'

And now when I summarizeFasta on the split but don't use grouping, there are still a few discrepancies. --order-source gSNP,gIndel,sSNV,sIndel,RNAEdit,Fusion,altSplice,CodonReassign,SECT,Noncoding,circRNA

altSplice-CodonReassign in ALT, this I can't really understand

CPCG0100_split_ALT.fasta.txt
sources n_total n_0_misc        n_1_misc        n_2_misc
CodonReassign   424659  106430  168170  150059
SECT    69      18      30      21
altSplice-CodonReassign 2       0       1       1
CodonReassign-SECT      30      5       12      13

circRNA in Noncoding-additional

CPCG0100_split_Noncoding-additional.fasta.txt
sources n_total n_0_misc        n_1_misc        n_2_misc
gSNP-Noncoding  197674  56489   78024   63161
gSNP-circRNA    64      12      33      19
gIndel-Noncoding        36260   11290   13876   11094
gIndel-circRNA  23      4       10      9

I think I understand these though, because if the peptides had NotCirc + NotCirc + Noncoding it would be in this group, but is now assigned as variant + circRNA because of the combination of 2

The Noncoding that shows up in NotCirc should be the same reason

CPCG0100_split_NotCirc-additional.fasta.txt
sources n_total n_0_misc        n_1_misc        n_2_misc
gSNP-CodonReassign      7510    1702    2997    2811
gIndel-CodonReassign    2145    434     885     826
sSNV-CodonReassign      8       1       3       4
RNAEdit-CodonReassign   13507   3031    5359    5117
RNAEdit-SECT    3       1       1       1
Fusion-CodonReassign    5       1       2       2
altSplice-CodonReassign 66819   19907   24847   22065
CodonReassign-Noncoding 7       5       2       0
CPCG0100_split_NotCirc.fasta.txt
sources n_total n_0_misc        n_1_misc        n_2_misc
gSNP    26582   6488    10559   9535
gIndel  3913    1010    1560    1343
sSNV    36      10      12      14
sIndel  6       1       2       3
RNAEdit 30055   6926    11853   11276
Fusion  63      15      24      24
altSplice       144901  38670   57436   48795
Noncoding       12      8       3       1
zhuchcn commented 11 months ago

Yeah exactly. This is complicated.

altSplice-CodonReassign in ALT, this I can't really understand

CPCG0100_split_ALT.fasta.txt
sources n_total n_0_misc        n_1_misc        n_2_misc
CodonReassign   424659  106430  168170  150059
SECT    69      18      30      21
altSplice-CodonReassign 2       0       1       1
CodonReassign-SECT      30      5       12      13

This is when the peptide has a label of {'SECT', 'CodonReassign'} and another label of {'altSplice', 'CodonReassign'}, and the latter has a higher priority without grouping, and they become {'ALT'} and {'NotCirc', 'ALT'} after grouping.

lydiayliu commented 11 months ago

Ahh got it, so they are all for the same (complicated) reasons! I will close this issue and create and new one in msptools since we are not planning to fix summarizeFasta