smith-chem-wisc / Spritz

Software for RNA-Seq analysis to create sample-specific proteoform databases from RNA-Seq data
https://smith-chem-wisc.github.io/Spritz/
MIT License

FragPipe-ready fasta headers and redundancy reduction #221

Open MiguelCos opened 3 years ago

MiguelCos commented 3 years ago

Hello @acesnik ,

I am opening this issue to share some thoughts on what I perceive as problems with the format of the fasta file that Spritz outputs when it is used in FragPipe, as first raised in https://github.com/Nesvilab/FragPipe/issues/263.

I am already working on an R script that should solve at least 80% of Problem 1, and I hope to share it here soon (this week).

Problem 1: the headers. The format does not seem to match what FragPipe/Philosopher expects, which is a 'mock' of the UniProt format. I think the mz at the beginning is part of the problem, as is the fact that the descriptions of the variant proteins are extremely long.

My solution is to extract all the variant information into a tabular annotation (something like a reduced version of a BED file) and build a very simple header from it: encode the variant as part of the protein ID section of the header and add a shortened description. The IDs can then be mapped back to the 'reduced BED file' to recover the full identifiers and annotations of each variant.

I also found that some variant peptide sequences are appended to the protein/transcript ID section of the header, which makes the header even longer.
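The idea of a short header plus a lookup table could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the input headers are made-up placeholders (the real Spritz header layout is not reproduced here), `simplify_headers` and the `VAR000001`-style IDs are hypothetical names, and the output mimics the general UniProt pattern `>db|ID|EntryName Description` without any claim about exactly which fields Philosopher requires.

```python
def simplify_headers(records, prefix="VAR"):
    """Replace long headers with short UniProt-like ones and keep a lookup table.

    records: iterable of (original_header, sequence) pairs.
    Returns (fasta_text, lookup_rows).
    """
    fasta_lines = []
    lookup_rows = []
    for index, (original_header, sequence) in enumerate(records, start=1):
        # Hypothetical short identifier; the variant details live in the lookup table.
        short_id = f"{prefix}{index:06d}"
        fasta_lines.append(f">sp|{short_id}|{short_id}_SAMPLE Variant protein {index}")
        fasta_lines.append(sequence)
        lookup_rows.append(
            {"short_id": short_id, "original_header": original_header.lstrip(">")}
        )
    return "\n".join(fasta_lines) + "\n", lookup_rows


# Placeholder headers, NOT real Spritz output.
records = [
    (">mz|HYPOTHETICAL_ID_1|a very long variant description ...",
     "LADAGALADRIELGANDALFKTHEGIMLIK"),
    (">mz|HYPOTHETICAL_ID_2|another very long variant description ...",
     "LADYGALADRIELGENDALFKTHEGIMLIK"),
]

fasta_text, lookup = simplify_headers(records)
print(fasta_text)
for row in lookup:
    print(row["short_id"], "->", row["original_header"])
```

Keeping the original header verbatim in the table means no variant annotation is lost; it is simply moved out of the fasta file.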

Problem 2: Redundancy

I am trying to describe the problem the best I can here:

The output from Spritz looks like this (allow me a pop reference):

>Protein_X1_wt
LADYGALADRIELGANDALFKTHEGIMLIK
>Protein_X1_var1
LADAGALADRIELGANDALFKTHEGIMLIK
>Protein_X1_var2
LADYGALADRIELGENDALFKTHEGIMLIK

This means that protein/transcript X1 has 3 versions: one WT and two variants. Each variant falls in a different tryptic peptide.

I would like to have all variants for a protein summarized in a single 'variant' protein entry. That would make it easier to filter identified variants by their unique peptides, and it would also reduce the search space. In the end, our evidence for a sequence variant is the identification of its tryptic peptide, so I don't think it is necessary to have a separate protein entry for each called variant.

>Protein_X1_wt
LADYGALADRIELGANDALFKTHEGIMLIK
>Protein_X1_var1_n_var2
LADAGALADRIELGENDALFKTHEGIMLIK
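
To illustrate the point about unique tryptic peptides, here is a minimal in silico digest of the four sequences above in Python. It assumes a simplified trypsin rule (cut after K or R, but not before P, no missed cleavages), and `tryptic_peptides` is a hypothetical helper name, not something from Spritz or FragPipe.

```python
import re


def tryptic_peptides(sequence):
    """Fully tryptic digest, no missed cleavages: cut after K or R unless followed by P."""
    return {peptide for peptide in re.split(r"(?<=[KR])(?!P)", sequence) if peptide}


entries = {
    "Protein_X1_wt": "LADYGALADRIELGANDALFKTHEGIMLIK",
    "Protein_X1_var1": "LADAGALADRIELGANDALFKTHEGIMLIK",
    "Protein_X1_var2": "LADYGALADRIELGENDALFKTHEGIMLIK",
    "Protein_X1_var1_n_var2": "LADAGALADRIELGENDALFKTHEGIMLIK",
}

wt_peptides = tryptic_peptides(entries["Protein_X1_wt"])

# Peptides that each non-WT entry contributes beyond the WT entry.
unique_to_variant = {
    name: tryptic_peptides(sequence) - wt_peptides
    for name, sequence in entries.items()
    if name != "Protein_X1_wt"
}

for name, peptides in unique_to_variant.items():
    print(name, sorted(peptides))
```

In this toy case the merged entry carries the same two variant peptides as the two separate entries together, because the variants sit in different tryptic peptides.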

Does it make sense and do you think it is actually a problem?

I'll share here my partial solution to problem 1 as soon as I have it.

Best wishes, Miguel

acesnik commented 3 years ago

Hi @MiguelCos,

Thanks for the message!

Having a lookup table for the variants sounds like a good idea, for sure.

On the redundancy, one thing to be careful about is that Spritz does perform some combinatorics with heterozygous variations. It amends sequences with homozygous variations, and since both the reference and alternate allele could be possible for heterozygous variations, it expands the combinations of those possible peptides. Some of those combinations may be lost if combining all the variants into a single entry.
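
A rough Python sketch of that kind of expansion, for illustration only: this is not Spritz's actual implementation, `expand_variants` and `apply_substitutions` are hypothetical names, variants are limited to single amino acid substitutions, and treating the two variants from the example above as heterozygous is an assumption.

```python
from itertools import product


def apply_substitutions(sequence, substitutions):
    """Apply (1-based position, alternate residue) substitutions to a sequence."""
    residues = list(sequence)
    for position, alternate in substitutions:
        residues[position - 1] = alternate
    return "".join(residues)


def expand_variants(reference, homozygous, heterozygous):
    """Always apply homozygous variants; enumerate ref/alt choices for heterozygous ones."""
    base = apply_substitutions(reference, homozygous)
    sequences = []
    for choices in product([False, True], repeat=len(heterozygous)):
        chosen = [variant for variant, use_alt in zip(heterozygous, choices) if use_alt]
        sequences.append(apply_substitutions(base, chosen))
    return sequences


reference = "LADYGALADRIELGANDALFKTHEGIMLIK"
# Assumed heterozygous for this example: Y4A (var1) and A15E (var2).
combinations = expand_variants(reference, homozygous=[], heterozygous=[(4, "A"), (15, "E")])
for sequence in combinations:
    print(sequence)
```

Two heterozygous sites give four sequences: the reference, each single variant, and the double variant. If both sites fell inside the same tryptic peptide, a single merged entry would contain only the double-variant peptide, so the single-variant peptides would no longer be searchable.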

Anthony

acesnik commented 3 years ago

Are you using combined.spritz.snpeff.protein.fasta or combined.spritz.snpeff.protein.withdecoys.fasta?

MiguelCos commented 3 years ago

Hello Anthony,

I have been using the combined.spritz.snpeff.protein.withdecoys.fasta.

acesnik commented 3 years ago

That's great. Thanks for the info!

MiguelCos commented 3 years ago

Hello Anthony @acesnik

I just finished an R script that adapts the combined.spritz.snpeff.protein.withdecoys.fasta to a format suitable for FragPipe.

https://github.com/MiguelCos/spritz_fasta_2_fragpipe_adaptation

The repo contains a small sample fasta and the sample output.

If you check the annotation file, you will see that I didn't give particularly meaningful names to the columns, because I am not sure how to refer to each piece of information associated with a variant. Is there a way I can learn how to interpret those fields and what their actual 'names' are?

I used the script on two different datasets, and in both cases Philosopher seemed to parse the fasta properly (it didn't crash when using the LFQ pipeline, and the TMT report tables were properly generated using the TMT pipeline). I still need to look a bit closer, but in general it seems to be working as it should.

Also, many thanks for your clarification regarding the redundancy 'problem'. It then makes sense to keep the variant sequences as they are!

Best wishes, Miguel