vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

protein molecular weight? #1010

Open jingjing1030 opened 4 months ago

jingjing1030 commented 4 months ago

Hi diann team, can I get protein molecular weight from Diann output file?

Thanks,

Jing

vdemichev commented 4 months ago

Hi Jing,

No, there is no such function. Here I would recommend loading the FASTA file in R or Python with some specialised package and checking whether there are packages for calculating the molecular weight.

Best, Vadim
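For readers who want to try the approach suggested above without extra dependencies, here is a minimal Python sketch. It is not part of DIA-NN. The function names `read_fasta` and `molecular_weight` are made up for illustration, the residue masses are the standard average masses, and the result is for an unmodified linear chain (no PTMs, no signal-peptide removal).

```python
# Standard average residue masses in Da (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # one water is added back for the free N- and C-termini


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def molecular_weight(sequence):
    """Average molecular weight in Da of an unmodified protein sequence.

    Raises KeyError for non-standard residue codes such as X, B, Z or U.
    """
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


# Hypothetical two-line record; with a real file, pass open("proteins.fasta").
example = [">sp|P00001|TEST_HUMAN Hypothetical protein", "GAS", "PV"]
weights = {header: molecular_weight(seq) for header, seq in read_fasta(example)}
```

Established packages (for example Biopython in Python) offer the same calculation with more options, so the sketch is mainly useful for seeing what the number means.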

fstein commented 3 months ago

Dear Vadim,

I am sorry, but shouldn't you, as the DIA-NN developer, include the molecular weight in the DIA-NN output tables in future versions? Most other software tools offer this, and I would consider it a standard feature. I would also like to see a proper protein description column in the outputs (or at least a column with the complete FASTA header so that one can parse this information later). Moreover, the position of the identified peptide within the protein sequence is also missing, and it is highly valuable. Thanks a lot for your consideration.

Best,

Frank

vdemichev commented 3 months ago

Hi Frank,

In theory, this is of course possible. However, there is the following consideration here. FASTA information is trivial to extract in R or Python: just use some package for handling protein FASTAs and then match the protein IDs in the DIA-NN report to those in the FASTA. This is several lines of R code. We could of course implement this in DIA-NN, but then how should it be reported? Even inferred protein groups often contain multiple IDs, so should we output the FASTA headers for all of those, concatenate them and put them in a particular report column? This would inflate the report size, which is relevant for huge experiments even with the .parquet format. On the other hand, as noted above, it's trivial for users to get that info if they need it. That's why we don't do this in DIA-NN at the moment. What we do plan to do is write some R tutorials about how to do such things.
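As an illustration of the ID matching described above, here is a minimal Python sketch (the thread mentions R, and the same logic carries over). It assumes that the report lists the members of a protein group as semicolon-separated IDs, as in a `Protein.Ids` column, and that the FASTA uses UniProt-style headers. The helper names `accession_of` and `headers_for_group` and the example accessions are hypothetical.

```python
def accession_of(header):
    """Extract the accession from a FASTA header (without the leading '>').

    Assumes UniProt-style 'sp|P12345|NAME_HUMAN description'; for any other
    format, falls back to the first whitespace-delimited token.
    """
    tokens = header.split()
    if not tokens:
        return ""
    parts = tokens[0].split("|")
    return parts[1] if len(parts) >= 3 else tokens[0]


def headers_for_group(protein_ids, header_by_accession):
    """Map each ID of a semicolon-separated protein group to its FASTA header.

    IDs that are not present in the FASTA map to None.
    """
    return {
        pid: header_by_accession.get(pid)
        for pid in protein_ids.split(";")
        if pid
    }


# Hypothetical headers; in practice these come from the FASTA used in the search.
fasta_headers = [
    "sp|P11111|AAA_HUMAN Hypothetical protein A OS=Homo sapiens",
    "sp|P22222|BBB_HUMAN Hypothetical protein B OS=Homo sapiens",
]
header_by_accession = {accession_of(h): h for h in fasta_headers}

# One protein group with three members, one of them missing from the FASTA.
annotated = headers_for_group("P11111;P22222;P33333", header_by_accession)
```

Keeping the result as one entry per ID, rather than one concatenated string, sidesteps the reporting question raised above: the user decides how to collapse multi-ID groups.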

Position of the peptide: an often-requested feature. It is less demanding in terms of report size, so we will very likely add this. It is also a bit challenging in situations when a peptide is matched to multiple positions within some proteins in a protein group: the format then becomes quite a pain for the user to parse in R anyway, while it still takes little code to match peptides to protein sequences using some FASTA-reading R package.
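The peptide-to-sequence matching mentioned here can be sketched in a few lines of Python. This is an illustration, not DIA-NN behaviour. It assumes the peptide is given without modifications (as in a `Stripped.Sequence` column), that group members are semicolon-separated, and that matching is exact; the function names and sequences are hypothetical.

```python
def peptide_positions(peptide, protein_sequence):
    """Return all 1-based start positions of the peptide, overlaps included."""
    if not peptide:
        return []
    positions = []
    start = protein_sequence.find(peptide)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = protein_sequence.find(peptide, start + 1)
    return positions


def positions_in_group(peptide, protein_ids, sequence_by_accession):
    """Locate a peptide in every member of a semicolon-separated protein group.

    Members that are missing from the sequence lookup are skipped.
    """
    result = {}
    for pid in protein_ids.split(";"):
        sequence = sequence_by_accession.get(pid)
        if sequence is not None:
            result[pid] = peptide_positions(peptide, sequence)
    return result


# Hypothetical sequences keyed by accession.
sequence_by_accession = {"P11111": "MAKAAKR", "P22222": "MGSSPVAK"}
hits = positions_in_group("AK", "P11111;P22222", sequence_by_accession)
```

A dictionary of lists keeps the multi-position case explicit instead of packing it into a single delimited string.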

Best, Vadim

fstein commented 3 months ago

Hi Vadim,

Thanks for your super quick reply. I agree with you that it's a very simple task to do. It would be even simpler if you would output a small FASTA file with only the proteins you identified into the same folder as the report.tsv file. This would make my life much simpler, as the FASTA file is very often not stored at the location of the report.tsv output file. I also like the idea of having some R tutorials for users, as this would enable them to dig deeper into their data. My only consideration would be: if many people now create their own post-analysis scripts to get more or less the same output parameters, wouldn't it save a lot of total development time if you included certain things in the standard output of DIA-NN? You could, for example, have another output file that only annotates the identified peptides with additional information such as molecular weight, hydrophobicity, isoelectric point, position in the sequence etc. People could then use this file as a look-up table in case they are interested, and it would not inflate any other file. Here you could also report these values for all matched proteins, or only for the proteins with the highest evidence (similar to Occam's razor).

But would you consider including the FASTA header in e.g. the report.tsv file? This way, if FASTA files with different header formats are used, you don't have to care so much about proper parsing.

Best, Frank
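Until such an output exists, the "small FASTA with only the identified proteins" can be produced by the user. The following Python sketch shows one way under stated assumptions: protein groups are semicolon-separated IDs (as in a `Protein.Ids` column), and the FASTA has already been parsed into (accession, header, sequence) tuples. The function names, accessions and sequences are hypothetical.

```python
import io


def identified_accessions(protein_id_column):
    """Collect the unique IDs from an iterable of semicolon-separated groups."""
    ids = set()
    for group in protein_id_column:
        ids.update(pid for pid in group.split(";") if pid)
    return ids


def write_subset_fasta(records, keep, out, width=60):
    """Write only the records whose accession is in `keep`.

    `records` is an iterable of (accession, header, sequence) tuples and
    `out` is any writable text stream. Returns the number of records written.
    """
    written = 0
    for accession, header, sequence in records:
        if accession in keep:
            out.write(f">{header}\n")
            for i in range(0, len(sequence), width):
                out.write(sequence[i:i + width] + "\n")
            written += 1
    return written


# Hypothetical parsed FASTA records and report column values.
records = [
    ("P11111", "sp|P11111|AAA_HUMAN Hypothetical protein A", "MAKAAKR"),
    ("P22222", "sp|P22222|BBB_HUMAN Hypothetical protein B", "MGSSPV"),
]
keep = identified_accessions(["P11111", "P11111;P33333"])

buffer = io.StringIO()  # stands in for open("identified.fasta", "w")
n_written = write_subset_fasta(records, keep, buffer)
subset_text = buffer.getvalue()
```

Writing the subset next to report.tsv keeps the report and the sequences it refers to together, which addresses the problem of the original FASTA being stored elsewhere.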

vdemichev commented 3 months ago

Hi Frank,

Outputting a FASTA is actually a superb idea, I think!!! About peptide properties: yes, I guess this is also a great idea, and a lot easier than including them in the main report. Many thanks for the suggestions, will add them to the to-do list.

Best, Vadim

fstein commented 3 months ago

Hi Vadim,

Thanks a lot. I am glad you like the suggestions :-)

Best,

Frank