vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

custom FASTA database #894

Open Miracheng opened 8 months ago

Miracheng commented 8 months ago

Dear Vadim,

The background is as follows: I am using a custom FASTA database whose format is similar to UniProt's but is not annotated; it has only sequence IDs, with no protein names or gene names. Here is an example:

lcl|ORF35_ENST00000650931.1:974:1171|unnamed protein product
MKDLNVKTQTIKTLEENLGNTIQDMGTGKYFMTKMPKAVATKAKIDKWNLIKLKSFCTAKKLSSE
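
For post-processing, a header like the one above can be split into its pipe-separated fields. The following is a minimal Python sketch, assuming the layout is database tag, identifier, description, and assuming the two numbers after the transcript ID are coordinates (the post does not say what they are). `parse_custom_header` is a hypothetical helper and not part of DIA-NN.

```python
def parse_custom_header(header: str) -> dict:
    """Split a 'tag|identifier|description' FASTA header into its fields.

    Assumption: the identifier looks like ORF35_ENST00000650931.1:974:1171,
    i.e. <orf>_<transcript>:<number>:<number>.
    """
    header = header.lstrip(">").strip()
    tag, identifier, description = header.split("|", 2)
    # Split the two trailing numbers off from the right, so that dots or
    # underscores inside the transcript ID are left alone.
    orf_transcript, start, end = identifier.rsplit(":", 2)
    orf, transcript = orf_transcript.split("_", 1)
    return {
        "tag": tag,
        "identifier": identifier,
        "orf": orf,
        "transcript": transcript,
        "start": int(start),  # assumed to be a coordinate
        "end": int(end),      # assumed to be a coordinate
        "description": description,
    }


fields = parse_custom_header(
    "lcl|ORF35_ENST00000650931.1:974:1171|unnamed protein product"
)
```

The `identifier` field is the part that would be the natural candidate for a protein ID when joining back to a report.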

  1. Why did the "protein group" column in the result report not show the protein IDs from the custom database? I found that it outputs UniProt IDs instead. I tried setting "protein inference" to "isoform" or "off", but the result did not change.
  2. When I switched to the UniProt database and compared the results of the two runs, I found that the Protein.Group and Protein.Ids columns and the quantitative results (PG.Quantity, PG.MaxLFQ) were the same in both; only the number of protein names in the output differed. What is the reason for this? I don't understand why the quantitative results are identical when the databases are different.
  3. If I use an R package to replace or add the missing protein IDs, how do I find the correspondence between the IDs and the entries in the report?

Here is my log:

diann.exe --f E:\THC_peptide\MS-DIA\IPX0001444000\Discovery_M\A20180430sunyt_TPD_DIA_b4-12.raw --f E:\THC_peptide\MS-DIA\IPX0001444000\Discovery_M\A20180430sunyt_TPD_DIA_b4-14.raw --lib E:\THC_peptide\MS-DIA\DIA-nn\spectral library\TPD_SPNlibrary_60min_46files_filter20210517_oldos.tsv --threads 4 --verbose 1 --out E:\THC_peptide\MS-DIA\DIA-nn\0102\0102report.tsv --qvalue 0.01 --matrices --out-lib E:\THC_peptide\MS-DIA\DIA-nn\0102\0102.tsv --gen-spec-lib --fasta E:\Select_lncRNA_ORF.fa --met-excision --cut K,R --var-mods 1 --var-mod UniMod:35,15.994915,M --reanalyse --relaxed-prot-inf --smart-profiling --pg-level 0 --peak-center --no-ifs-removal

Thank you for your response, and I wish you a good day.

Kind regards, Mira

vdemichev commented 8 months ago

Hi Mira,

I am not sure I fully understand the nature of the issue; please see below.

  1. What does it look like in the DIA-NN report, and how does it differ from what you would expect?
  2. Is it just the protein identifiers that are different? These do not affect protein grouping and hence do not affect the quantities. I.e. you can of course rename proteins in any way; this will have no effect on the quantities, which makes sense.
  3. Ideally, just use a FASTA database in which all IDs are correct. If that is not possible for some reason, in R you will need to digest the FASTA in silico (I think some regular expression magic can help here, but I have never tried it myself).

Best, Vadim
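
The in silico digest suggested in point 3 could look roughly like the sketch below. It is written in Python rather than R and uses only the standard library. Several things here are assumptions and not DIA-NN's own behaviour: cleavage after every K or R (mirroring `--cut K,R` from the log), at most one missed cleavage, a peptide length range of 7 to 30, the protein ID being the second pipe-separated field of the header, and the report being joined on its `Stripped.Sequence` column. The function names and the `Custom.Protein.Ids` output column are hypothetical.

```python
import csv
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def digest(sequence, missed_cleavages=1, min_len=7, max_len=30):
    """Cleave after every K or R; return the set of peptides in the length range."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])

    peptides = set()
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            pep = "".join(pieces[i:j + 1])
            if min_len <= len(pep) <= max_len:
                peptides.add(pep)
            # N-terminal methionine excision (cf. --met-excision in the log)
            if i == 0 and pep.startswith("M") and min_len <= len(pep) - 1 <= max_len:
                peptides.add(pep[1:])
    return peptides


def peptide_to_proteins(fasta_lines, **digest_kwargs):
    """Map each peptide to the set of protein IDs whose digest contains it."""
    mapping = defaultdict(set)
    for header, seq in read_fasta(fasta_lines):
        parts = header.split("|")
        # Assumption: the ID is the second pipe-separated field when present.
        protein_id = parts[1] if len(parts) > 1 else header.split()[0]
        for pep in digest(seq, **digest_kwargs):
            mapping[pep].add(protein_id)
    return mapping


def annotate_report(report_path, mapping, out_path):
    """Copy a tab-separated report, adding a column with the matched IDs."""
    with open(report_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fields = reader.fieldnames + ["Custom.Protein.Ids"]
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            ids = mapping.get(row["Stripped.Sequence"], set())
            row["Custom.Protein.Ids"] = ";".join(sorted(ids))
            writer.writerow(row)


# The example entry from the question, with the leading ">" assumed.
example = [
    ">lcl|ORF35_ENST00000650931.1:974:1171|unnamed protein product",
    "MKDLNVKTQTIKTLEENLGNTIQDMGTGKYFMTKMPKAVATKAKIDKWNLIKLKSFCTAKKLSSE",
]
mapping = peptide_to_proteins(example)
```

In practice `peptide_to_proteins(open(fasta_path))` would be run on the full FASTA and the result passed to `annotate_report`. A peptide that maps to several entries gets all of their IDs, so this only recovers which identifiers are compatible with each precursor; it does not reproduce DIA-NN's protein grouping.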