vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Combining --fasta-search (FASTA digest) with raw files in one run caused a huge change in the results of DIA-NN 1.9.1 #1163

Closed Gambrian closed 2 months ago

Gambrian commented 2 months ago

Dear Demichev,

This question follows on from another issue I opened earlier (#1162). Before we received DIA-NN parameters from other labs, we ran the searches ourselves and compared versions 1.8.1 and 1.9.1 on the same PC. The results were similar to most other cases I have seen in the issues, with 1.9.1 slightly reducing the number of proteins identified. After we received the 1.8.1 parameters from other labs (they had modified the protein modifications, instrument parameters, etc.), we first repeated the search with 1.8.1. As shown in #1162, we still had 10% fewer identifications than the others. But when we used similar parameters with 1.9.1, we got nearly twice the number of protein identifications. That scared me, so I reran DIA-NN 1.9.1 and got the same result.

These are the parameters we used at the beginning for 1.8.1 and 1.9.1.

1.8.1

diann.exe --lib --threads 12 --verbose 1 --out report.tsv --qvalue 0.01 --matrices --temp skin --out-lib mus.tsv --gen-spec-lib --predictor --reannotate --fasta mus_musculus.fasta --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --met-excision --cut K,R --missed-cleavages 2 --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 1 --max-pr-charge 4 --unimod4 --mass-acc 20.0 --mass-acc-ms1 20 --reanalyse --relaxed-prot-inf --smart-profiling --pg-level 1 --peak-center --no-ifs-removal

1.9.1

diann.exe --lib --threads 4 --verbose 1 --out report.tsv --qvalue 0.01 --matrices --temp Astral_test\1 --out-lib mus.parquet --gen-spec-lib --predictor --reannotate --fasta camprotR_240512_cRAP_20190401_full_tags.fasta --cont-quant-exclude cRAP- --fasta mus_musculus.fasta --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --met-excision --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 1 --max-pr-charge 4 --cut K,R --missed-cleavages 1 --unimod4 --mass-acc 10.0 --mass-acc-ms1 10.0 --peptidoforms --reanalyse --relaxed-prot-inf --smart-profiling --pg-level 1

The result is that 1.9.1 identified about 5% fewer proteins.

These are the parameters we used later for 1.8.1 and 1.9.1.

1.8.1

diann.exe --f * (6 files) --lib --threads 16 --verbose 1 --out outpath\report.tsv --qvalue 0.01 --matrices --out-lib outpath\mus.tsv --gen-spec-lib --predictor --reannotate --fasta mus_musculus.fasta --fasta-search --min-fr-mz 150 --max-fr-mz 2000 --met-excision --cut K,R --missed-cleavages 2 --min-pep-len 5 --max-pep-len 52 --min-pr-mz 380 --max-pr-mz 980 --min-pr-charge 1 --max-pr-charge 6 --unimod4 --var-mods 1 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,n --monitor-mod UniMod:1 --reanalyse --relaxed-prot-inf --smart-profiling --pg-level 1 --peak-center --no-ifs-removal

1.9.1

diann.exe --f * --lib --threads 16 --verbose 1 --out report.tsv --qvalue 0.01 --matrices --temp --out-lib mus.parquet --gen-spec-lib --predictor --reannotate --fasta camprotR_240512_cRAP_20190401_full_tags.fasta --cont-quant-exclude cRAP- --fasta mus_musculus.fasta --fasta-search --min-fr-mz 150 --max-fr-mz 2000 --met-excision --min-pep-len 5 --max-pep-len 52 --min-pr-mz 380 --max-pr-mz 980 --min-pr-charge 1 --max-pr-charge 6 --cut K,R --missed-cleavages 2 --unimod4 --var-mods 1 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,*n --peptidoforms --reanalyse --relaxed-prot-inf --rt-profiling

1.9.1 got twice the number of protein identifications of 1.8.1.
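When two long command lines are compared by eye, small differences are easy to miss. The following is a minimal sketch, not part of DIA-NN, that splits two command strings into flags and values and reports what differs. The function names `parse_flags` and `diff_flags` are hypothetical, and the two example commands are shortened excerpts of the later 1.8.1 and 1.9.1 commands above.

```python
import shlex
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_flags(cmd: str) -> Dict[str, List[str]]:
    """Map each --flag to the values that follow it; repeated flags accumulate."""
    flags: Dict[str, List[str]] = defaultdict(list)
    current = None
    # posix=False keeps Windows backslashes in paths intact
    for tok in shlex.split(cmd, posix=False)[1:]:  # skip the executable name
        if tok.startswith("--"):
            current = tok
            flags[current]  # register the flag even if it takes no value
        elif current is not None:
            flags[current].append(tok)
    return dict(flags)


def diff_flags(old_cmd: str, new_cmd: str) -> Tuple[List[str], List[str], dict]:
    """Return flags only in old, flags only in new, and flags whose values differ."""
    old, new = parse_flags(old_cmd), parse_flags(new_cmd)
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    changed = {f: (old[f], new[f])
               for f in sorted(set(old) & set(new)) if old[f] != new[f]}
    return only_old, only_new, changed


# Shortened excerpts of the commands quoted in this issue
OLD_CMD = ("diann.exe --f * --fasta mus_musculus.fasta --fasta-search "
           "--var-mod UniMod:1,42.010565,n --monitor-mod UniMod:1 "
           "--smart-profiling --peak-center --no-ifs-removal")
NEW_CMD = ("diann.exe --f * --fasta camprotR_240512_cRAP_20190401_full_tags.fasta "
           "--fasta mus_musculus.fasta --fasta-search "
           "--var-mod UniMod:1,42.010565,*n --peptidoforms --rt-profiling")

only_old, only_new, changed = diff_flags(OLD_CMD, NEW_CMD)
print("only in old:", only_old)
print("only in new:", only_new)
print("changed:", changed)
```

A diff like this only shows which settings differ. It says nothing about which difference is responsible for a change in identifications.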

Did 1.9.1 make major changes to some of the parameters I modified later? Is this result normal? Can I use the same parameters for later searches? If it is normal, what kind of statistics should I use to prove to the reviewer that this is a real optimization and not my mistake? Thank you very much!

vdemichev commented 2 months ago

Hi,

--f and --fasta-search must never be used in combination (please see the docs); DIA-NN prints a warning about this. In this case, it seems that on-the-fly FASTA digest coupled with peptidoform scoring produces gibberish output (please don't use it), specifically in 1.9.1 (not 1.9). This will be fixed in 1.9.2. In general, though, it might not be a good idea to copy 1.8.1 settings: some things do change between versions, and settings are not transferable per se.
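As an illustration of the rule above, here is a small sketch of a pre-flight check that could be run on a command string before launching a search. It is not part of DIA-NN, and the function name is hypothetical. It only looks for the two flags named in this thread and does not replace reading the warnings DIA-NN itself prints.

```python
import shlex


def combines_raw_files_with_fasta_digest(cmd: str) -> bool:
    """True if the command passes raw files (--f) and also enables --fasta-search."""
    # posix=False keeps Windows backslashes in paths intact
    tokens = shlex.split(cmd, posix=False)
    # Exact token match, so --fasta is not mistaken for --f
    return "--f" in tokens and "--fasta-search" in tokens


# Shortened excerpts of commands quoted in this issue
combined = "diann.exe --f * --lib --fasta mus_musculus.fasta --fasta-search --peptidoforms"
library_only = "diann.exe --lib --fasta mus_musculus.fasta --fasta-search --gen-spec-lib --predictor"

print(combines_raw_files_with_fasta_digest(combined))      # True
print(combines_raw_files_with_fasta_digest(library_only))  # False
```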

If it is normal, what kind of statistics should I use to prove to the reviewer that this is a real optimization

Please follow https://github.com/vdemichev/DiaNN?tab=readme-ov-file#changing-default-settings. We take specific care to help users avoid configuring DIA-NN incorrectly. The first part of this is the documentation, which describes which settings should be used in which cases and recommends not changing the default settings unless that is recommended for a specific experiment type. In addition, DIA-NN runs internal checks to see whether things make sense and prints warnings to draw the user's attention to anything it thinks might be off. In DIA-NN 1.9, the warnings are also summarised at the end of the log, so it is impossible to miss them. The warnings are best interpreted like this: (i) some warnings are just information, e.g. letting the user know that calibration did not work at a particular stage, which is fine so long as it worked later on, so these are OK to ignore; (ii) some warnings advise checking that things are as intended and/or changing settings, in which case it is recommended to actually carry out those checks; (iii) in some cases, like here, DIA-NN states that what is being done is strongly not recommended, which means there is no known scenario in which it might make sense, and the user needs to fix it (DIA-NN indicates how).

Best, Vadim

Gambrian commented 2 months ago

I used a spectral library that I had generated from the FASTA file beforehand, and it worked. Thank you very much. Hopefully I can serve as a warning to other users. I saw DIA-NN's warning every time, but in my testing, combining --fasta-search (FASTA digest) with raw files in one run increased the number of proteins identified by about 0.3%, a number that fascinated me because it seemed to indicate that I was doing better. But it is not right, and always ignoring warnings, especially ones about things the authors strongly advise against, can lead to mistakes like mine.

Finally, thank you again for your timely help.

Best
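The resolution reached in this thread, generating the library from the FASTA file first and then analysing the raw files against it, can be sketched as two separate command lines. This is a hedged illustration that only uses flags quoted in this issue. The helper name, the raw file names, and the predicted library path are assumptions, and the actual library file written by the first step should be taken from the DIA-NN log. Digest and modification settings are omitted for brevity.

```python
from typing import List, Tuple


def build_two_step_commands(fasta: str, raw_files: List[str], out_lib: str,
                            predicted_lib: str, report: str,
                            exe: str = "diann.exe") -> Tuple[List[str], List[str]]:
    """Build argument lists for (1) library generation and (2) raw file analysis."""
    # Step 1: FASTA digest and library prediction only, no raw files (no --f)
    step1 = [exe, "--fasta", fasta, "--fasta-search",
             "--gen-spec-lib", "--predictor", "--out-lib", out_lib]

    # Step 2: raw files searched against the generated library, no --fasta-search
    step2 = [exe]
    for path in raw_files:
        step2 += ["--f", path]
    step2 += ["--lib", predicted_lib, "--fasta", fasta, "--reannotate",
              "--out", report, "--qvalue", "0.01", "--matrices"]
    return step1, step2


# All file names below are placeholders (assumptions), not taken from a real run
step1, step2 = build_two_step_commands(
    fasta="mus_musculus.fasta",
    raw_files=["sample1.raw", "sample2.raw"],
    out_lib="mus.parquet",
    predicted_lib="mus.predicted.speclib",
    report="report.tsv",
)
print(" ".join(step1))
print(" ".join(step2))
```

Keeping the two steps as separate commands means that no single run contains both --f and --fasta-search.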