DIA-NN1.8.1: Abnormal search results on different platforms

huangcx1539 commented 2 months ago

Hello,

We used DIA-NN 1.8.1 to process the same data on different platforms (Windows / Ubuntu), and all runs completed without errors. Nevertheless, we observed some unexpected differences. The results on Windows were significantly lower, as seen in the table generated by DIA-NN (report.stats.tsv). Is there any difference between the DIA-NN 1.8.1 builds for the different platforms?

The command-line parameters were as follows.

Ubuntu 1.8.1:
/usr/diann/1.8.1/diann-1.8.1 --f XX.d --lib "" --threads 80 --verbose 1 --out "XX_test_report.tsv" --qvalue 0.01 --matrices --out-lib "XX_test_report-lib.tsv" --gen-spec-lib --predictor --fasta "/root/Fasta/R_proteins.fasta" --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --met-excision --cut K,R --missed-cleavages 1 --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 1 --max-pr-charge 4 --unimod4 --reanalyse --relaxed-prot-inf --smart-profiling --peak-center --no-ifs-removal

Windows 1.8.1:
diann.exe --f XX.d --lib --threads 52 --verbose 1 --out XX_test_report.tsv --qvalue 0.01 --matrices --out-lib XX_test_report-lib.tsv --gen-spec-lib --predictor --fasta F:\FASTA\R_proteins.fasta --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --met-excision --cut K,R --missed-cleavages 1 --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 1 --max-pr-charge 4 --unimod4 --reanalyse --relaxed-prot-inf --smart-profiling --peak-center --no-ifs-removal

Best wishes

vdemichev commented 2 months ago

Hi,

I will need full logs to be able to comment.

One note so far: --f XX.d should not be used in combination with --fasta-search, i.e. in silico digest and predicted library generation need to be done in a separate step.

Best Regards, Vadim
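As an illustration of the separation described above, here is a minimal Python sketch that derives two command lines from the original one. The way the flags are divided is an assumption, not something confirmed in this thread: step 1 is the original command without `--f`, and step 2 is the original command without `--fasta-search` and `--predictor`, with `--lib` pointing at the predicted library. The `.predicted.speclib` file name follows the pattern visible in the logs in this thread and may differ in other setups. The command string is an abbreviated copy of the Ubuntu command.

```python
import shlex

# Abbreviated copy of the Ubuntu command from this thread; the remaining
# flags are omitted for brevity and would be carried along unchanged.
ORIGINAL = (
    '/usr/diann/1.8.1/diann-1.8.1 --f XX.d --lib "" --threads 80 --verbose 1 '
    '--out "XX_test_report.tsv" --qvalue 0.01 --matrices '
    '--out-lib "XX_test_report-lib.tsv" --gen-spec-lib --predictor '
    '--fasta "/root/Fasta/R_proteins.fasta" --fasta-search'
)


def drop_flag(args, flag, has_value):
    """Return a copy of args without `flag` (and its value, if it takes one)."""
    out, i = [], 0
    while i < len(args):
        if args[i] == flag:
            i += 2 if has_value else 1
        else:
            out.append(args[i])
            i += 1
    return out


def split_steps(args, predicted_lib):
    """Assumed split: step 1 builds the predicted library, step 2 searches the runs."""
    # Step 1: no raw files, so only the in silico digest and prediction run.
    step1 = drop_flag(args, "--f", has_value=True)
    # Step 2: no FASTA digest or predictor; search against the saved library.
    step2 = drop_flag(args, "--fasta-search", has_value=False)
    step2 = drop_flag(step2, "--predictor", has_value=False)
    lib_at = step2.index("--lib")
    step2[lib_at + 1] = predicted_lib
    return step1, step2


step1, step2 = split_steps(
    shlex.split(ORIGINAL), "XX_test_report-lib.predicted.speclib"
)
print(shlex.join(step1))
print(shlex.join(step2))
```

The two printed lines are meant to be run one after the other; whether additional flags should be dropped from either step is left open here.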

huangcx1539 commented 2 months ago

Hi Vadim!

When I checked the log files, I found some differences in how the FASTA is processed. The details are below. The predicted.speclib results are reproducible (there was no difference in in silico library generation when I tested with other FASTA files).

Windows
14 files will be processed
[0:00] Loading FASTA F:\LYY\FASTA\R_proteins.fasta
[0:17] Processing FASTA
[0:29] Assembling elution groups
[0:52] 4388908 precursors generated
[0:52] Protein names missing for some isoforms
[0:52] Gene names missing for some isoforms
[0:52] Library contains 62681 proteins, and 62681 genes
[0:54]
[1:07]
[19:31]
[21:55]
[22:01]
[22:09] Saving the library to F:\LYY\20240911_MissingValue_test_report-lib.predicted.speclib
[22:19] Initialising library

Ubuntu
14 files will be processed
[0:00] Loading FASTA /root/Fasta/R_proteins.fasta
[0:05] Processing FASTA
[0:09] Assembling elution groups
[0:16] 3721354 precursors generated
[0:16] Protein names missing for some isoforms
[0:16] Gene names missing for some isoforms
[0:16] Library contains 62644 proteins, and 62644 genes
[0:20]
[0:26]
[1:50]
[1:58]
[2:03]
[2:05] Saving the library to /root/rawdata/20240910_MissingValue_RAW/20240911_MissingValue_test_report-lib.predicted.speclib
[2:08] Initialising library

For the same data, in the first pass:

Windows
[22:23] First pass: generating a spectral library from DIA data
[22:23] File #1/14
[22:23] Loading run G:\20240910_MissingValue_RAW\P0031_TOF4_DIA_7798.d
For most diaPASEF datasets it is better to manually fix both the MS1 and MS2 mass accuracies to values in the range 10-15 ppm.
[23:48] 3079354 library precursors are potentially detectable
[23:48] Processing...
[30:38] RT window set to 5.48762
[30:38] Ion mobility window set to 0.0439195
[30:38] Peak width: 4.16
[30:38] Scan window radius set to 9
[30:38] Recommended MS1 mass accuracy setting: 14.8554 ppm
[42:08] Optimised mass accuracy: 11.8946 ppm
[72:27] Removing low confidence identifications
[72:27] Removing interfering precursors
[72:33] Training neural networks: 99067 targets, 59537 decoys
[72:39] Number of IDs at 0.01 FDR: 43848
[72:40] Calculating protein q-values
[72:41] Number of genes identified at 1% FDR: 1941 (precursor-level), 1710 (protein-level) (inference performed using proteotypic peptides only)
[72:41] Quantification

Second pass:
[688:23] 199558 library precursors are potentially detectable
[688:23] Processing...
[688:47] RT window set to 2.03609
[688:47] Ion mobility window set to 0.0365781
[688:47] Recommended MS1 mass accuracy setting: 15.1987 ppm
[689:51] Removing low confidence identifications
[689:51] Removing interfering precursors
[690:00] Training neural networks: 187618 targets, 195801 decoys
[690:22] Number of IDs at 0.01 FDR: 62708
[690:24] Calculating protein q-values
[690:24] Number of genes identified at 1% FDR: 2530 (precursor-level), 2292 (protein-level) (inference performed using proteotypic peptides only)

Ubuntu
[2:11] First pass: generating a spectral library from DIA data
[2:11] File #1/14
[2:11] Loading run /root/rawdata/P0031_TOF4_DIA_7798.d
For most diaPASEF datasets it is better to manually fix both the MS1 and MS2 mass accuracies to values in the range 10-15 ppm.
[2:36] 2599570 library precursors are potentially detectable
[2:36] Processing...
[4:08] RT window set to 5.86977
[4:08] Ion mobility window set to 0.0445228
[4:08] Peak width: 4.1
[4:08] Scan window radius set to 9
[4:09] Recommended MS1 mass accuracy setting: 14.6782 ppm
[7:10] Optimised mass accuracy: 15.0031 ppm
[13:39] Removing low confidence identifications
[13:39] Removing interfering precursors
[13:42] Training neural networks: 86350 targets, 53610 decoys
[13:45] Number of IDs at 0.01 FDR: 38544
[13:46] Calculating protein q-values
[13:46] Number of genes identified at 1% FDR: 2822 (precursor-level), 2626 (protein-level) (inference performed using proteotypic peptides only)
[13:46] Quantification

Second pass:
[141:40] 175856 library precursors are potentially detectable
[141:40] Processing...
[141:47] RT window set to 2.03689
[141:47] Ion mobility window set to 0.0366497
[141:47] Recommended MS1 mass accuracy setting: 14.2302 ppm
[142:02] Removing low confidence identifications
[142:02] Removing interfering precursors
[142:06] Training neural networks: 165832 targets, 173465 decoys
[142:12] Number of IDs at 0.01 FDR: 55632
[142:13] Calculating protein q-values
[142:13] Number of genes identified at 1% FDR: 3819 (precursor-level), 3569 (protein-level) (inference performed using proteotypic peptides only)

With a human FASTA file (no difference between the platforms):

Windows
[0:00] Loading FASTA D:\PD-methods\211203-uniprot-human-filtered-reviewed-20375-ID-iRT.fasta
[0:03] Processing FASTA
[0:11] Assembling elution groups
[0:17] 4288856 precursors generated
[0:18] Gene names missing for some isoforms
[0:18] Library contains 20351 proteins, and 20134 genes
[0:18] Encoding peptides for spectra and RTs prediction
[0:26] Predicting spectra and IMs
[14:48] Predicting RTs
[16:37] Decoding predicted spectra and IMs
[16:42] Decoding RTs
[16:47] Saving the library to F:\HCX\E480\Result\report_lib.predicted.speclib
[16:53] Initialising library

Ubuntu
3 files will be processed
[0:00] Loading FASTA /root/Fasta/211203-uniprot-human-filtered-reviewed-20375-ID-iRT.fasta
[0:03] Processing FASTA
[0:09] Assembling elution groups
[0:13] 4288856 precursors generated
[0:13] Gene names missing for some isoforms
[0:13] Library contains 20351 proteins, and 20134 genes
[0:15]
[0:22]
[3:34]
[3:49]
[3:53]
[3:55] Saving the library to /root/rawdata/DIANN-Test/E480/report-lib.predicted.speclib
[3:58] Initialising library

Best wishes
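The library statistics compared by eye above can also be pulled out of the logs programmatically. The following is a minimal Python sketch that assumes the log wording shown in this thread ("N precursors generated", "Library contains N proteins, and M genes"); the function names are hypothetical and the sample strings are shortened excerpts of the R_proteins logs above.

```python
import re

PRECURSORS = re.compile(r"(\d+) precursors generated")
LIBRARY = re.compile(r"Library contains (\d+) proteins, and (\d+) genes")


def library_stats(log_text):
    """Extract precursor, protein and gene counts from a DIA-NN log excerpt."""
    stats = {}
    match = PRECURSORS.search(log_text)
    if match:
        stats["precursors"] = int(match.group(1))
    match = LIBRARY.search(log_text)
    if match:
        stats["proteins"] = int(match.group(1))
        stats["genes"] = int(match.group(2))
    return stats


def diff_stats(a, b):
    """Return only the fields whose values differ between two logs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Excerpts copied from the R_proteins logs in this thread.
windows_log = (
    "[0:52] 4388908 precursors generated "
    "[0:52] Library contains 62681 proteins, and 62681 genes"
)
ubuntu_log = (
    "[0:16] 3721354 precursors generated "
    "[0:16] Library contains 62644 proteins, and 62644 genes"
)

print(diff_stats(library_stats(windows_log), library_stats(ubuntu_log)))
```

An empty result from `diff_stats` would mean the two platforms built the same library, which is what the human FASTA logs above suggest.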

vdemichev commented 2 months ago

Can you please share the full logs, that is including the command used to launch DIA-NN as well as its full output?

huangcx1539 commented 2 months ago

I need to wait for permission from the data owner, and if they agree, I will send the log file.

vdemichev commented 2 months ago

At least the full header (with --f names removed) and the beginning of DIA-NN output would be helpful, if you could share

huangcx1539 commented 2 months ago

Ubuntu Logs Ubuntu_report.log.txt

huangcx1539 commented 2 months ago

Windows Logs Windows_report.log.txt

vdemichev commented 2 months ago

Indeed a strange observation. I will take a look, but this can take time. Thank you for the logs! Most likely reason: something in the formatting of the FASTA files makes it read differently by the Windows & Linux code.

Best, Vadim

huangcx1539 commented 2 months ago

I compared the format of the fasta files, which originally ended with CRLF as line terminators. After converting the fasta files with dos2unix, although there were no significant changes in the content, I found that Ubuntu could reproduce the results from Windows and also displayed a warning message that had not been seen before (WARNING: 64458 sequences skipped due to duplicate protein ids; use --duplicate-proteins to disable skipping duplicates).
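For readers who want to check which line terminators a FASTA file uses before running a search, here is a minimal Python sketch. It reads the file in binary mode so that nothing is translated on the way in. The function name and the demo file content are hypothetical, and lone CR terminators (old Mac style) are not handled.

```python
import os
import tempfile


def count_line_endings(path):
    """Count lines ending in CRLF, in LF only, or with no terminator at all."""
    counts = {"CRLF": 0, "LF": 0, "none": 0}
    with open(path, "rb") as handle:
        for line in handle:  # binary iteration splits on b"\n" only
            if line.endswith(b"\r\n"):
                counts["CRLF"] += 1
            elif line.endswith(b"\n"):
                counts["LF"] += 1
            else:
                counts["none"] += 1  # final line without a terminator
    return counts


# Demo on a small throwaway file (hypothetical content, not the FASTA from this thread).
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    tmp.write(b">P1\r\nMKTAYIAK\r\n>P2\nQRQISFVK\n")
    demo_path = tmp.name

print(count_line_endings(demo_path))
os.remove(demo_path)
```

A file that reports a mix of CRLF and LF lines has probably been edited on more than one platform and may be worth normalising before use.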

In your opinion, is it necessary to convert the fasta files before using them in Ubuntu? From the current results, it seems that after the conversion, there is a noticeable decrease in protein identification results.

Conversion:

root@huanan:~/Fasta# dos2unix R_sinicus_proteins.fasta
dos2unix: converting file R_sinicus_proteins.fasta to Unix format...

root@huanan:~/Fasta# wc -l R_proteins.fasta R_proteins.fasta.copy
803372 R_proteins.fasta
803372 R_proteins.fasta.copy
1606744 total

(R_proteins.fasta.copy is the file before conversion)
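Equal line counts show that no lines were added or removed, but not that the remaining content is unchanged. A stricter check, sketched below in Python under the assumption that both files fit the usual one-record-per-line layout, compares the two files line by line while ignoring only the trailing CR and LF bytes. The function name is hypothetical.

```python
from itertools import zip_longest


def same_content_ignoring_eol(path_a, path_b):
    """True if both files hold the same lines once CR/LF terminators are stripped."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        for line_a, line_b in zip_longest(file_a, file_b):
            if line_a is None or line_b is None:
                return False  # one file has extra lines
            if line_a.rstrip(b"\r\n") != line_b.rstrip(b"\r\n"):
                return False  # a real content difference
    return True
```

With the file names from this thread, the call would be `same_content_ignoring_eol("R_proteins.fasta", "R_proteins.fasta.copy")`; a result of `True` would confirm that dos2unix changed nothing but the terminators.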

Logs:
1 files will be processed
[0:00] Loading FASTA /root/Fasta/R_proteins.fasta
[0:04] Loading FASTA /root/Fasta/R_proteins.fasta
WARNING: 64458 sequences skipped due to duplicate protein ids; use --duplicate-proteins to disable skipping duplicates
[0:06] Processing FASTA
[0:12] Assembling elution groups
[0:20] 4388908 precursors generated
[0:20] Protein names missing for some isoforms
[0:20] Gene names missing for some isoforms
[0:20] Library contains 62681 proteins, and 62681 genes
[0:23] Encoding peptides for spectra and RTs prediction

Best wishes

vdemichev commented 2 months ago

In your opinion, is it necessary to convert the fasta files before using them in Ubuntu?

I think FASTAs downloaded from UniProt work fine without conversion.

It would be fantastic if you could share the FASTA you were using (the one producing different results). I would then check whether this still occurs with 1.9 and add code to fix it if it does.

huangcx1539 commented 2 months ago

I tested other files. A FASTA downloaded from UniProt initially uses LF as the line terminator. If you convert it with unix2dos or a similar command (from LF to CRLF), the number of generated precursors changes. In the present experiment, the line-terminator format of the FASTA does not affect whether the program runs, but it does affect the results. If you find out which format is more suitable, please let us know your final recommendation.

Human Fasta

Ended with LF as the line terminator
[0:03] Processing FASTA
[0:09] Assembling elution groups
[0:13] 4288856 precursors generated
[0:13] Gene names missing for some isoforms
[0:13] Library contains 20351 proteins, and 20134 genes

Ended with CRLF as the line terminator
[0:00] Loading FASTA /root/Fasta/Test_Human.fasta
[0:03] Processing FASTA
[0:08] Assembling elution groups
[0:13] 3265722 precursors generated
[0:13] Gene names missing for some isoforms
[0:13] Library contains 20322 proteins, and 20110 genes

Best wishes
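One way line terminators can change what a FASTA reader sees is sketched below in Python. This is a toy reader written for illustration, not DIA-NN's parser, and the thread does not establish that DIA-NN 1.8.1 behaves this way. The sketch shows that a reader which strips only LF leaves a stray CR in every header and sequence line of a CRLF file, while a reader which strips both CR and LF returns the same records for either format. The protein ID and sequence are made up.

```python
def read_fasta(lines, strip_chars):
    """Toy FASTA reader: maps each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.rstrip(strip_chars)
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Hypothetical two-line record, once with LF and once with CRLF terminators.
LF_TEXT = ">P1\nMKTAYIAK\nQRQISFVK\n"
CRLF_TEXT = LF_TEXT.replace("\n", "\r\n")

reference = read_fasta(LF_TEXT.splitlines(keepends=True), strip_chars="\n")
naive = read_fasta(CRLF_TEXT.splitlines(keepends=True), strip_chars="\n")
tolerant = read_fasta(CRLF_TEXT.splitlines(keepends=True), strip_chars="\r\n")

print(repr(reference))
print(repr(naive))     # headers and sequences still carry "\r"
print(repr(tolerant))  # identical to the LF reference
```

In the naive case the CR characters sit at every original line break inside the sequence, so any downstream step that splits or validates the sequence could see different peptides. That would be consistent with the lower precursor counts reported for CRLF input, but it remains a hypothesis until the DIA-NN code is checked.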

vdemichev commented 2 months ago

So CRLF does not work properly basically, and since UniProt is LF, it's fine. Many thanks for the info, we will check if this still manifests in 1.9.1 and will fix if it does.

Best, Vadim

vdemichev / DiaNN

DIA-NN1.8.1: Abnormal search results on different platforms #1168