vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

fasta problem in library preparation #649

Open tasos2310 opened 1 year ago

tasos2310 commented 1 year ago

Hi all, First of all, thanks for all your work on DIA-NN; it makes our lives much easier. I have run into a problem with some FASTA files that I want to use to generate libraries. My FASTA files are protein sequences from NCBI (one of them is attached); unfortunately the organism's proteome is not in UniProt. We also have some FASTA files generated from our own data. All of these files follow the UniProt format. The issue is that from a file containing 14990 proteins the software reads only 1977, whereas if I load a FASTA file from UniProt everything works fine. I assume that it is a matter of format. As for my workflow, the data are DIA data from a timsTOF, which I converted to mzML. Any advice would be extremely helpful. Thank you in advance, Best regards, Tasos sequence (1).txt
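A quick way to narrow down a suspected format problem is to count the headers in the FASTA file and see how many of them look UniProt-style. The sketch below does that in Python. The regular expression is only an assumption about what a UniProt-style header looks like (`>sp|ACCESSION|ENTRY_NAME ...` or `>tr|...`); it is not DIA-NN's own parser, and the function name is made up for illustration.

```python
import re

# Assumed UniProt-style header: >sp|ACCESSION|ENTRY_NAME description
# (this pattern is an illustration, not the rule DIA-NN applies)
UNIPROT_HEADER = re.compile(r"^>(sp|tr)\|([^|\s]+)\|(\S+)")


def summarize_fasta_headers(lines):
    """Count headers, UniProt-style headers, and distinct sequence IDs."""
    headers = 0
    uniprot_style = 0
    ids = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        headers += 1
        match = UNIPROT_HEADER.match(line)
        if match:
            uniprot_style += 1
            ids.add(match.group(2))  # the accession field
        else:
            # Fall back to the first whitespace-delimited token after '>'
            tokens = line[1:].split()
            ids.add(tokens[0] if tokens else "")
    return {
        "headers": headers,
        "uniprot_style": uniprot_style,
        "distinct_ids": len(ids),
    }
```

Calling it as `summarize_fasta_headers(open("my_proteome.fasta"))` (the file name is a placeholder) gives three numbers to compare. If `distinct_ids` is much smaller than `headers`, many entries share an ID under this parsing, which would be one possible reason for a lower protein count than expected.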

vdemichev commented 1 year ago

Hi Tasos,

DIA-NN should read sequence IDs correctly, and also will correctly perform protein grouping if protein inference strategy is set to 'isoforms'. So what will remain is to annotate those protein groups in DIA-NN report using some kind of R or Python script.
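A minimal Python sketch of such an annotation script is given here, using only the standard library. It makes several assumptions that should be checked against the actual files: that the report is tab-separated and has a `Protein.Group` column holding semicolon-separated IDs, that for `db|ACCESSION|NAME` headers the ID is the accession and otherwise the first token after `>`, and that the added column name `Annotation` is free. All function names are hypothetical.

```python
import csv


def parse_fasta_descriptions(lines):
    """Map each sequence ID to the free-text description in its header."""
    descriptions = {}
    for line in lines:
        if not line.startswith(">"):
            continue
        token, _, rest = line[1:].strip().partition(" ")
        parts = token.split("|")
        # Assumption: for 'db|ACCESSION|NAME' headers the ID is the accession.
        seq_id = parts[1] if len(parts) >= 3 else token
        descriptions[seq_id] = rest
    return descriptions


def annotate_rows(rows, descriptions, group_column="Protein.Group"):
    """Yield copies of the report rows with an added 'Annotation' column."""
    for row in rows:
        ids = row[group_column].split(";")
        annotated = dict(row)
        # Unknown IDs get an empty description so positions stay aligned.
        annotated["Annotation"] = ";".join(descriptions.get(i, "") for i in ids)
        yield annotated


def annotate_report(report_path, fasta_path, out_path):
    """Read a tab-separated report and write an annotated copy."""
    with open(fasta_path) as fasta:
        descriptions = parse_fasta_descriptions(fasta)
    with open(report_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = list(reader.fieldnames) + ["Annotation"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(annotate_rows(reader, descriptions))
```

The original report is left untouched and the annotated copy goes to `out_path`, so the step can be rerun if the ID parsing assumption turns out to be wrong for a given FASTA file.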

timsTOF data should be read by DIA-NN natively; it will not work with mzML.

Best, Vadim

tasos2310 commented 1 year ago

Hi Vadim, Thank you for your quick response. I understand the isoforms strategy; the final goal of my study is to detect different isoforms, so I already use it with the specific FASTA files that I have for this purpose. But the example that I gave you is just a list of protein sequences from NCBI, which is why I was worried. I will try the isoforms strategy with it too. As for the data, I ran them with the .d files at first; I only tried changing the file format in case something there was causing the problem. Best, Tasos



vdemichev commented 1 year ago

If it does not work as suggested, please share the DIA-NN log, but in theory it should be fine.

tasos2310 commented 1 year ago

Thank you. It is running right now; when I have the log file, I will attach it. Best, Tasos



tasos2310 commented 1 year ago

Hi Vadim, It seems that in the end the problem is with my data: I get errors for all my files. At least the library seems to contain the number of proteins that I expect. I will try to clarify the problem with the files with the institute that I received them from. If you have any suggestions, please let me know. Best, Tasos



diann.exe --f "D:\timTofDIA3.28.2023\240323_tasos_Dia30spd_All_S1-B1_1_4245.d " --f "D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f1_S1-B2_1_4246.d " --f "D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f2_S1-B3_1_4247.d " --f "D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f3_S1-B4_1_4248.d " --f "D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f4_S1-B5_1_4249.d " --f "D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f5_S1-B6_1_4250.d " --lib "" --threads 4 --verbose 1 --out "G:\My Drive\Davis 2021\postdoc 2021 Davis\EcP2 project\proteomics\timTof_3_38_2023\DIANN_IsoformIDs\report.tsv" --qvalue 0.01 --matrices --out-lib "G:\My Drive\Davis 2021\postdoc 2021 Davis\EcP2 project\proteomics\timTof_3_38_2023\DIANN_IsoformIDs\report-lib.tsv" --gen-spec-lib --predictor --fasta "D:\Fungi-tomato\UP000004994_4081.fasta\UP000004994_4081.fasta" --fasta "C:\Users\samar\Downloads\cfulv_race5_uniprot_format.fasta" --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --met-excision --cut K,R --missed-cleavages 1 --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 1 --max-pr-charge 4 --unimod4 --mass-acc 10 --mass-acc-ms1 20 --reanalyse --relaxed-prot-inf --smart-profiling --pg-level 0 --peak-center --no-ifs-removal
DIA-NN 1.8.1 (Data-Independent Acquisition by Neural Networks)
Compiled on Apr 14 2022 15:31:19
Current date and time: Thu Mar 30 11:31:32 2023
CPU: GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
SIMD instructions: AVX AVX2 AVX512CD AVX512F FMA SSE4.1 SSE4.2
Logical CPU cores: 8
Thread number set to 4
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
Deep learning will be used to generate a new in silico spectral library from peptides provided
Library-free search enabled
Min fragment m/z set to 200
Max fragment m/z set to 1800
N-terminal methionine excision enabled
In silico digest will involve cuts at K,R
Maximum number of missed cleavages set to 1
Min peptide length set to 7
Max peptide length set to 30
Min precursor m/z set to 300
Max precursor m/z set to 1800
Min precursor charge set to 1
Max precursor charge set to 4
Cysteine carbamidomethylation enabled as a fixed modification
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
Highly heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers; use with caution for anything else
When generating a spectral library, in silico predicted spectra will be retained if deemed more reliable than experimental ones
Implicit protein grouping: isoform IDs; this determines which peptides are considered 'proteotypic' and thus affects protein FDR calculation
Fixed-width center of each elution peak will be used for quantification
Interference removal from fragment elution curves disabled
Mass accuracy will be fixed to 1e-05 (MS2) and 2e-05 (MS1)
Exclusion of fragments shared between heavy and light peptides from quantification is not supported in FASTA digest mode - disabled; to enable, generate an in silico predicted spectral library and analyse with this library

6 files will be processed
[0:00] Loading FASTA D:\Fungi-tomato\UP000004994_4081.fasta\UP000004994_4081.fasta
[0:05] Loading FASTA C:\Users\samar\Downloads\cfulv_race5_uniprot_format.fasta
[0:11] Processing FASTA
[0:29] Assembling elution groups
[0:45] 7204366 precursors generated
[0:45] Gene names missing for some isoforms
[0:45] Library contains 49533 proteins, and 12779 genes
[0:46] Encoding peptides for spectra and RTs prediction
[1:06] Predicting spectra and IMs
[79:18] Predicting RTs
[88:09] Decoding predicted spectra and IMs
[89:11] Decoding RTs
[89:24] Saving the library to G:\My Drive\Davis 2021\postdoc 2021 Davis\EcP2 project\proteomics\timTof_3_38_2023\DIANN_IsoformIDs\report-lib.predicted.speclib
[90:05] Initialising library

[90:10] First pass: generating a spectral library from DIA data
[90:10] File #1/6
[90:10] Loading run D:\timTofDIA3.28.2023\240323_tasos_Dia30spd_All_S1-B1_1_4245.d
For most diaPASEF datasets it is better to manually fix both the MS1 and MS2 mass accuracies to values in the range 10-15 ppm.
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia30spd_All_S1-B1_1_4245.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia30spd_All_S1-B1_1_4245.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia30spd_All_S1-B1_1_4245.d
ERROR: either the dia-PASEF file is damaged or a .dia file produced by an older DIA-NN version has been loaded: performance might be suboptimal, regenerate the .dia file using this DIA-NN version
WARNING: incorrectly recorded isolation window margins
[90:13] 0 library precursors are potentially detectable
[90:13] Processing...
[90:13] Removing low confidence identifications
[90:13] Removing interfering precursors
[90:13] Too few confident identifications, neural networks will not be used
[90:13] Number of IDs at 0.01 FDR: 0
[90:13] Calculating protein q-values
[90:13] Number of protein isoforms identified at 1% FDR: 0 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[90:13] Quantification
[90:13] Quantification information saved to D:\timTofDIA3.28.2023\240323_tasos_Dia30spd_All_S1-B1_1_4245.d.quant.

[90:13] File #2/6
[90:13] Loading run D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f1_S1-B2_1_4246.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f1_S1-B2_1_4246.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f1_S1-B2_1_4246.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f1_S1-B2_1_4246.d
ERROR: either the dia-PASEF file is damaged or a .dia file produced by an older DIA-NN version has been loaded: performance might be suboptimal, regenerate the .dia file using this DIA-NN version
WARNING: incorrectly recorded isolation window margins
[90:15] 0 library precursors are potentially detectable
[90:15] Processing...
[90:15] Removing low confidence identifications
[90:15] Removing interfering precursors
[90:15] Too few confident identifications, neural networks will not be used
[90:15] Number of IDs at 0.01 FDR: 0
[90:15] Calculating protein q-values
[90:15] Number of protein isoforms identified at 1% FDR: 0 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[90:15] Quantification
[90:15] Quantification information saved to D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f1_S1-B2_1_4246.d.quant.

[90:15] File #3/6
[90:15] Loading run D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f2_S1-B3_1_4247.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f2_S1-B3_1_4247.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f2_S1-B3_1_4247.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f2_S1-B3_1_4247.d
ERROR: either the dia-PASEF file is damaged or a .dia file produced by an older DIA-NN version has been loaded: performance might be suboptimal, regenerate the .dia file using this DIA-NN version
WARNING: incorrectly recorded isolation window margins
[90:17] 0 library precursors are potentially detectable
[90:17] Processing...
[90:17] Removing low confidence identifications
[90:17] Removing interfering precursors
[90:17] Too few confident identifications, neural networks will not be used
[90:17] Number of IDs at 0.01 FDR: 0
[90:17] Calculating protein q-values
[90:17] Number of protein isoforms identified at 1% FDR: 0 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[90:17] Quantification
[90:17] Quantification information saved to D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f2_S1-B3_1_4247.d.quant.

[90:17] File #4/6
[90:17] Loading run D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f3_S1-B4_1_4248.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f3_S1-B4_1_4248.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f3_S1-B4_1_4248.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f3_S1-B4_1_4248.d
ERROR: either the dia-PASEF file is damaged or a .dia file produced by an older DIA-NN version has been loaded: performance might be suboptimal, regenerate the .dia file using this DIA-NN version
WARNING: incorrectly recorded isolation window margins
[90:18] 0 library precursors are potentially detectable
[90:19] Processing...
[90:19] Removing low confidence identifications
[90:19] Removing interfering precursors
[90:19] Too few confident identifications, neural networks will not be used
[90:19] Number of IDs at 0.01 FDR: 0
[90:19] Calculating protein q-values
[90:19] Number of protein isoforms identified at 1% FDR: 0 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[90:19] Quantification
[90:19] Quantification information saved to D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f3_S1-B4_1_4248.d.quant.

[90:19] File #5/6
[90:19] Loading run D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f4_S1-B5_1_4249.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f4_S1-B5_1_4249.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f4_S1-B5_1_4249.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f4_S1-B5_1_4249.d
ERROR: either the dia-PASEF file is damaged or a .dia file produced by an older DIA-NN version has been loaded: performance might be suboptimal, regenerate the .dia file using this DIA-NN version
WARNING: incorrectly recorded isolation window margins
[90:20] 0 library precursors are potentially detectable
[90:20] Processing...
[90:20] Removing low confidence identifications
[90:20] Removing interfering precursors
[90:20] Too few confident identifications, neural networks will not be used
[90:20] Number of IDs at 0.01 FDR: 0
[90:20] Calculating protein q-values
[90:20] Number of protein isoforms identified at 1% FDR: 0 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[90:20] Quantification
[90:20] Quantification information saved to D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f4_S1-B5_1_4249.d.quant.

[90:20] File #6/6
[90:20] Loading run D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f5_S1-B6_1_4250.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f5_S1-B6_1_4250.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f5_S1-B6_1_4250.d
ERROR: cannot open the raw data folder D:/timTofDIA3.28.2023/240323_tasos_Dia60spd_f5_S1-B6_1_4250.d
ERROR: either the dia-PASEF file is damaged or a .dia file produced by an older DIA-NN version has been loaded: performance might be suboptimal, regenerate the .dia file using this DIA-NN version
WARNING: incorrectly recorded isolation window margins
[90:22] 0 library precursors are potentially detectable
[90:22] Processing...
[90:22] Removing low confidence identifications
[90:22] Removing interfering precursors
[90:22] Too few confident identifications, neural networks will not be used
[90:22] Number of IDs at 0.01 FDR: 0
[90:22] Calculating protein q-values
[90:22] Number of protein isoforms identified at 1% FDR: 0 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[90:22] Quantification
[90:22] Quantification information saved to D:\timTofDIA3.28.2023\240323_tasos_Dia60spd_f5_S1-B6_1_4250.d.quant.

[90:22] Cross-run analysis
[90:22] Reading quantification information: 6 files
ERROR: a .quant file was obtained using a different spectral library / different library-free search settings

DIA-NN exited
DIA-NN-plotter.exe "G:\My Drive\Davis 2021\postdoc 2021 Davis\EcP2 project\proteomics\timTof_3_38_2023\DIANN_IsoformIDs\report.stats.tsv" "G:\My Drive\Davis 2021\postdoc 2021 Davis\EcP2 project\proteomics\timTof_3_38_2023\DIANN_IsoformIDs\report.tsv" "G:\My Drive\Davis 2021\postdoc 2021 Davis\EcP2 project\proteomics\timTof_3_38_2023\DIANN_IsoformIDs\report.pdf"
PDF report will be generated in the background
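
Every run in the log fails with "cannot open the raw data folder", and the pasted command shows a space before the closing quote of each `--f` argument; whether that space is in the real command or only an artifact of pasting is not clear from the thread. A small Python check along the following lines could help rule out path problems before rerunning. It assumes a Bruker .d folder should contain `analysis.tdf` and `analysis.tdf_bin`, and it takes the `<run>.d.quant` naming from the "Quantification information saved to" lines in the log; the function names are hypothetical.

```python
from pathlib import Path

# Files expected inside a timsTOF .d folder (assumption for this check)
REQUIRED = ("analysis.tdf", "analysis.tdf_bin")


def check_raw_folder(raw_path):
    """Return a list of human-readable problems for one .d path string."""
    problems = []
    if raw_path != raw_path.strip():
        problems.append("path has leading or trailing whitespace")
    folder = Path(raw_path.strip())
    if not folder.is_dir():
        problems.append("folder not found")
        return problems
    for name in REQUIRED:
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    return problems


def existing_quant_file(raw_path):
    """Return the '<run>.d.quant' file next to the raw folder, or None."""
    candidate = Path(raw_path.strip() + ".quant")
    return candidate if candidate.is_file() else None
```

Running `check_raw_folder` over the six `--f` paths exactly as they appear in the command would flag stray whitespace and missing or incomplete folders. `existing_quant_file` shows whether a .quant file from an earlier run already sits next to each folder, which may be relevant to the final .quant error in the log.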