vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Annotate proteins for spectronaut TSV library #1217

Closed patrick-willems closed 1 month ago

patrick-willems commented 1 month ago

Hey Vadim,

First of all, congrats on the 1.9 release. I have already re-analyzed lots of older data given the improved performance of the new release.

I have a small question regarding the protein reannotation function in DIA-NN. I am using Spectronaut-formatted predicted libraries that look like this:

ModifiedPeptide StrippedPeptide PrecursorCharge PrecursorMz     IonMobility     iRT     ProteinId       RelativeFragmentIntensity       FragmentMz      FragmentType  FragmentNumber  FragmentCharge  FragmentLossType
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.00023522      100.07564545    b       1       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.00167886      187.10766602    b       2       1       noloss

I enabled --reannotate and specified the correct UniProtKB FASTA, hoping that DIA-NN would assign the proteins for me. The log does say that it is reannotating library precursors, but the precursors are then not assigned to any proteins. Is it possible to add proteins to such a library with DIA-NN? Otherwise, I will of course add the columns myself.

Thanks, Patrick
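
For the fallback mentioned above (filling in the protein column by hand), a minimal Python sketch could look like the following. It assumes a tab-separated library with `StrippedPeptide` and `ProteinId` columns as in the excerpt, UniProt-style FASTA headers (`>sp|ACCESSION|NAME ...`), and `;` as the separator between multiple accessions; the function names are hypothetical and this is not how DIA-NN itself reannotates. The search is a plain substring match with no enzyme specificity, and it is naive, so a library with millions of precursors would need an indexed search instead.

```python
import csv
import io


def read_fasta(handle):
    """Yield (accession, sequence) pairs from FASTA text."""
    acc, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if acc is not None:
                yield acc, "".join(chunks)
            fields = line[1:].split()
            header = fields[0] if fields else ""
            parts = header.split("|")
            # UniProt headers look like sp|P12345|NAME_HUMAN; otherwise keep the first token.
            acc = parts[1] if len(parts) >= 3 else header
            chunks = []
        elif line:
            chunks.append(line)
    if acc is not None:
        yield acc, "".join(chunks)


def annotate_library(lib_handle, fasta_handle, out_handle):
    """Copy the library to out_handle, filling ProteinId by substring matching."""
    proteins = list(read_fasta(fasta_handle))
    reader = csv.DictReader(lib_handle, delimiter="\t")
    writer = csv.DictWriter(out_handle, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    cache = {}  # peptide -> joined accessions, so each peptide is searched once
    for row in reader:
        pep = row["StrippedPeptide"]
        if pep not in cache:
            cache[pep] = ";".join(acc for acc, seq in proteins if pep in seq)
        row["ProteinId"] = cache[pep]
        writer.writerow(row)


if __name__ == "__main__":
    # Toy data only; the protein sequence below is made up for illustration.
    fasta = io.StringIO(">sp|P00000|TOY_HUMAN toy\nMKVSTVSELVTGGR\n")
    lib = io.StringIO("StrippedPeptide\tProteinId\nVSTVSELVT\t\n")
    out = io.StringIO()
    annotate_library(lib, fasta, out)
    print(out.getvalue())
```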

vdemichev commented 1 month ago

Hi Patrick,

Can you please share the library in .tsv format (can be just a single peptide in there, does not need to be a full one) and the DIA-NN log?

Best, Vadim

patrick-willems commented 1 month ago

Hey Vadim,

It rather seems to be a Linux/Docker issue: I just tested on Windows, and there it did reannotate the peptides correctly. So from my side it is totally fine now and I will just use a Windows system. Just to be complete, here are the library and log from Linux:

Lib TSV (one peptide - note that I still need to convert CCS to 1/K0):

ModifiedPeptide StrippedPeptide PrecursorCharge PrecursorMz     IonMobility     iRT     ProteinId       RelativeFragmentIntensity       FragmentMz      FragmentType  FragmentNumber  FragmentCharge  FragmentLossType
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.00023522      100.07564545    b       1       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.00167886      187.10766602    b       2       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.10095927      288.15533447    b       3       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.14474481      387.22375488    b       4       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.18281789      474.25576782    b       5       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.77617210      603.29840088    b       6       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             1.00000000      716.38244629    b       7       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.33714399      815.45086670    b       8       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.00000000      120.06547546    y       1       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.00683188      219.13389587    y       2       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.00981588      332.21792603    y       3       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.02758598      461.26052856    y       4       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.11619335      548.29260254    y       5       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.03877397      647.36102295    y       6       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.02824988      748.40869141    y       7       1       noloss
VSTVSELVT       VSTVSELVT       1       934.50970227    281.41616821    44.32462692             0.01109457      835.44073486    y       8       1       noloss
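
The CCS to 1/K0 conversion mentioned in the note above the table is usually done with the Mason-Schamp equation. The sketch below derives the conversion factor from physical constants rather than hard-coding it. It assumes that the IonMobility column holds CCS in Å², that the drift gas is N2, and that the temperature is 305 K; all three are assumptions, not something stated in this thread, and the function names are hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
K_BOLTZMANN = 1.380649e-23   # Boltzmann constant, J/K
DALTON = 1.66053906660e-27   # atomic mass unit, kg
LOSCHMIDT = 2.686780111e25   # gas number density at 273.15 K and 101.325 kPa, m^-3


def mason_schamp_coefficient(temperature_k=305.0):
    """Factor c in CCS[A^2] = c * z / sqrt(mu[Da]) * (1/K0)[V*s/cm^2]."""
    si = (3.0 * E_CHARGE / (16.0 * LOSCHMIDT)) * math.sqrt(
        2.0 * math.pi / (K_BOLTZMANN * temperature_k * DALTON))
    # m^2 -> A^2 is a factor 1e20; 1/K0 in V*s/cm^2 -> V*s/m^2 is a factor 1e4.
    return si * 1e20 * 1e4


def reduced_mass(precursor_mz, charge, gas_mass=28.0134):
    """Reduced mass (Da) of the ion and one drift gas molecule (N2 assumed)."""
    ion_mass = precursor_mz * charge
    return ion_mass * gas_mass / (ion_mass + gas_mass)


def ccs_to_inv_k0(ccs, charge, precursor_mz, temperature_k=305.0):
    """Convert CCS (A^2) to reduced inverse mobility 1/K0 (V*s/cm^2)."""
    mu = reduced_mass(precursor_mz, charge)
    return ccs * math.sqrt(mu) / (charge * mason_schamp_coefficient(temperature_k))


def inv_k0_to_ccs(inv_k0, charge, precursor_mz, temperature_k=305.0):
    """Inverse of ccs_to_inv_k0."""
    mu = reduced_mass(precursor_mz, charge)
    return inv_k0 * charge * mason_schamp_coefficient(temperature_k) / math.sqrt(mu)


if __name__ == "__main__":
    # Values taken from the library excerpt above.
    print(ccs_to_inv_k0(281.41616821, 1, 934.50970227))
```

Different tools use slightly different constants and temperatures, so values converted this way may differ in the last digits from what another tool reports.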

My log is below:

DIA-NN 1.9.1 (Data-Independent Acquisition by Neural Networks)
Compiled on Jul 15 2024 09:42:01
Current date and time: Wed Oct 16 17:31:37 2024
Logical CPU cores: 32
Library precursors will be reannotated using the FASTA database
Thread number set to 32
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
The spectral library (if generated) will retain the original spectra but will include empirically-aligned RTs
Mass accuracy will be fixed to 1.5e-05 (MS2) and 1.5e-05 (MS1)
WARNING: MBR turned off, two or more raw files are required

1 files will be processed
[0:00] Loading spectral library /data/9mers_valid_spectronaut.tsv
[2:42] Finding proteotypic peptides (assuming that the list of UniProt ids provided for each peptide is complete)
[3:03] Spectral library loaded: 0 protein isoforms, 0 protein groups and 9266887 precursors in 8731813 elution groups.
[3:03] Loading FASTA /data/UP000005640_9606_07082024.fasta
[3:26] Reannotating library precursors with information from the FASTA database
[3:29] Finding proteotypic peptides (assuming that the list of UniProt ids provided for each peptide is complete)
[3:29] 9266887 precursors generated
[3:29] Library contains 0 proteins, and 0 genes
[3:31] Initialising library
[3:52] Saving the library to /data/9mers_valid_spectronaut.tsv.skyline.speclib

[3:59] File #1/1
[3:59] Loading run /data/T063656_AurEl8_PM8_DIAIMP_CMB-1691_21_GD5_1_10545.d
[4:47] 9266887 library precursors are potentially detectable
[4:49] Processing...
[128:58] RT window set to 2.51954
[128:58] Ion mobility window set to 0.749911
[128:58] Peak width: 5.008
[128:58] Scan window radius set to 11
[129:00] Recommended MS1 mass accuracy setting: 12.0066 ppm
[272:27] Removing low confidence identifications
[272:28] Removing interfering precursors
[272:42] Training neural networks: 11961 targets, 7869 decoys
[272:50] Number of IDs at 0.01 FDR: 4762
[272:56] No protein annotation, skipping protein q-value calculation
[272:56] Quantification
[272:57] Quantification information saved to /data/T063656_AurEl8_PM8_DIAIMP_CMB-1691_21_GD5_1_10545.d.quant

[272:57] Cross-run analysis
[272:57] Reading quantification information: 1 files
[272:58] Quantifying peptides
[272:58] Quantifying proteins
[272:58] No protein annotation, skipping protein q-value calculation
[272:58] No protein annotation, skipping global protein q-value calculation
[272:58] Compressed report saved to /data/report.parquet. Use R 'arrow' or Python 'PyArrow' package to process
[272:58] Writing report
[272:59] Report saved to /data/report.tsv.
[272:59] Saving precursor levels matrix
[272:59] Precursor levels matrix (1% precursor and protein group FDR) saved to /data/report.pr_matrix.tsv.
[272:59] Saving protein group levels matrix
[272:59] Protein group levels matrix (1% precursor FDR and protein group FDR) saved to /data/report.pg_matrix.tsv.
[272:59] Saving gene group levels matrix
[272:59] Gene groups levels matrix (1% precursor FDR and protein group FDR) saved to /data/report.gg_matrix.tsv.
[272:59] Saving unique genes levels matrix
[272:59] Unique genes levels matrix (1% precursor FDR and protein group FDR) saved to /data/report.unique_genes_matrix.tsv.
[272:59] Manifest saved to /data/report.manifest.txt
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_create

Best Patrick

vdemichev commented 1 month ago

Hi Patrick,

Thanks for the info! Could you please attach the .tsv as a file? I just tried to debug, but it's tricky to copy-paste from GitHub while retaining the formatting.

Best, Vadim

patrick-willems commented 1 month ago

Hey,

Yes, the first 1000 lines are here: lib_spectronaut.txt

The FASTA was the human reference UniProtKB.

Thanks!

vdemichev commented 1 month ago

Thank you!

vdemichev commented 1 month ago

Looks like a non-tryptic lib; I get zero annotations on Windows as well. With --cut F,Y,W,M,L,!P specificity I get identical results (2 genes matched) on Windows and Linux.

Are you sure you get different results on Windows and Linux with identical settings? If yes, could you please share the full lib and the settings that cause it? Apologies for so many requests.

If you want --reannotate to just match to anything, please use --cut to enable cuts after arbitrary amino acids (to do this, list all amino acids in --cut, e.g. --cut A,G,L,I,...).
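
To illustrate why cleavage specificity decides whether a library peptide gets annotated, here is a minimal Python sketch. It is not DIA-NN's implementation: the function name is hypothetical, the protein sequence is made up, and the model is deliberately simplified (a peptide counts as consistent only if both of its ends coincide with an allowed cut site or a protein terminus; exclusion rules such as `!P` and N-terminal methionine removal are ignored).

```python
ALL_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def consistent_with_cuts(peptide, protein, cut_after):
    """True if some occurrence of peptide in protein has both ends at allowed sites.

    cut_after is the set of residues after which cleavage is allowed.
    """
    if not peptide:
        return False
    start = protein.find(peptide)
    while start != -1:
        end = start + len(peptide)
        # N-terminal side: protein start, or the preceding residue allows a cut.
        n_ok = start == 0 or protein[start - 1] in cut_after
        # C-terminal side: protein end, or the peptide's last residue allows a cut.
        c_ok = end == len(protein) or peptide[-1] in cut_after
        if n_ok and c_ok:
            return True
        start = protein.find(peptide, start + 1)
    return False


if __name__ == "__main__":
    protein = "MKVSTVSELVTGGR"  # made-up sequence containing the library peptide
    peptide = "VSTVSELVT"
    print(consistent_with_cuts(peptide, protein, set("KR")))             # tryptic-like rule
    print(consistent_with_cuts(peptide, protein, set(ALL_AMINO_ACIDS)))  # cut anywhere
    # Comma-separated list of all 20 standard residues, in the style suggested above.
    print(",".join(ALL_AMINO_ACIDS))
```

In this toy case the peptide ends in T, so a K,R-only rule rejects it, while allowing cuts after every residue accepts it. The exact --cut syntax should be checked against the DIA-NN documentation.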

Best, Vadim

patrick-willems commented 1 month ago

Aah yes, indeed, the cut flag had to be adapted: it was set to the default K,R on my Windows machine, while I had it wrong on Linux. Thanks for pointing it out, and sorry for the inconvenience!

Patrick