vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

issue with xic parquet path in linux version #1260

Open heejongkim opened 6 days ago

heejongkim commented 6 days ago

Hi,

I just noticed that, in the Linux version running under Docker, the xic folder wasn't generated with the --xic argument. I found that the path was set to start with "/", so all xic parquet files get saved in an xic folder under the root path / and are deleted at the end of the run because the Docker container gets destroyed. Would it be possible to set the path manually, or to remove the "/" at the beginning?

Thank you.

best, heejong

vdemichev commented 4 days ago

Hi heejong,

The output for XIC generation should be a folder within the same location as the main report file. Does it work for you if you specify some location the user account has full write access for as the location of the main report?

Best, Vadim

heejongkim commented 4 days ago

Hi Vadim,

Yeah. In the Windows version, I see the relevant xic folder within the folder where report.tsv was generated, but not in the Linux version. Here's a possibly relevant log excerpt that will hopefully be helpful.

[464:40] XICs saved to /report_xic/[FILENAME].xic.parquet

Thanks. best, heejong

vdemichev commented 4 days ago

What does the whole log look like?

heejongkim commented 4 days ago

Cmd:

docker run -v ${PWD}:/data/ diann-1.9.2 diann-linux --cfg human_noMBR_xic.cfg

Cfg:

--lib human_proteome_library_192.predicted.speclib --threads 250 --verbose 1 --out report.tsv --qvalue 0.01 --matrices --out-lib report-lib.parquet --gen-spec-lib --reannotate --fasta camprotR_240512_cRAP_20190401_full_tags.fasta --cont-quant-exclude cRAP- --fasta human_UP000005640_9606.fasta --met-excision --min-pep-len 7 --max-pep-len 35 --min-pr-mz 300 --max-pr-mz 1050 --min-pr-charge 1 --max-pr-charge 5 --cut K,R --missed-cleavages 3 --unimod4 --relaxed-prot-inf --rt-profiling --xic --f 2024-09-04-Megan-QC-HeLa-100ng_20240905050956.mzML

Log:

DIA-NN 1.9.2 (Data-Independent Acquisition by Neural Networks)
Compiled on Oct 31 2024 04:27:44
Current date and time: Fri Nov 15 15:37:54 2024
Logical CPU cores: 48
Thread number set to 48
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
Library precursors will be reannotated using the FASTA database
Peptides corresponding to protein sequence IDs tagged with cRAP- will be excluded from normalisation as well as quantification of protein groups that do not include proteins bearing the tag
N-terminal methionine excision enabled
Min peptide length set to 7
Max peptide length set to 35
Min precursor m/z set to 300
Max precursor m/z set to 1050
Min precursor charge set to 1
Max precursor charge set to 5
In silico digest will involve cuts at K*,R*
Maximum number of missed cleavages set to 3
Cysteine carbamidomethylation enabled as a fixed modification
Heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers, GO/pathway and system-scale analyses
The spectral library (if generated) will retain the original spectra but will include empirically-aligned RTs
XICs within 10 seconds from the apex will be extracted for each precursor and saved in .parquet format, a folder will be created next to the main report for the XICs storage
DIA-NN will optimise the mass accuracy automatically using the first run in the experiment. This is useful primarily for quick initial analyses, when it is not yet known which mass accuracy setting works best for a particular acquisition scheme.

1 files will be processed
[0:00] Loading spectral library human_proteome_library_192.predicted.speclib
[0:06] Library annotated with sequence database(s): camprotR_240512_cRAP_20190401_full_tags.fasta; human_UP000005640_9606.fasta
[0:08] Spectral library loaded: 21104 protein isoforms, 32374 protein groups and 6891429 precursors in 3058285 elution groups.
[0:08] Loading FASTA camprotR_240512_cRAP_20190401_full_tags.fasta
[0:08] Loading FASTA human_UP000005640_9606.fasta
[0:40] Reannotating library precursors with information from the FASTA database
[0:46] Finding proteotypic peptides (assuming that the list of UniProt ids provided for each peptide is complete)
[0:46] 6891429 precursors generated
[0:46] Gene names missing for some isoforms
[0:46] Library contains 21104 proteins, and 20449 genes
[0:48] Initialising library
WARNING: it is strongly recommended to enable MBR when analysing with a large library, if this is a quantitative analysis

[1:13] File #1/1
[1:13] Loading run 2024-09-04-Megan-QC-HeLa-100ng_20240905050956.mzML
[2:05] 5684934 library precursors are potentially detectable
[2:06] Calibrating with mass accuracies 30 (MS1), 20 (MS2)
[2:25] RT window set to 0.955239
[2:25] Peak width: 2.576
[2:25] Scan window radius set to 5
[2:26] Recommended MS1 mass accuracy setting: 2.57927 ppm
[3:09] Optimised mass accuracy: 8.28205 ppm
[4:40] Removing low confidence identifications
[4:41] Removing interfering precursors
[4:52] Training neural networks on 318247 PSMs
[5:04] Number of IDs at 0.01 FDR: 102617
[5:06] Calculating protein q-values
[5:06] Number of genes identified at 1% FDR: 8330 (precursor-level), 7577 (protein-level) (inference performed using proteotypic peptides only)
[5:06] Quantification
[5:08] Quantification information saved to 2024-09-04-Megan-QC-HeLa-100ng_20240905050956.mzML.quant
[5:13] XICs saved to /report_xic/2024-09-04-Megan-QC-HeLa-100ng_20240905050956.xic.parquet

[5:13] Cross-run analysis
[5:13] Reading quantification information: 1 files
[5:16] Quantifying peptides
[5:16] Assembling protein groups
[5:19] Quantifying proteins
[5:19] Calculating q-values for protein and gene groups
[5:20] Calculating global q-values for protein and gene groups
[5:20] Protein groups with global q-value <= 0.01: 7705
[5:21] Compressed report saved to report.parquet. Use R 'arrow' or Python 'PyArrow' package to process
[5:21] Writing report
[5:23] Report saved to report.tsv.
[5:23] Saving precursor levels matrix
[5:23] Precursor levels matrix (1% precursor and protein group FDR) saved to report.pr_matrix.tsv.
[5:23] Saving protein group levels matrix
[5:23] Protein group levels matrix (1% precursor FDR and protein group FDR) saved to report.pg_matrix.tsv.
[5:23] Saving gene group levels matrix
[5:24] Gene groups levels matrix (1% precursor FDR and protein group FDR) saved to report.gg_matrix.tsv.
[5:24] Saving unique genes levels matrix
[5:24] Unique genes levels matrix (1% precursor FDR and protein group FDR) saved to report.unique_genes_matrix.tsv.
[5:24] Manifest saved to report.manifest.txt
[5:24] Stats report saved to report.stats.tsv
[5:24] Generating spectral library:
[5:26] 102614 target and 1026 decoy precursors saved
[5:26] Spectral library saved to report-lib.parquet

The following warnings or errors (in alphabetic order) were detected at least the indicated number of times:
WARNING: it is strongly recommended to enable MBR when analysing with a large library, if this is a quantitative analysis : 1
Finished

How to cite:
using DIA-NN: Demichev et al, Nature Methods, 2020, https://www.nature.com/articles/s41592-019-0638-x
analysing Scanning SWATH: Messner et al, Nature Biotechnology, 2021, https://www.nature.com/articles/s41587-021-00860-4
analysing PTMs: Steger et al, Nature Communications, 2021, https://www.nature.com/articles/s41467-021-25454-1
analysing dia-PASEF: Demichev et al, Nature Communications, 2022, https://www.nature.com/articles/s41467-022-31492-0
analysing Slice-PASEF: Szyrwiel et al, biorxiv, 2022, https://doi.org/10.1101/2022.10.31.514544
plexDIA / multiplexed DIA: Derks et al, Nature Biotechnology, 2023, https://www.nature.com/articles/s41587-022-01389-w
CysQuant: Huang et al, Redox Biology, 2023, https://doi.org/10.1016/j.redox.2023.102908
using QuantUMS: Kistner at al, biorxiv, 2023, https://doi.org/10.1101/2023.06.20.545604
[5:26] Log saved to report.log.txt

and I don't see a report_xic folder in the location where report.tsv was generated.

best, heejong
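A quick way to spot this kind of problem is to scan the log for output paths that fall outside the mounted directory. The sketch below is a minimal, unofficial helper: the function name `paths_outside_mount` is hypothetical, `/data` is taken from the `-v ${PWD}:/data/` mount in the command above, and it assumes the container's working directory lies inside that mount, so relative paths are treated as safe.

```python
import re
from pathlib import PurePosixPath

# Matches log lines such as "[5:13] XICs saved to /report_xic/run.xic.parquet".
SAVED_TO = re.compile(r"saved to (\S+)")


def paths_outside_mount(log_text, mount_root="/data"):
    """Return absolute output paths from the log that are not under mount_root."""
    root = PurePosixPath(mount_root)
    outside = []
    for line in log_text.splitlines():
        match = SAVED_TO.search(line)
        if not match:
            continue
        # Some log lines end with a full stop ("Report saved to report.tsv.").
        path = PurePosixPath(match.group(1).rstrip("."))
        # Assumption: relative paths resolve to a directory inside the mount.
        if path.is_absolute() and path != root and root not in path.parents:
            outside.append(str(path))
    return outside
```

Run against the log above, this would be expected to flag only the `/report_xic/...xic.parquet` entry, since every other output is written to a relative path.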

vdemichev commented 4 days ago

--out report.tsv

What happens if you specify a particular folder for report.tsv?
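To try this suggestion without editing the cfg by hand, a small script can prefix bare output file names with a folder. This is only a sketch: `put_outputs_in_folder` and the folder name `result` are illustrative choices, and it assumes the cfg is a single whitespace-separated option string like the one posted above. Whether DIA-NN creates a missing output folder is not stated in this thread, so it is safest to create the folder inside the mounted directory beforehand.

```python
import shlex
from pathlib import PurePosixPath


def put_outputs_in_folder(cfg_text, folder="result", options=("--out", "--out-lib")):
    """Prefix bare output file names in a DIA-NN cfg string with `folder`."""
    tokens = shlex.split(cfg_text)
    for i, token in enumerate(tokens[:-1]):
        if token in options:
            value = PurePosixPath(tokens[i + 1])
            # Only touch values with no directory component, e.g. "report.tsv".
            if value.parent == PurePosixPath("."):
                tokens[i + 1] = str(PurePosixPath(folder) / value)
    return shlex.join(tokens)


cfg = "--lib lib.speclib --out report.tsv --out-lib report-lib.parquet --xic"
print(put_outputs_in_folder(cfg))
```

Values that already contain a directory, such as `out/report.tsv`, are left unchanged.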

heejongkim commented 4 days ago

DIA-NN 1.9.2 (Data-Independent Acquisition by Neural Networks)
Compiled on Oct 31 2024 04:27:44
Current date and time: Fri Nov 15 15:48:56 2024
Logical CPU cores: 48
Thread number set to 48
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
Library precursors will be reannotated using the FASTA database
Peptides corresponding to protein sequence IDs tagged with cRAP- will be excluded from normalisation as well as quantification of protein groups that do not include proteins bearing the tag
N-terminal methionine excision enabled
Min peptide length set to 7
Max peptide length set to 35
Min precursor m/z set to 300
Max precursor m/z set to 1050
Min precursor charge set to 1
Max precursor charge set to 5
In silico digest will involve cuts at K*,R*
Maximum number of missed cleavages set to 3
Cysteine carbamidomethylation enabled as a fixed modification
Heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers, GO/pathway and system-scale analyses
The spectral library (if generated) will retain the original spectra but will include empirically-aligned RTs
XICs within 10 seconds from the apex will be extracted for each precursor and saved in .parquet format, a folder will be created next to the main report for the XICs storage
DIA-NN will optimise the mass accuracy automatically using the first run in the experiment. This is useful primarily for quick initial analyses, when it is not yet known which mass accuracy setting works best for a particular acquisition scheme.

1 files will be processed
[0:00] Loading spectral library human_proteome_library_192.predicted.speclib
[0:05] Library annotated with sequence database(s): camprotR_240512_cRAP_20190401_full_tags.fasta; human_UP000005640_9606.fasta
[0:07] Spectral library loaded: 21104 protein isoforms, 32374 protein groups and 6891429 precursors in 3058285 elution groups.
[0:07] Loading FASTA camprotR_240512_cRAP_20190401_full_tags.fasta
[0:07] Loading FASTA human_UP000005640_9606.fasta
[0:30] Reannotating library precursors with information from the FASTA database
[0:36] Finding proteotypic peptides (assuming that the list of UniProt ids provided for each peptide is complete)
[0:36] 6891429 precursors generated
[0:36] Gene names missing for some isoforms
[0:36] Library contains 21104 proteins, and 20449 genes
[0:38] Initialising library
WARNING: it is strongly recommended to enable MBR when analysing with a large library, if this is a quantitative analysis

[1:01] File #1/1
[1:01] Loading run 2024-09-04-Megan-QC-HeLa-100ng_20240905050956.mzML
[2:31] 5684934 library precursors are potentially detectable
[2:32] Calibrating with mass accuracies 30 (MS1), 20 (MS2)
[2:46] RT window set to 0.955239
[2:46] Peak width: 2.576
[2:46] Scan window radius set to 5
[2:46] Recommended MS1 mass accuracy setting: 2.57927 ppm
[3:20] Optimised mass accuracy: 8.28205 ppm
[4:36] Removing low confidence identifications
[4:37] Removing interfering precursors
[4:49] Training neural networks on 318247 PSMs
[5:00] Number of IDs at 0.01 FDR: 102617
[5:02] Calculating protein q-values
[5:03] Number of genes identified at 1% FDR: 8330 (precursor-level), 7577 (protein-level) (inference performed using proteotypic peptides only)
[5:03] Quantification
[5:04] Quantification information saved to 2024-09-04-Megan-QC-HeLa-100ng_20240905050956.mzML.quant
[5:09] XICs saved to result/report_xic/2024-09-04-Megan-QC-HeLa-100ng_20240905050956.xic.parquet

[5:10] Cross-run analysis
[5:10] Reading quantification information: 1 files
[5:12] Quantifying peptides
[5:13] Assembling protein groups
[5:15] Quantifying proteins
[5:16] Calculating q-values for protein and gene groups
[5:17] Calculating global q-values for protein and gene groups
[5:17] Protein groups with global q-value <= 0.01: 7705
[5:18] Compressed report saved to result/report.parquet. Use R 'arrow' or Python 'PyArrow' package to process
[5:18] Writing report
[5:20] Report saved to result/report.tsv.
[5:20] Saving precursor levels matrix
[5:20] Precursor levels matrix (1% precursor and protein group FDR) saved to result/report.pr_matrix.tsv.
[5:20] Saving protein group levels matrix
[5:20] Protein group levels matrix (1% precursor FDR and protein group FDR) saved to result/report.pg_matrix.tsv.
[5:20] Saving gene group levels matrix
[5:20] Gene groups levels matrix (1% precursor FDR and protein group FDR) saved to result/report.gg_matrix.tsv.
[5:20] Saving unique genes levels matrix
[5:20] Unique genes levels matrix (1% precursor FDR and protein group FDR) saved to result/report.unique_genes_matrix.tsv.
[5:20] Manifest saved to result/report.manifest.txt
[5:20] Stats report saved to result/report.stats.tsv
[5:20] Generating spectral library:
[5:23] 102614 target and 1026 decoy precursors saved
[5:23] Spectral library saved to result/report-lib.parquet

The following warnings or errors (in alphabetic order) were detected at least the indicated number of times:
WARNING: it is strongly recommended to enable MBR when analysing with a large library, if this is a quantitative analysis : 1
Finished

How to cite:
using DIA-NN: Demichev et al, Nature Methods, 2020, https://www.nature.com/articles/s41592-019-0638-x
analysing Scanning SWATH: Messner et al, Nature Biotechnology, 2021, https://www.nature.com/articles/s41587-021-00860-4
analysing PTMs: Steger et al, Nature Communications, 2021, https://www.nature.com/articles/s41467-021-25454-1
analysing dia-PASEF: Demichev et al, Nature Communications, 2022, https://www.nature.com/articles/s41467-022-31492-0
analysing Slice-PASEF: Szyrwiel et al, biorxiv, 2022, https://doi.org/10.1101/2022.10.31.514544
plexDIA / multiplexed DIA: Derks et al, Nature Biotechnology, 2023, https://www.nature.com/articles/s41587-022-01389-w
CysQuant: Huang et al, Redox Biology, 2023, https://doi.org/10.1016/j.redox.2023.102908
using QuantUMS: Kistner at al, biorxiv, 2023, https://doi.org/10.1101/2023.06.20.545604
[5:23] Log saved to result/report.log.txt

Now the report_xic folder and the xic.parquet file were generated inside the result folder. I guess the leading '/' in '/report_xic/' can be an issue if the results are saved directly in the location where the software is executed. For now, I will make sure to use a separate output folder to avoid this issue.

Thank you!

best, heejong
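One way the two logs could be explained is a folder name built by joining the report's directory and `<report name>_xic` with a hard-coded "/". This is only a hypothesis to illustrate the symptom: DIA-NN's actual code is not shown in this thread, and both function names below are made up.

```python
import posixpath


def xic_folder_naive(out_path):
    """Hypothetical derivation: always insert "/" between directory and folder name."""
    directory, name = posixpath.split(out_path)
    stem = name.rsplit(".", 1)[0]
    # With out_path == "report.tsv", directory is "" and the result starts with "/".
    return directory + "/" + stem + "_xic"


def xic_folder_safe(out_path):
    """Same derivation, but posixpath.join omits the separator for an empty directory."""
    directory, name = posixpath.split(out_path)
    stem = name.rsplit(".", 1)[0]
    return posixpath.join(directory, stem + "_xic")


for out in ("report.tsv", "result/report.tsv"):
    print(out, xic_folder_naive(out), xic_folder_safe(out))
```

Under this assumption, `--out report.tsv` gives `/report_xic` in the naive version and `report_xic` in the safe one, while `--out result/report.tsv` gives `result/report_xic` in both, which is consistent with the paths reported in the two logs.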