vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

FASTA files with Windows line endings are read incorrectly in the Linux version #688

Open ghuls opened 1 year ago

ghuls commented 1 year ago

FASTA files with Windows line endings (CRLF) are read incorrectly in the Linux version.

Reading the same FASTA file in the Windows and Linux versions gives fewer precursors, proteins, and genes on Linux.

Windows:

[0:00] Loading FASTA F:\uniprot-download_true_format_fasta_query__28_2A_29_20AND_20_28model_-2023.04.29-13.24.09.48.fasta
[0:06] Processing FASTA
[0:21] Assembling elution groups
[0:31] 3925057 precursors generated
[0:31] Gene names missing for some isoforms
[0:31] Library contains 20404 proteins, and 20186 genes

Linux:

[0:00] Loading FASTA /proteomics/uniprot-download_true_format_fasta_query__28_2A_29_20AND_20_28model_-2023.04.29-13.24.09.48.fasta
[0:03] Processing FASTA
[0:10] Assembling elution groups
[0:15] 2822903 precursors generated
[0:15] Gene names missing for some isoforms
[0:15] Library contains 20381 proteins, and 20169 genes

After removing the \r (carriage return) characters from the FASTA file, it works fine on Linux:

cat uniprot-download_true_format_fasta_query__28_2A_29_20AND_20_28model_-2023.04.29-13.24.09.48.fasta | tr -d '\r' > uniprot-download_true_format_fasta_query__28_2A_29_20AND_20_28model_-2023.04.29-13.24.09.48.fixed.fasta

Linux:

[0:00] Loading FASTA /proteomics/uniprot-download_true_format_fasta_query__28_2A_29_20AND_20_28model_-2023.04.29-13.24.09.48.fixed.fasta
[0:04] Processing FASTA
[0:13] Assembling elution groups
[0:19] 3925057 precursors generated
[0:19] Gene names missing for some isoforms
[0:19] Library contains 20404 proteins, and 20186 genes

DIA-NN should ignore those \r characters when reading the file, or at least warn about them if any are found.

vdemichev commented 1 year ago

Thank you for spotting this. DIA-NN uses standard C++ functions to process text files; apparently these don't work well cross-platform...
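
For context, a minimal sketch of the likely mechanism, assuming the FASTA file is read line by line with `std::getline` (this is not DIA-NN's actual code, and all names below are hypothetical). On Linux, `std::getline` splits on `\n` only and the stream does no CRLF translation, so every line of a Windows-formatted file keeps a trailing `\r`. Dropping that character after each read, and counting how often it happened so a warning can be printed, would cover both requests above:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of DIA-NN): remove a single trailing '\r'
// left behind when a CRLF file is read on a platform that does not
// translate line endings. Returns true if a '\r' was removed.
inline bool strip_trailing_cr(std::string& line) {
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
        return true;
    }
    return false;
}

// Hypothetical result type: the cleaned lines plus the number of lines
// that ended in CRLF, so the caller can decide whether to warn.
struct FastaLines {
    std::vector<std::string> lines;
    std::size_t crlf_lines = 0;
};

// Read all lines from a stream, normalising CRLF to LF on the fly.
inline FastaLines read_lines_portably(std::istream& in) {
    FastaLines out;
    std::string line;
    while (std::getline(in, line)) {
        if (strip_trailing_cr(line)) ++out.crlf_lines;
        out.lines.push_back(line);
    }
    return out;
}

// Small usage check on an in-memory stream with mixed line endings.
inline void self_check() {
    std::istringstream in(">sp|P1|A\r\nMKT\r\nLLV\n");
    FastaLines r = read_lines_portably(in);
    assert(r.lines.size() == 3);
    assert(r.lines[1] == "MKT");   // no trailing '\r' left on the sequence
    assert(r.crlf_lines == 2);     // a caller could print a warning if > 0
}
```

The same function works unchanged on Windows, where text-mode streams already translate CRLF, so `crlf_lines` would simply stay at zero there.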