vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Malformed Library tsv files when Gene ID at the end of fasta description #1110

Open SeereouslyDrewNichols opened 4 months ago

SeereouslyDrewNichols commented 4 months ago

There is an issue with DIA-NN when using the --gen-spec-lib flag to generate a library from a FASTA file in which the gene ID is the last field of the description. Concretely, if the description for an entry looks like the one below, where GN=XXXX is the last field, DIA-NN picks up the line break along with GN=XXXX and includes it in the generated library. The result is a malformed TSV file with a line break in the Genes column (see attached picture). Would it be possible to patch v1.8.1 and above with this fix?

sp|Q29536-2|KPYR_CANLF Isoform L-type of Pyruvate kinase PKLR OS=Canis lupus familiaris OX=9615 GN=PKLR
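
A possible workaround until a fix is available is to clean the FASTA file before library generation. The sketch below assumes the stray line break comes from a carriage return (CRLF line endings) or trailing whitespace at the end of the header line; that cause is not confirmed in this thread. The function names are hypothetical and are not part of DIA-NN.

```python
def normalize_fasta_lines(lines):
    """Yield FASTA lines with trailing whitespace and line-break characters removed."""
    for line in lines:
        cleaned = line.rstrip()  # drops trailing "\r", "\n", spaces and tabs
        if cleaned:              # skip lines that are empty after cleaning
            yield cleaned


def normalize_fasta_text(text):
    """Return FASTA text with LF-only line endings and no trailing whitespace."""
    # str.splitlines() splits on "\r\n", "\n" and a lone "\r" alike.
    return "\n".join(normalize_fasta_lines(text.splitlines())) + "\n"


def normalize_fasta_file(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path (both paths are placeholders)."""
    # newline="" disables newline translation so "\r" characters are seen as-is.
    with open(src_path, newline="") as fin:
        text = fin.read()
    # newline="\n" forces LF endings in the output regardless of platform.
    with open(dst_path, "w", newline="\n") as fout:
        fout.write(normalize_fasta_text(text))
```

The cleaned copy, rather than the original file, would then be the FASTA passed to DIA-NN. Whether this avoids the malformed Genes column in v1.8.1 has not been verified here.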

image

vdemichev commented 4 months ago

Thanks for the feedback. I cannot fix 1.8.1, but this should not be a problem in 1.9.1, since it writes libraries in .parquet format. I will also check whether there is still an issue with the trailing line break.

Best, Vadim

SeereouslyDrewNichols commented 4 months ago

Any chance of releasing the 1.8.1 code base? I can patch it myself.

vdemichev commented 4 months ago

No, it's intended to be closed source.