Closed by hoelzer 3 years ago
I updated minimap2 and samtools to newer versions, let's see if this helps... I also pushed the version updates to the open PR #34.
I think I stumbled across this some time ago.
The thing is that the SAM format specification limits the length of read names (QNAME) to 254 characters.
So samtools basically does the right thing, because the format is incorrect.
If we want to avoid this, we'd have to check every input read... Not sure if it's possible to cut the read names only if this error occurs.
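A check over every input read could look roughly like the sketch below. It assumes plain four-line FASTQ records and treats the whole header after the `@` as the name, which is the situation after the whitespace replacement; the function name `overlong_read_names` is made up for illustration and is not part of the pipeline.

```python
import io

SAM_QNAME_MAX = 254  # read name (QNAME) length limit in the SAM specification


def overlong_read_names(fastq_handle, max_len=SAM_QNAME_MAX):
    """Yield (name, length) for every FASTQ record whose name exceeds max_len.

    Assumption: four lines per record (no wrapped sequence lines).
    """
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 0:
            continue  # only every fourth line is a header
        name = line.rstrip("\n")[1:]  # drop the leading '@'
        if len(name) > max_len:
            yield name, len(name)


# Small in-memory demo: one overlong name, one short name.
demo = io.StringIO("@" + "x" * 300 + "\nACGT\n+\nIIII\n@read2\nACGT\n+\nIIII\n")
for name, length in overlong_read_names(demo):
    print(f"{name[:20]}... has {length} characters")
```

Only the reads reported here would need to be shortened, so the rest could stay untouched.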
Do you actually remember why we do this renaming in the first place? :)
We insert DECONTAMINATE strings into the read ID to replace whitespaces, I think. Likely to restore the full read ID after mapping/read extraction?
see:
@11dbdf9c-e46f-4afd-b1b1-b7c3cdd2e938DECONTAMINATErunid=8842aa4e3c221ff265498dfdfdfe183253ab4499DECONTAMINATEread=9DECONTAMINATEch=247DECONTAMINATEstart_time=2021-04-28T12:20:06ZDECONTAMINATEflow_cell_id=FAP92766DECONTAMINATEprotocol_group_id=210428_GI5_Run21-086DECONTAMINATEsample_id=21-03679
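For reference, the replace/restore idea behind a read ID like the one above can be sketched as follows. This is not the pipeline's actual code; `encode_header` and `decode_header` are hypothetical names, and the sketch assumes the original header never contains the placeholder itself and that single spaces are the only whitespace worth restoring.

```python
PLACEHOLDER = "DECONTAMINATE"  # token visible in the example read ID


def encode_header(header, placeholder=PLACEHOLDER):
    """Join the whitespace-separated fields with the placeholder.

    Runs of whitespace collapse into one placeholder, so tabs or
    double spaces are not preserved (assumption: that is acceptable).
    """
    return placeholder.join(header.split())


def decode_header(encoded, placeholder=PLACEHOLDER):
    """Turn every placeholder back into a single space."""
    return " ".join(encoded.split(placeholder))


original = "11dbdf9c-e46f-4afd-b1b1-b7c3cdd2e938 runid=8842aa4e read=9 ch=247"
encoded = encode_header(original)
print(encoded)
print(decode_header(encoded) == original)
```

Each replaced space costs 13 characters with this placeholder, which is how a header with several fields gets past the 254 limit.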
Maybe, during the step where we rename reads anyway, we could check the length and cut the string off at 254 characters?
Or we should really think about whether this renaming is necessary at all... I think the idea was to have the original (and complete) read IDs in the decontaminated FASTQ files.
Yeah, I also think it was because of the whitespaces and the mapper messing with the readnames!
> Maybe, during the step where we rename reads anyway, we could check the length and cut the string off at 254 characters?
and we could use something shorter than DECONTAMINATE ^^
haha, true ^^ we just need to be sure that it can be safely replaced... what we could also do: rename to something like read1 read2 ... readXYZ, save a TSV map between the original name and the new name, and finally use that to restore the original names.
But we should also not blow up this task...
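A minimal sketch of that rename-and-map idea, kept small on purpose. It assumes four-line FASTQ records and that the mapper and the read extraction leave the short names unchanged; `rename_reads` and `restore_reads` are illustrative names, not existing pipeline functions.

```python
import csv
import io


def rename_reads(fastq_in, fastq_out, map_out, prefix="read"):
    """Write reads named read1, read2, ... plus a TSV of new name -> original header."""
    writer = csv.writer(map_out, delimiter="\t", lineterminator="\n")
    for line_no, line in enumerate(fastq_in):
        if line_no % 4 == 0:  # header line of a four-line record
            new_name = f"{prefix}{line_no // 4 + 1}"
            writer.writerow([new_name, line.rstrip("\n")[1:]])
            line = f"@{new_name}\n"
        fastq_out.write(line)


def restore_reads(fastq_in, fastq_out, map_in):
    """Put the original headers back using the TSV written by rename_reads."""
    mapping = dict(csv.reader(map_in, delimiter="\t"))
    for line_no, line in enumerate(fastq_in):
        if line_no % 4 == 0:
            line = f"@{mapping[line.rstrip(chr(10))[1:]]}\n"
        fastq_out.write(line)


# Round trip on an in-memory example.
raw = "@id-a runid=1 ch=247\nACGT\n+\nIIII\n@id-b runid=1 ch=12\nTTGA\n+\nIIII\n"
renamed, name_map, restored = io.StringIO(), io.StringIO(), io.StringIO()
rename_reads(io.StringIO(raw), renamed, name_map)
restore_reads(io.StringIO(renamed.getvalue()), restored, io.StringIO(name_map.getvalue()))
print(renamed.getvalue())
print(restored.getvalue() == raw)
```

The restore step only looks up the reads that survive decontamination, so the map can stay as large as the input while the cleaned FASTQ shrinks.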
here is a simple rename/restore Python script for FASTAs:
https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/bin/rename_fasta.py
but maybe that's also easily doable w/ seqkit or so... https://bioinf.shenwei.me/seqkit/
So Basti says minimap does the right thing, because only the first part (until the first whitespace) is the actual read name; the rest is just additional information.
From this point of view, we could remove all the renaming :sweat_smile:
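To make the name vs. description split concrete, here is a small sketch of how a FASTQ header divides at the first whitespace. The helper `split_header` is hypothetical; the example header is the read from earlier in the thread with spaces in place of the DECONTAMINATE tokens, which is an assumption about what the original looked like.

```python
def split_header(header_line):
    """Split a FASTQ header line into (read name, description).

    The name is everything after '@' up to the first whitespace;
    the description is whatever follows (may be empty).
    """
    parts = header_line.rstrip("\n")[1:].split(None, 1)
    if not parts:
        return "", ""
    name = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return name, description


header = (
    "@11dbdf9c-e46f-4afd-b1b1-b7c3cdd2e938 "
    "runid=8842aa4e3c221ff265498dfdfdfe183253ab4499 read=9 ch=247 "
    "start_time=2021-04-28T12:20:06Z flow_cell_id=FAP92766"
)
name, description = split_header(header)
print(name, len(name))
print(description)
```

Without the renaming, only the short name part ends up in the SAM/BAM output, so the length limit is not an issue for headers like this one.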
yeah... basically true. I think I wanted to keep all information bc/ e.g. I am not sure what happens if someone then uses the cleaned FASTQ w/ the missing description in e.g. a QC tool like NanoPlot etc...
But basically right... we can also skip the renaming and see if this causes trouble... this will also save time and disk usage. I'm fine w/ that
command:
error:
I think the problem is that the read names are too long for samtools? But if so, why did we not discover that before?