Open lkwhite opened 2 months ago
Hello @lkwhite,
Could you run modkit extract
with --log-filepath $log_file
and attach it? It should tell you why reads are failing.
Looks like we have a whole bunch of record XXX has improper data, MN tag length X and seq length Y don't match
.
These are tRNA sequencing data so those lengths are 85-135 nt and we use BWA MEM instead of mm2 for alignment.
The log was a little too big to attach so I've split it in two parts. mkextract_postremora_part1.txt mkextract_postremora_part2.txt
Hello @lkwhite,
It is possible that remora reference anchoring doesn't emit the correct MN tag or doesn't update it when the sequence length changes. Could you try removing the MN tags and seeing if modkit extract
will parse the base modifications? As a reminder you can remove the tags with samtools
:
samtools view -bhx MN ${bam} | modkit extract - extract.tsv --log-filepath test_tags.log
If the base modification tags are actually incorrect, you'll get different errors. If this works and you want the --read-calls
output you'll have to write the output of samtools view
to a file.
Let me know,
A
That reduces the % of reads failing, and the ones that fail now say record has improper data, malformed MM delta list
.
I couldn't find MN in the sam spec, how is Remora using this tag?
Hello @lkwhite,
The MN tag isn't in the spec yet, and remora doesn't use it. But dorado does, so we need to update the recommendation to remove the tags when using remora infer
.
Looks like quite a few reads are failing with the "improper data" error. I've extracted the read ids and attached them to this thread:
grep -Ei 'record [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} has improper data' ${fp} | grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' > malformed_read_ids.txt
Could you send me a few of these BAM records? Preferably both before and after reference-anchored base modification inference. If the files are too large (or you don't want them on github) you can email me at art.rand[at]nanoporetech.com and we can work out a way to share them.
Thanks.
Just in case anyone else encounters an issue with large number of skipped or error reads following "reference-anchored" remora base modification calling.
If you have previously used base modification calling with dorado
, it seems there is a bug in remora where the output will have multiple MM
tags (the original and the reference-anchored one). The parser in modkit
will throw an error on these reads since the MN tag will not match the SEQ length - which is correct. If you remove the MN
tag, you will get around this error, but now the basecall-anchroed base modification call will be used incorrectly or the read will fail completely. A modkit command to fix these tags and remora fix are in progress.
The correct work-around is to either not use base modification calling in the original dorado
command (but include --emit-moves
) or make sure to completely remove the MM/ML/MN tags prior to remora reference-anchored inference.
I have two datasets:
When I run
modkit extract
on these, I get two very different results:Is this an expected behavior? How should I interpret this?