How to interpret failed reads during modkit extract

nanoporetech / modkit

A bioinformatics tool for working with modified bases

https://nanoporetech.com/

Other

130 stars 7 forks source link

How to interpret failed reads during modkit extract #220

Open lkwhite opened 2 months ago

lkwhite commented 2 months ago

I have two datasets:

SUP-rebasecalled direct RNA with pseudouridine calling, aligned to a reference
The same data as above, but now put through reference-anchored inference with remora for pseU

When I run modkit extract on these, I get two very different results:

processed 151548 reads, 7390161 rows, skipped ~36 reads, failed ~0 reads
processed 3859 reads, 110317 rows, skipped ~0 reads, failed ~147524 reads

Is this an expected behavior? How should I interpret this?

ArtRand commented 2 months ago

Hello @lkwhite,

Could you run modkit extract with --log-filepath $log_file and attach it? It should tell you why reads are failing.

lkwhite commented 2 months ago

Looks like we have a whole bunch of record XXX has improper data, MN tag length X and seq length Y don't match.

These are tRNA sequencing data so those lengths are 85-135 nt and we use BWA MEM instead of mm2 for alignment.

lkwhite commented 2 months ago

The log was a little too big to attach so I've split it in two parts. mkextract_postremora_part1.txt mkextract_postremora_part2.txt

ArtRand commented 2 months ago

Hello @lkwhite,

It is possible that remora reference anchoring doesn't emit the correct MN tag or doesn't update it when the sequence length changes. Could you try removing the MN tags and seeing if modkit extract will parse the base modifications? As a reminder you can remove the tags with samtools:

samtools view -bhx MN ${bam} | modkit extract - extract.tsv --log-filepath test_tags.log

If the base modification tags are actually incorrect, you'll get different errors. If this works and you want the --read-calls output you'll have to write the output of samtools view to a file.

Let me know,

lkwhite commented 2 months ago

That reduces the % of reads failing, and the ones that fail now say record has improper data, malformed MM delta list.

I couldn't find MN in the sam spec, how is Remora using this tag?

test_tags.log

ArtRand commented 2 months ago

Hello @lkwhite,

The MN tag isn't in the spec yet, and remora doesn't use it. But dorado does, so we need to update the recommendation to remove the tags when using remora infer.

Looks like quite a few reads are failing with the "improper data" error. I've extracted the read ids and attached them to this thread:

grep -Ei 'record [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} has improper data' ${fp} | grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' > malformed_read_ids.txt

Could you send me a few of these BAM records? Preferably both before and after reference-anchored base modification inference. If the files are too large (or you don't want them on github) you can email me at art.rand[at]nanoporetech.com and we can work out a way to share them.

Thanks.

malformed_read_ids.txt

ArtRand commented 2 months ago

Just in case anyone else encounters an issue with large number of skipped or error reads following "reference-anchored" remora base modification calling.

If you have previously used base modification calling with dorado, it seems there is a bug in remora where the output will have multiple MM tags (the original and the reference-anchored one). The parser in modkit will throw an error on these reads since the MN tag will not match the SEQ length - which is correct. If you remove the MN tag, you will get around this error, but now the basecall-anchroed base modification call will be used incorrectly or the read will fail completely. A modkit command to fix these tags and remora fix are in progress.

The correct work-around is to either not use base modification calling in the original dorado command (but include --emit-moves) or make sure to completely remove the MM/ML/MN tags prior to remora reference-anchored inference.