Closed: haotianteng closed this issue 6 years ago
For the move state > 1 issue: this is not a bug, but a design decision about how to assign signal to genomic bases. I discussed this in more detail in an older issue, #50.
For the RNA data, note that RNA samples should be mapped to the reversed genome (reversed, not reverse complemented) because RNA is read in the 3'->5' direction; the reads will not map otherwise. Also, RNA data should be mapped to a transcriptome rather than the genome, as nanoraw cannot handle spliced reads appropriately at this time.
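To make the "reversed, not reverse complement" distinction concrete, here is a minimal sketch in plain Python. The function names are hypothetical and are not part of nanoraw; the snippet only illustrates the two string operations, assuming the reference is available as a plain DNA string.

```python
# Hypothetical helpers for illustration only; not nanoraw functions.

# Translation table that swaps each base for its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Return the sequence read backwards, with bases unchanged."""
    return seq[::-1]


def reverse_complement(seq):
    """Return the complement of each base, read backwards."""
    return seq.translate(_COMPLEMENT)[::-1]


reference = "ACGGT"
reversed_ref = reverse(reference)            # what the comment above recommends for RNA
revcomp_ref = reverse_complement(reference)  # the usual opposite-strand sequence
print(reversed_ref, revcomp_ref)
```

Reversing keeps every base as it is and only flips the order, whereas the reverse complement also swaps A/T and C/G, so the two give different references to map against.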
I will also point you to a derivative work I am currently supporting as part of Oxford Nanopore Technologies which was released just last week: https://github.com/nanoporetech/tombo This software adds many additional capabilities to the nanoraw framework and should likely be used instead of nanoraw moving forward.
Thanks a lot!
I tried using nanoraw to re-squiggle some DNA datasets. It works fine with an E. coli dataset; however, when I use nanoraw to re-squiggle an RNA dataset and a human dataset, few reads align back to the reference. I also found that the sequence extracted by nanoraw is not consistent with the basecalled sequence given in the fast5 file.
Looking into the code, I found a potential bug in the fix_stay_states function in resquiggle.py. The sequence is not extracted correctly when the move of an event is 2, because only the third base of the kmer is included for every event whose move > 0: https://github.com/marcus1487/nanoraw/blob/master/nanoraw/resquiggle.py#L770 https://github.com/marcus1487/nanoraw/blob/master/nanoraw/resquiggle.py#L803
As a result, the sequence extracted by nanoraw differs substantially from the basecalled sequence when there are many moves with a step size of 2, which leads to failed alignments with either graphmap or BWA-MEM.
Attached are an example fast5 file, the sequence generated by nanoraw (ch138_strand_Nanoraw.fasta), the sequence extracted from the fast5 file and converted to DNA (ch138_strand_cdna.fastq), and the reference. test.zip
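A small sketch of the effect described here, in plain Python. This is not the nanoraw code: the event list, the 5-mer length and both function names are assumptions made up for illustration. It compares rebuilding the sequence from every newly revealed base with taking only the middle (third) base of each event that moved.

```python
# Toy event table: (kmer, move) pairs over the made-up sequence ACGTTGCAAT.
# The first event is treated as the starting position.
events = [
    ("ACGTT", 1),
    ("CGTTG", 1),
    ("CGTTG", 0),  # stay: no new base
    ("TTGCA", 2),  # skip: the kmer GTTGC was never emitted as its own event
    ("TGCAA", 1),
    ("GCAAT", 1),
]


def full_sequence(events):
    """Rebuild the sequence by appending the `move` new bases of each event."""
    seq = events[0][0]
    for kmer, move in events[1:]:
        if move > 0:
            seq += kmer[-move:]
    return seq


def middle_base_sequence(events):
    """Take only the third base of each event that moved (plus the first event)."""
    bases = []
    for i, (kmer, move) in enumerate(events):
        if i == 0 or move > 0:
            bases.append(kmer[2])
    return "".join(bases)


print(full_sequence(events))         # ACGTTGCAAT
print(middle_base_sequence(events))  # GTGCA
```

In this toy case the middle bases of the underlying sequence would be GTTGCA, so one base is lost at the move of 2. With many such moves the extracted sequence drifts further from the basecalled one.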