Issue with extracting out the sequence with genome_resquiggle with direct RNA sequencing

awjga commented 6 years ago

Hi,

I would like to highlight an issue about the extracting of the sequences from the basecalled fast5 to be used in mapping for genome_resquiggle.

From my understanding of your code, you are extracting the character middle of "model_state", and the value in "move" must be greater than 0. The extracted character would be combined and will be used for mapping.

The extraction of the sequences is based on the assumption the most of the signal would only move by 1 nucleotide? This is where the issue is. As I am using direct RNA sequencing, ~16% of my signal in 1 read have the move value = 2, and it occurs throughout the read. Thus, the extracted sequence from nanoraw is quite different from the basecalled sequences. And the sequence in nanoraw and from albacore will be flipped and there is U instead of T in the albacore sequences, as it is direct RNA sequencing.
There is 878 1 and 201 2 in my read and the sequenced length is 1284, and the extracted sequence length from nanoraw is 1079.

I have tested this with the control RNA, and it is about 1.2kb long.

marcus1487 commented 6 years ago

Hi,

A large number of skip sequences will be an issue for nanoraw, but only as mapping is concerned. Once a read is mapped, the genomic sequence is the only sequence used in the nanoraw framework. Thus a large number of skip states from RNA would cause an issue with fewer reads likely mapping, but should not cause major issues if the nanoraw extracted read maps to the genome.

As for the U to T conversion, the Event tables still contain T's (while the Fastq slot contains U's), and since nanoraw extracts sequence for the Events table the U's in the Fastq are not an issue for nanoraw.

On the other hand the reversed sequence is an issue for nanoraw. Currently, the recommended hack is to reverse the genome/transcriptome to which you are mapping. Obviously this would mean that any coordinates would be flipped as well, but for visualization purposed etc the whole nanoraw framework should be work (these issues not withstanding). Not sure there are great tools for this, but I hacked something together with Biopython to accomplish this task.

I hope this helps and good luck!

awjga commented 6 years ago

Hi,

Thanks for the calcification! So it means that the skipping would only affects the number of reads that can be analyzed, but it should not have any major issues to assign the signal to the base?

marcus1487 / nanoraw

Issue with extracting out the sequence with genome_resquiggle with direct RNA sequencing #47