nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.
Other
231 stars 54 forks source link

Output of #66

Closed Somebodyatthdoor closed 6 years ago

Somebodyatthdoor commented 6 years ago

Hi,

Sorry if this is a rather simple question, I am quite new to bioinformatics. I have been running tombo on some RNA sequences and have been trying to perform de novo base detection. After running

tombo test_significance --fast5-basedirs fast5location --statistics-file-basename filename

tombo plot_most_significant --fast5-basedirs fast5location --statistics-filename statslocation --plot-standard-model

The output pdf file looks a little strange, the raw signal does not seem to match the overlay. I am presuming I have done something wrong and would be very grateful for some advice. tombo_results.significant_difference.pdf

marcus1487 commented 6 years ago

Hi,

These certainly are quite strange signal assignments. My best guess is that these are likely assignments of signal to regions of sequence that do not actually exist in your sample. For example if you are mapping spliced mRNA to a genomic reference, you could get these types of signal assignment through introns. This can also happen around real genomic/transriptomic deletions or insertions if this was mapped to a transcriptome reference. There are some other tips on direct RNA processing here in the documentation that might be helpful.

These reads should have better signal assignment through neighboring regions, or else they would have been filtered out based on poor expected to observed signal matching. Hopefully this help!

mw55309 commented 6 years ago

Hi Marcus

These are alignments direct to transcripts and these are definitely transcripts expressed in the sample. They are genes we know are turned on in these samples and they have many 1000s of reads...

Cheers

Mick

mw55309 commented 6 years ago

When will the recommended RNA processing pipeline be posted?

marcus1487 commented 6 years ago

These are certainly very bad alignments in this case. It could be that this is an issue with the RNA model included with tombo, which is currently generated from 200mV data. I will have a new 180mV model included in the next release. There are also some other signal normalization and processing updates that should improve regions like this in the upcoming release.

If you are mapping to a transcriptome reference then the recommended rna processing pipeline will not be of help here. That processing pipeline is intended for users starting from a genomic reference and annotated genes. I will update the docs to reflect this point.

mw55309 commented 6 years ago

Thanks Marcus :)

Is there any way we can view / export, for each position in the transcript, the values/distribution of signal at that position; and related, how do we see the expected distribution from the model? Cheers Mick

marcus1487 commented 6 years ago

Hi Mick, For the signal distribution part, there is an ‘—overplot-type’ option to the plotting commands. Here are some examples of (these plots)[https://nanoporetech.github.io/tombo/plotting.html#over-plotting]. The density plot type would probably be best here.

For genome browser text output, the write_wiggles command has a mean signal output option (docs here)[https://nanoporetech.github.io/tombo/text_output.html#write-wiggles]. I don’t have an option to output the expected model levels, but could look into adding that option later. This output might be useful since these poor fitting regions will likely have pretty flat average signal. It might give you an idea of how pervasive is this poor signal assignment.

I will warn that these 2 commands are currently kind of slow since all plotted signal levels must be extracted from the fast5 files.

mw55309 commented 6 years ago

Sorry Marcus, I'm not sure I understand.

What I would love is:

For each position in the transcript, for each read that aligns to that position - the signal for that read at that position. I don't want the mean.

Basically, because you don't have good models, we are going to have to compare two samples. So I can get a feel for the data, what I will want to do is compare the distribution of signal at each position in one sample with the distribution of signal at the same position in a different sample. For this, I guess I will need the aforementioned data.

What is the best way to get that? (I don't want to plot it, I want the values)

marcus1487 commented 6 years ago

Hi Mick, this is certainly possible. I tried to get more of a cleanup done on the Tombo API with this release, but more pressing issues prevailed. There are 2 different methods to access to the data that would likely be helpful here.

The first approach would be extracting the Tombo-normalized signal levels for each read at a given genomic location, as you requested. These signal levels are close to median normalized values but with a genome sequence-dependent correction to the shift and scale parameters in this recent release (docs here). Fair warning that I am likely to change this interface slightly in a patch version at some point, but that will hopefully come with an associated API examples section in the docs, so I will try to remember to link that here. Here is a bit of code to extract these signal levels from the python Tombo API in the current version (1.3):

from tombo import tombo_helper as th, plot_commands as tpc
chrm, strand, int_start, int_end = 'chr20', '+', 10000, 10100

fast5_dirs = ['test_data/native_reads',]
fast5s_data = th.parse_fast5s(fast5_dirs, 'RawGenomeCorrected_000', ['BaseCalled_template',])
cs_reads = [r_data for r_data in fast5s_data[(chrm, strand)] if not r_data.filtered]
reg_base_levels = tpc.get_reg_events(cs_reads, int_start, int_end, strand, read_rows=False, num_reads=None)

Here the reg_base_levels will be a numpy array where each row is a read and each column is a genomic position (with nan values where reads did not map to the whole region). You can flip the row/column or downsample reads with the last 2 options. Also note that the int_start and int_end are 0-based closed-open coordinates and the numpy array is 0-based indexing for genomic position starting from the int_start position. Hopefully that mades sense.

The other part of the Tombo API that might be useful is the PerReadStats interface if you have per-read testing values that you would like to process. I have discussed this issue previously (and this interface is less likely to change soon). Here is a comment with a minimal code example to access this interface (and the thread has some useful information as well).

Hopefully this helps, and I will try to get the API cleanup and API examples section together soon since these requests are becoming more common, and github issue pages are likely not the best place to store this info.

marcus1487 commented 6 years ago

Hi @Somebodyatthdoor and @mw55309 , the current release (1.4) now makes this task much more concise with a complete python API re-write. You can see that this task is essentially the first example in the API docs here.

Do note here if you run into trouble with the new API. Thanks!

marcus1487 commented 6 years ago

Also note that the get_ref_from_seq function has been added to the API in order extract the expected signal levels from a bit of genomic sequence (as this was requested above as well).

I'll also note here that the RNA re-squiggle procedure and models are drastically improved with this release, so you may be okay with the standard Tombo model testing with this version.