nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

Trimming CTC data #253

Open mbhall88 opened 2 years ago

mbhall88 commented 2 years ago

Hi,

I am having some issues relating to demultiplexing. I have trained a custom model, it performs extremely well on the DNA of my species of interest, but falls over when it comes to demultiplexing. I am losing more than half of my data to the dreaded "none" bin.

In https://github.com/nanoporetech/bonito/issues/26#issuecomment-613401666 it was suggested that trimming the signal could improve this - and this makes a lot of sense. However, the example there assumes trimming in the process of chunkifying a HDF5 file from taiyaki.

I have the chunk data already (from basecalling with --save-ctc) and would like to trim this to achieve the same result as trimming the signal at the starts and the ends by some offset. (I basically want to get rid of signal that relates to the barcode.)

What I am struggling with is how best to do this, as I don't know what each of the chunkify output files contains.

For example, the reference_lengths.npy file has shape (35691,), references.npy has shape (35691, 482), and chunks.npy has shape (35691, 4000). How do each of these files relate to each other?

Let's say I want to trim 100 signal samples from the start and end of each read, how would I do this? (I am open to suggestions for offset sizes - this was an arbitrary number).
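For what it's worth, a minimal sketch of what I had in mind (assuming row `i` of each array describes the same training example, i.e. `chunks[i]` is the raw signal and `references[i][:reference_lengths[i]]` its base labels; the function name and the small synthetic array are just for illustration):

```python
import numpy as np

def trim_chunks(chunks: np.ndarray, trim: int) -> np.ndarray:
    """Drop `trim` signal samples from the start and end of every chunk."""
    return chunks[:, trim:chunks.shape[1] - trim]

# Synthetic stand-in for chunks.npy, which in my case has shape (35691, 4000);
# using a small first dimension here just to illustrate the slicing.
chunks = np.zeros((4, 4000), dtype=np.float32)
trimmed = trim_chunks(chunks, 100)
print(trimmed.shape)  # (4, 3800)
```

The part I can't work out is the labels: if signal is cut from the ends, presumably some entries of `references.npy` (and the corresponding `reference_lengths.npy` values) should be dropped too, but the signal-to-base alignment needed to know how many doesn't seem to be stored in these files.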

touala commented 10 months ago

Any development on this issue? Thanks in advance.

mbhall88 commented 10 months ago

Sadly, no @touala. I never got any response here, and I ended up having to abandon my project because of it. I tried many different ways of trimming the data but unfortunately couldn't fix the demultiplexing problem.

touala commented 9 months ago

Thanks for the response @mbhall88. I'm currently doing the demultiplexing with the ONT model and then redoing the basecalling with my custom model... Not great, but it seems ok. I'll revisit this soon, as I need to update my whole workflow. Hopefully things have improved since I last tried.