nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

Bonito Convert #123

Open addyblanch opened 3 years ago

addyblanch commented 3 years ago

I've been through the Taiyaki pipeline to create an HDF5 file, which I plan to convert into a Bonito model. I seem to have hit a snag. Any suggestions on what the issue is?

$ bonito convert --chunks 1000000 HQ_mapped.hdf5 model/
Traceback (most recent call last):
  File "/bonenv/bin/bonito", line 33, in <module>
    sys.exit(load_entry_point('ont-bonito==0.3.5', 'console_scripts', 'bonito')())
  File "/bonenv/lib/python3.6/site-packages/bonito/__init__.py", line 39, in main
    args.func(args)
  File "/bonenv/lib/python3.6/site-packages/bonito/cli/convert.py", line 109, in main
    reads = h5py.File(args.chunkify_file, 'r')['Reads']
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/bonenv/lib/python3.6/site-packages/h5py/_hl/group.py", line 288, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'Reads' doesn't exist)"
iiSeymour commented 3 years ago

Hey @addyblanch

The converter expects a top-level key named Reads. Can you check your file in Python with:

>>> import h5py
>>> f = h5py.File('HQ_mapped.hdf5', 'r')                                       
>>> list(f.keys())
['Reads']
addyblanch commented 3 years ago

Hi @iiSeymour, did you mean print output from?

This is what I get:

>>> import h5py
>>> f = h5py.File('HQ_mapped.hdf5', 'r')
>>> list(f.keys())
['Batches', 'read_ids']

iiSeymour commented 3 years ago

Was HQ_mapped.hdf5 output by prepare_mapped_reads.py? If so, you should have ended up with an HDF5 file like this.

addyblanch commented 3 years ago

Yes it was, but I did the alignment step with minimap2 rather than guppy_aligner. Would that cause this issue?

iiSeymour commented 3 years ago

Which version of Taiyaki are you using @addyblanch?

addyblanch commented 3 years ago

Based on the changelog in the directory, v5.3.0.

jackwadden commented 3 years ago

I'm also having this issue with v5.3.0 and have the same top level keys: ['Batches', 'read_ids']

addyblanch commented 3 years ago

> I'm also having this issue with v5.3.0 and have the same top level keys: ['Batches', 'read_ids']

Hopefully they are working on a solution.

jackwadden commented 3 years ago

@addyblanch I was able to downgrade to Taiyaki 5.0.0 and it worked. The issue seems to stem from this Taiyaki change in 5.2 linked to by @iiSeymour

The batched variant of the HDF5 mapped signal format was introduced in version 5.2. This variant replaces the Reads group with a Batches group. Each group within the Batches group contain the same set of attributes and datasets listed in the table above, but these values for a set of reads are concatenated together into one dataset per batch.

I might take a stab at trying to fix it this weekend and will send a fork along if I manage to get it working before @iiSeymour
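For anyone hitting the same KeyError, the difference between the two layouts described above can be checked from the top-level keys alone. The following is a minimal sketch, not part of Bonito or Taiyaki: the function name `mapped_signal_layout` and the labels it returns are made up for illustration, and it only assumes what this thread reports, namely that the older files have a top-level `Reads` group and the batched files have `Batches`.

```python
from typing import Iterable


def mapped_signal_layout(top_level_keys: Iterable[str]) -> str:
    """Classify a mapped-signal file by its top-level HDF5 keys.

    Hypothetical helper: the labels are illustrative, not Taiyaki terminology.
    """
    keys = set(top_level_keys)
    if "Reads" in keys:
        # Layout that `bonito convert` opens via ['Reads'] in the traceback above.
        return "per-read"
    if "Batches" in keys:
        # Batched variant, introduced in Taiyaki 5.2 according to the quoted docs.
        return "batched"
    return "unknown"


# With h5py installed, the keys would come from the file itself:
#   import h5py
#   with h5py.File("HQ_mapped.hdf5", "r") as f:
#       print(mapped_signal_layout(f.keys()))
print(mapped_signal_layout(["Batches", "read_ids"]))  # batched
```

A result of "batched" would mean the file still needs to be regenerated in the older layout before `bonito convert` can read it.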

addyblanch commented 3 years ago

Thanks @jackwadden for the heads up, that would be a great workaround in the short term.

jackwadden commented 3 years ago

This turned out to be a small bug in Taiyaki, and was an easy fix. I've submitted a pull request with the fix here.

addyblanch commented 3 years ago

That's amazing, thanks @jackwadden! I've made the edit on my end and set it to rerun. Fingers crossed.

jackwadden commented 3 years ago

I'm having issues with bonito now that were resolved by downgrading back to Taiyaki 5.0.0. The specific error was thrown by parasail. Let me know if you get a similar error. I'm back in the territory where it's most likely a problem with my code, but it would be nice to know if you run into something similar.

addyblanch commented 3 years ago

Hi @jackwadden unfortunately no dice. Same error as before minus the last line

KeyError: "Unable to open object (object 'Reads' doesn't exist)"

Is there a fix in the works, @iiSeymour? If not, I'll downgrade Taiyaki and try again soon.

jackwadden commented 3 years ago

@addyblanch the fact that the 'Reads' group doesn't exist means that Taiyaki (probably) still isn't emitting the non-batched version. Are you seeing the same output from list(f.keys())? You might have to re-install Taiyaki. Maybe pop a print("changed") in main() to see if your changes are actually being adopted.
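A complementary check to the print("changed") idea is to ask Python which file it would actually import for a package, since an edited checkout is easy to shadow with an older installed copy. This is a rough sketch using only the standard library: `module_origin` is a hypothetical helper name, and the assumption that Taiyaki is importable as the top-level module `taiyaki` should be verified in your own environment.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Assumption: the package is importable as "taiyaki" in the active environment.
# Prints None if no such module is installed there.
print(module_origin("taiyaki"))
```

If the printed path points into site-packages rather than the directory you edited, the patched code is not the one being run and a re-install is needed.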

Another option might be to just use bonito end-to-end. I don't know what your use-case is, but you might be able to use this method to prepare reads and train a model. Just omit the --pretrained <model> option when you train.

Good luck.

addyblanch commented 3 years ago

Hi @jackwadden, yes, same output from list(f.keys()). I will have a go with version 5.0.0 in the coming weeks.

I have tried the end-to-end bonito model training but it didn't solve our issues (it made the assemblies slightly worse), so I was following this (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03856-0) as they seemed to have some success. I work on Streptococcus, and any genome we try to sequence seems to end up inflated in size and includes an awful number of pseudogenes (we suspect due to errors causing erroneous start and stop codons).