Hey @menickname
That sounds like a great idea and I'm happy to help.
Did you see the notebook directory which you can run in Google Colab?
Bonito uses the same HDF5 chunkify format as Taiyaki, so you'll want to start with that; there is good documentation in the Taiyaki repo. The training file shipped with Bonito has 66,000 reads split equally across conditions, so you will want to filter your own reads to take an equal sample of reads per genome. Once you have created a chunkify file using taiyaki/prepare_mapped_reads.py you can then use bonito/convert-data, which will produce 4 .npy files ready for training -
>>> # references.npy
>>> # Integer encoded target sequence {'A': 1, 'C': 2, 'G': 3, 'T': 4}
>>> # Variable length and zero padded (default range between 128 and 256).
>>> np.load('references.npy').shape
(1000000, 256)
>>> np.load('references.npy').dtype
dtype('uint8')
>>>
>>> # reference_lengths.npy
>>> # Lengths of target sequences in references.npy
>>> np.load('reference_lengths.npy').shape
(1000000,)
>>> np.load('reference_lengths.npy').dtype
dtype('uint8')
>>>
>>> # chunks.npy
>>> # Sections of squiggle that correspond with the target reference sequence
>>> # Variable length and zero padded (up to 4096 samples).
>>> np.load('chunks.npy').shape
(1000000, 4096)
>>> np.load('chunks.npy').dtype
dtype('float32')
>>>
>>> # chunk_lengths.npy
>>> # Lengths of squiggle sections in chunks.npy
>>> np.load('chunk_lengths.npy').shape
(1000000,)
>>> np.load('chunk_lengths.npy').dtype
dtype('uint16')
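As a quick sanity check on the converted data, a minimal sketch (assuming the four .npy files described above are in the current directory) would be something like:

import numpy as np

# integer-encoded references and raw signal chunks, as described above
references = np.load('references.npy')              # (N, 256) uint8, zero padded
reference_lengths = np.load('reference_lengths.npy')
chunks = np.load('chunks.npy')                      # (N, 4096) float32, zero padded
chunk_lengths = np.load('chunk_lengths.npy')

# all four arrays should describe the same number of training examples
assert len(references) == len(reference_lengths) == len(chunks) == len(chunk_lengths)

# decode the first target sequence by trimming the zero padding
alphabet = {1: 'A', 2: 'C', 3: 'G', 4: 'T'}
first = references[0][:reference_lengths[0]]
print(''.join(alphabet[int(b)] for b in first))

# the matching squiggle section, also trimmed to its real length
signal = chunks[0][:chunk_lengths[0]]
print(signal.shape, signal.dtype)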
Bonito uses CTC to train; this is a good article explaining how it works: https://distill.pub/2017/ctc/. For training, you just need to give a directory to store the model output, the model config, and use --directory to select your custom .npy training files.
$ bonito train /data/custom-model-dir config/quartznet5x5.toml --directory /data/custom-chunk-dir
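For intuition on the CTC objective linked above, here is a minimal, self-contained PyTorch sketch of the loss on random data (not Bonito's actual training loop; the shapes are made up):

import torch
import torch.nn as nn

T, N, C = 400, 8, 5                    # time steps, batch size, classes (blank + A, C, G, T)
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, 50), dtype=torch.long)      # integer-encoded bases 1..4
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 50, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)              # blank label 0, matching the encoding above
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())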
Give it a shot and let me know if anything isn't clear or if you run into any problems.
Chris.
Dear @iiSeymour
Thanks a lot for your prompt response. I will have a look into that. I think I needed this short summary to get from step A to B and so forth to the bonito train step. I did end up reading the notebook directory, but I got confused between the issues on GitHub and the notebook. I will try your proposed flow and see how it goes from there. If I encounter any further problems, I will contact you.
One small question for now:
"The training file shipped with Bonito has 66,000 reads split equal across conditions so you will want to filter your own reads to take an equal sample of reads per genome."
This means I should also stick to roughly 66,000 reads (I don't think the exact size matters here?). I would like to train the model for 1 specific species, as I think the poor basecalling performance is due to the high AT content and numerous repeat stretches of the bacteria we are currently working on. However, I have 10 independent read files of the same isolate, which I might split to get an equal sample of reads as input to train Bonito? If so, I might start training with 1 single dataset first and then retrain Bonito with a mixture of all 10 independent runs?
Thank you in advance. Best regards, Nick
Dear @iiSeymour
Once again, thank you for your answer earlier today. I have been "playing" around with Taiyaki today and succeeded in getting to the final step, prepare_mapped_reads.py. For this step we have to supply a model file. Currently, I have downloaded and used the r941_dna_minion.checkpoint file available in the taiyaki_walkthrough for Guppy. I was still wondering the following:
Thank you in advance. Kind regards, Nick Vereecke
Right, the exact size (66,000) isn't important; it was more just to let you know that you don't need millions of reads for model training. Stick with Guppy for both questions (1, 2); you take the reference sequence from the mapping anyway, so the caller isn't important.
HTH,
Chris.
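On the "equal sample of reads per genome" point, a rough sketch for subsampling read ids before building the chunkify file (hypothetical file names, not part of Bonito or Taiyaki):

import random

# hypothetical input: one text file of read ids per genome/run
read_id_files = ['genome_a_read_ids.txt', 'genome_b_read_ids.txt']
reads_per_genome = 6600                  # adjust to taste, the exact size isn't critical

sampled = []
for path in read_id_files:
    with open(path) as f:
        ids = [line.strip() for line in f if line.strip()]
    # take the same number of reads from each genome (or all of them if fewer)
    sampled.extend(random.sample(ids, min(reads_per_genome, len(ids))))

# write a combined list to filter the reads fed into taiyaki/prepare_mapped_reads.py
with open('sampled_read_ids.txt', 'w') as f:
    f.write('\n'.join(sampled))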
Dear @iiSeymour
I have successfully generated the chunkify file; however, when running the Bonito convert-data command I get the following message:
$ ./convert-data --chunks 1000000 ../../Bonito_Train/Bonito_Nick.hdf5 ../../Bonito_Train/
[ 0.07494327 0. -0.04496622 ... 0.3897055 0.20984128
0.52460355]
Traceback (most recent call last):
File "./convert-data", line 226, in
I went through all the previous steps again, however could not figure out where something went wrong. Thank you in advance.
Sorry @menickname, I had left a bad debug print in the conversion script, can you do a git pull and try again.
Indeed! Performs perfectly now. The 4 .npy files have been generated and I will try to start up the training now on HPC GPU resources.
Thank you for your support.
Dear @iiSeymour
The first Bonito train is currently running on our HPC-GPU cluster. However, with the current release we are not able to run on multiple GPUs. In https://github.com/nanoporetech/bonito/issues/13#issuecomment-597711735 I saw you suggested adding some patches to the train.py file. Is there a specific reason this has not been implemented in the code itself yet?
Thank you in advance.
On the multi-gpu support:
I found DataParallel could hang multi-gpu systems without NVLink/NVSwitch, so I haven't merged it yet. Guarding the import and uses behind --multi-gpu is probably safe, so I'll look to get this into master.
I will look to get this merged in today.
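For reference, the guard could look roughly like this (a sketch of the idea only, not the actual train.py patch):

import argparse
import torch
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("--multi-gpu", action="store_true", default=False)
args = parser.parse_args()

model = nn.Linear(10, 5)                 # stand-in for the basecalling network
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

if args.multi_gpu and torch.cuda.device_count() > 1:
    # only wrap the model when explicitly requested, so systems without
    # NVLink/NVSwitch don't hit the DataParallel hang by default
    model = nn.DataParallel(model)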
@iiSeymour Do you have more information on that hang issue?
How does it manifest, etc., so we can keep an eye on it?
There is a PyTorch thread on the issue here.
The hanging happens on the import of DataParallel, so you never see the first epoch progressing. See this comment, and try running p2pBandwidthLatencyTest, which can be found in the samples directory of your CUDA install, to check your system's device-to-device latency.
This is all single node multi-gpu b.t.w and I can confirm Nvidia DGX systems are fine as they use NVLink/NVSwitch. Is multi-node something you are wanting to look at?
p2pBandwidthLatencyTest doesn't seem to be there anymore in CUDA 10.1.243...
We have 4 NVIDIA Tesla V100 GPUs in each box, so we're only looking at single node multi-GPU for now, which would be a nice step up over single GPU. A next step could be multi-node, since our GPU cluster has Infiniband (double EDR), but focus is on single-node multi-GPU for now...
Seems like we do have NVLink:
$ nvidia-smi nvlink -s
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-7de090ff-01c3-5997-3e7a-af13b14a1b86)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
Link 2: 25.781 GB/s
Link 3: 25.781 GB/s
Link 4: 25.781 GB/s
Link 5: 25.781 GB/s
# similar output for other 3 GPUs
Yes, it looks like you are good!
The samples are optional; they are here if you are interested - https://github.com/NVIDIA/cuda-samples
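If building the CUDA samples is a hassle, a much cruder Python-side check of device-to-device peer access can be done from PyTorch (a sketch, not a replacement for p2pBandwidthLatencyTest):

import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            # True means device i can access device j's memory directly (P2P),
            # which is what NVLink/NVSwitch systems should report
            print(i, j, torch.cuda.can_device_access_peer(i, j))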
Dear @iiSeymour
Bonito train is running perfectly on multi-gpu after installing the patched Bonito version. Please find the functional version here: https://gist.github.com/boegel/e0c303497ba6275423b39ea1c10c7d73 from @boegel. This one works on our HPC-GPU resources and fixes 2 of the aforementioned issues.
Still 2 questions:
Thank you in advance.
For distributed training, I think @iiSeymour is planning to look into adding support for leveraging RaySGD from Bonito. I wish I had time to help out with that, but I have way too much on my plate already... If the necessary changes have been made to Bonito, I'm certainly willing to help test it though!
Version 0.1.2 is on PyPI now and includes the multi-gpu patch along with the reworked solution for including the old scripts directory. What was scripts/convert-data is now available as bonito convert, and bonito download has been added for handling pretrained models and training data; full details here.
Hi, thanks for your tool :-) I am trying to train an RNA model... but Bonito only has DNA models, so I have used Taiyaki as a starting point:
$ prepare_mapped_reads.py ../fast5/ params_per_read.tsv outputTayaki.fast5 ~/softs/taiyaki/models/mGru_flipflop.py refFromSam.fasta
Like @menickname, I have successfully generated the chunkify file, I think:
>>> import h5py
>>> f = h5py.File('outputTayaki.fast5', 'r')
>>> list(f.keys())
['Reads']
With Taiyaki v5.0.0 only, because with the latest version (with mLstm_flipflop.py, advised for Guppy compatibility), I had:
>>> list(f.keys())
['Batches', 'read_ids']
So, I'm happy :-) and I hope to be on the right track.
However, when running the bonito convert command I get the following message:
$ bonito -v
bonito 0.6.1
$ bonito convert --chunksize 1000000 outputTayaki.fast5 bonito_convert
/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py:76: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  reads = np.random.permutation(sorted(reads.items()))
preparing training chunks
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/xxx/miniconda3/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py", line 113, in main
    training_chunks = chunk_dataset(training, args.chunksize)
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py", line 70, in chunk_dataset
    chunks, targets = zip(*tqdm(take(all_chunks, num_chunks), total=num_chunks))
ValueError: not enough values to unpack (expected 2, got 0)
I have tried with --chunksize 100000 but I get the same error. Can you help me please? Thank you in advance :-)
Without the chunksize argument (the default is 3600...) it works! But I have a doubt: what is the impact of this chunksize parameter? Thanks for your answer.
Dear @iiSeymour
I am currently experimenting with the new Bonito basecaller (in comparison with the Guppy basecaller).
Recently, we have generated a big dataset of bacterial ONT sequences from a highly AT-rich organism. In a de novo assembly, we noticed we were not able to reach good final assembly Q-scores with the Guppy basecaller, in comparison to the Illumina data we generated. As such, we started investing some time in running the Bonito basecaller, followed by de novo assembly. However, no improvement was observed. We are aware the current Bonito basecaller has not been trained with this type of data, so we first checked whether our Bonito basecaller was performing well on an in-house generated E. coli dataset, which rendered highly improved final genome assembly Q-scores (in line with the Illumina Q-scores) with Bonito in comparison to Guppy.
Now we would like to train the Bonito basecaller with our own organism dataset(s) to reach similarly improved genome assembly Q-scores as with the E. coli dataset. I have been reading through multiple posts on GitHub about Bonito, but could not figure out a clear workflow on how to train the Bonito basecaller. Is it possible to get a (little) walk-through on how to train the Bonito model with our own input data?
Thank you in advance. Best regards, Nick Vereecke