nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

Workflow Training new Bonito Model #22

Closed menickname closed 4 years ago

menickname commented 4 years ago

Dear @iiSeymour

I am currently experimenting with the new Bonito basecaller (in comparison with the Guppy basecaller).

Recently, we generated a large dataset of bacterial ONT sequences from a highly AT-rich organism. In a de novo assembly, we noticed we are not able to reach good final assembly Q-scores in comparison to the corresponding Illumina data when basecalling with Guppy. As such, we started investing some time in running the Bonito basecaller, followed by de novo assembly, but no improvement was observed there either. We are aware the current Bonito basecaller has not been trained on this type of data, so we first checked whether Bonito performed well on an in-house generated E. coli dataset, which rendered highly improved final genome assembly Q-scores (in line with the Illumina Q-scores) when using Bonito in comparison with Guppy.

Now we would like to try to train the Bonito basecaller with our own organism dataset(s) to reach similarly improved genome assembly Q-scores as with the E. coli dataset. I have been reading through multiple posts on GitHub about Bonito, but could not figure out a clear workflow on "how to train the Bonito basecaller". Is it possible to get a (little) walk-through on how to train the Bonito model with our own input data?

Thank you in advance. Best regards, Nick Vereecke

iiSeymour commented 4 years ago

Hey @menickname

That sounds like a great idea and I'm happy to help.

Did you see the notebook directory which you can run in Google Colab?

Bonito uses the same HDF5 chunkify format that Taiyaki uses, so you'll want to start with that; there is good documentation in the Taiyaki repo. The training file shipped with Bonito has 66,000 reads split equally across conditions, so you will want to filter your own reads to take an equal sample of reads per genome. Once you have created a chunkify file using taiyaki/prepare_mapped_reads.py you can then use bonito/convert-data, which will produce 4 .npy files ready for training -

>>> # references.npy
>>> # Integer encoded target sequence {'A': 1, 'C': 2, 'G': 3, 'T': 4}
>>> # Variable length and zero padded (default range between 128 and 256).
>>> np.load('references.npy').shape
(1000000, 256)
>>> np.load('references.npy').dtype
dtype('uint8')
>>>
>>> # reference_lengths.npy
>>> # Lengths of target sequences in references.npy
>>> np.load('reference_lengths.npy').shape
(1000000,)
>>> np.load('reference_lengths.npy').dtype
dtype('uint8')
>>>
>>> # chunks.npy
>>> # Sections of squiggle that correspond with the target reference sequence
>>> # Variable length and zero padded (up to 4096 samples).
>>> np.load('chunks.npy').shape
(1000000, 4096)
>>> np.load('chunks.npy').dtype
dtype('float32')
>>>
>>> # chunk_lengths.npy
>>> # Lengths of squiggle sections in chunks.npy 
>>> np.load('chunk_lengths.npy').shape
(1000000,)
>>> np.load('chunk_lengths.npy').dtype
dtype('uint16')
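
As a quick illustration (not part of Bonito itself), assuming the four .npy files above are in the current directory, this is how chunk i pairs with its zero-padded target and how the integer encoding maps back to bases:

import numpy as np

chunks = np.load('chunks.npy')                      # (N, 4096) float32 squiggle sections
chunk_lengths = np.load('chunk_lengths.npy')        # (N,) valid samples per chunk
references = np.load('references.npy')              # (N, 256) integer-encoded targets, 0 = padding
reference_lengths = np.load('reference_lengths.npy')

i = 0
signal = chunks[i, :chunk_lengths[i]]               # strip the zero padding from the squiggle
target = references[i, :reference_lengths[i]]       # strip the zero padding from the target
sequence = ''.join('NACGT'[b] for b in target)      # decode {1: 'A', 2: 'C', 3: 'G', 4: 'T'}
print(len(signal), len(sequence), sequence[:20])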

Bonito trains with CTC; this is a good article explaining how it works: https://distill.pub/2017/ctc/. For training, you just need to give a directory to store the model output, a model config, and use --directory to select your custom .npy training files.

$ bonito train /data/custom-model-dir config/quartznet5x5.toml --directory /data/custom-chunk-dir
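
Purely to make the CTC setup concrete (this is not Bonito's actual training loop, just a sketch feeding a few of the converted arrays to torch.nn.CTCLoss, with a random tensor standing in for the network output):

import numpy as np
import torch

chunks = np.load('chunks.npy')[:8]                 # used only for its shape here
targets = torch.from_numpy(np.load('references.npy')[:8].astype(np.int64))
target_lengths = torch.from_numpy(np.load('reference_lengths.npy')[:8].astype(np.int64))
input_lengths = torch.from_numpy(np.load('chunk_lengths.npy')[:8].astype(np.int64))

# Stand-in for the network output: T timesteps x N reads x 5 classes (blank + ACGT).
# A real model would downsample the 4096-sample chunks to far fewer timesteps.
T, N, C = chunks.shape[1], chunks.shape[0], 5
log_probs = torch.randn(T, N, C).log_softmax(2)

loss = torch.nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())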

Give it a shot and let me know if anything isn't clear or if you run into any problems.

Chris.

menickname commented 4 years ago

Dear @iiSeymour

Thanks a lot for your prompt response. I will have a look into that. I think I needed this short summary to get from step A to step B and so forth, up to bonito train. I did end up reading the notebook directory, but I got confused between the issues on GitHub and the notebook. I will try your proposed workflow and see how it goes from there. If I encounter any further problems, I will contact you.

One small question for now:

"The training file shipped with Bonito has 66,000 reads split equal across conditions so you will want to filter your own reads to take an equal sample of reads per genome."

This means I should also stick to roughly 66,000 reads (I don't think the exact size matters here?). I would like to train the model for one specific species, as I think the basecalling performance suffers from the high AT content and the numerous repeat stretches of the bacterium we are currently working on. However, I have 10 independent read files from the same isolate, which I might split to get an equal sample of reads as training input. If so, I might start by training on a single dataset first and then retrain Bonito with a mixture of all 10 independent runs?
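
To make the splitting idea concrete, here is a hypothetical sketch (the paths and per-run count are made up, and it assumes single-read fast5 files) of taking an equal random sample from each run before building the chunkify file:

import random
from pathlib import Path

# Hypothetical layout: one directory of single-read fast5 files per run.
runs = [Path(f'/data/isolate/run_{i:02d}') for i in range(1, 11)]
per_run = 6600                        # ~66,000 reads in total across the 10 runs
random.seed(0)

sample = []
for run in runs:
    fast5s = sorted(run.glob('*.fast5'))
    sample.extend(random.sample(fast5s, min(per_run, len(fast5s))))
print(len(sample), 'reads selected')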

Thank you in advance. Best regards, Nick

menickname commented 4 years ago

Dear @iiSeymour

Once again, thank you for your answer earlier today. I have been "playing" around with Taiyaki today and succeeded in getting to the final step, prepare_mapped_reads.py. For this step we have to supply a model file. Currently, I have downloaded and used the r941_dna_minion.checkpoint file from the taiyaki_walkthrough available for Guppy. I was still wondering the following:

  1. Should I use Bonito-basecalled data for the guppy_aligner step? I don't think so, as this shouldn't matter?
  2. Should I use a Bonito-specific model file as input for the prepare_mapped_reads.py step? I have been looking through the files available in the Bonito downloads, but could not find the correct model file for this purpose. If so, is there a starting file available to do this, or should I first generate it myself? The only model files I see in Bonito's dna_r9.4.1 directory are a conf.toml and a weights_1.tar file.

Thank you in advance. Kind regards, Nick Vereecke

iiSeymour commented 4 years ago

Right, the exact size (66,000) isn't important; it was more just to let you know that you don't need millions of reads for model training. Stick with Guppy for both questions (1 and 2); the reference sequence is taken from the mapping anyway, so the caller isn't important.

HTH,

Chris.

menickname commented 4 years ago

Dear @iiSeymour

I have successfully generated the chunkify file; however, when running the Bonito convert-data command I get the following message:

$ ./convert-data --chunks 1000000 ../../Bonito_Train/Bonito_Nick.hdf5 ../../Bonito_Train/
[ 0.07494327  0.         -0.04496622 ...  0.3897055   0.20984128  0.52460355]
Traceback (most recent call last):
  File "./convert-data", line 226, in <module>
    main(parser.parse_args())
  File "./convert-data", line 97, in main
    print(read_id, squiggle_duration, len(pointers), mapped_off_the_end, sequence_length)
NameError: name 'mapped_off_the_end' is not defined

I went through all the previous steps again but could not figure out where something went wrong. Thank you in advance.

iiSeymour commented 4 years ago

Sorry @menickname, I had left a bad debug print in the conversion script, can you do a git pull and try again?

menickname commented 4 years ago

@iiSeymour, indeed! It performs perfectly now. The 4 .npy files have been generated and I will try to start up the training now on HPC GPU resources.

Thank you for your support.

menickname commented 4 years ago

Dear @iiSeymour

The first Bonito training run is currently running on our HPC GPU cluster. However, with the current release we are not able to run on multiple GPUs. In issue https://github.com/nanoporetech/bonito/issues/13#issuecomment-597711735 I saw you suggested adding some patches to the train.py file. Is there a specific reason this has not been implemented in the code itself yet?

Thank you in advance.

iiSeymour commented 4 years ago

On the multi-gpu support:

I found DataParallel could hang multi-GPU systems without NVLink/NVSwitch, so I haven't merged it yet. Guarding the import and its uses behind --multi-gpu is probably safe, so I'll look to get this into master.

I will look to get this merged in today.
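
The kind of guard meant here, roughly (illustrative only, not the actual train.py code):

import argparse
import torch
from torch.nn import DataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--multi-gpu', action='store_true')
args = parser.parse_args()

model = torch.nn.Conv1d(1, 16, 5)     # stand-in for the basecaller network
if args.multi_gpu and torch.cuda.device_count() > 1:
    # only wrap when explicitly requested, since DataParallel has been seen to
    # hang on multi-GPU hosts without NVLink/NVSwitch
    model = DataParallel(model)
model.to('cuda' if torch.cuda.is_available() else 'cpu')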

boegel commented 4 years ago

@iiSeymour Do you have more information on that hang issue?

How does it manifest, etc., so we can keep an eye on it?

iiSeymour commented 4 years ago

There is a PyTorch thread on the issue here.

The hanging happens on the import of DataParallel, so you never see the first epoch progressing. See this comment and try running p2pBandwidthLatencyTest, which can be found in the samples directory of your CUDA install, to check your system's device-to-device latency.

This is all single-node multi-GPU, by the way, and I can confirm NVIDIA DGX systems are fine as they use NVLink/NVSwitch. Is multi-node something you are wanting to look at?
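
If the CUDA samples aren't to hand, a rough (and much more limited) check can be done from PyTorch itself; this only reports whether peer-to-peer access is possible between GPU pairs, it does not measure latency like p2pBandwidthLatencyTest does:

import torch

# Report peer-to-peer accessibility for every GPU pair on this node.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f'GPU {i} -> GPU {j}: peer access {"yes" if ok else "no"}')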

boegel commented 4 years ago

p2pBandwidthLatencyTest doesn't seem to be there anymore in CUDA 10.1.243...

We have 4 NVIDIA Tesla V100 GPUs in each box, so we're only looking at single-node multi-GPU for now, which would be a nice step up over a single GPU. A next step could be multi-node, since our GPU cluster has InfiniBand (double EDR), but the focus is on single-node multi-GPU for now...

boegel commented 4 years ago

Seems like we do have NVLink:

$ nvidia-smi nvlink -s
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-7de090ff-01c3-5997-3e7a-af13b14a1b86)
     Link 0: 25.781 GB/s
     Link 1: 25.781 GB/s
     Link 2: 25.781 GB/s
     Link 3: 25.781 GB/s
     Link 4: 25.781 GB/s
     Link 5: 25.781 GB/s
# similar output for the other 3 GPUs

iiSeymour commented 4 years ago

Yes, it looks like you are good!

The samples are optional; they are here if you are interested - https://github.com/NVIDIA/cuda-samples

menickname commented 4 years ago

Dear @iiSeymour

Bonito train is running perfectly on multi-GPU after installing the patched Bonito version. Please find the functional version from @boegel here: https://gist.github.com/boegel/e0c303497ba6275423b39ea1c10c7d73. This one works on our HPC GPU resources and fixes 2 of the aforementioned issues.

Still 2 questions:

  1. When increasing the batch size with multi-GPU resources, the memory consumption does indeed increase, but the speed does not. E.g. going from batch size 128 on 1 GPU to batch size 256 on 2 GPUs does not increase the speed of Bonito training.
  2. What would be required to get this running on multiple nodes? We have 10 nodes with 4 GPUs each available, so we could still increase the speed by going up to 8, 12, or 16 GPUs.

Thank you in advance.

iiSeymour commented 4 years ago

  1. How are you measuring speed?
  2. I was thinking about looking at RaySGD for multi-node.

menickname commented 4 years ago

  1. I am looking at how the progress goes: 1 GPU results in 1% progress per minute, and 2 GPUs result in 2% progress per minute, both with 659240 processed "parts" in total per "round". When running 2 GPUs with batch size 256 instead of 128, the speed stays at 2% progress per minute.
  2. Passed to @boegel.

boegel commented 4 years ago

For distributed training, I think @iiSeymour is planning to look into adding support for leveraging RaySGD from Bonito. I wish I had time to help out with that, but I have way too much on my plate already... If the necessary changes have been made to Bonito, I'm certainly willing to help test it though!

iiSeymour commented 4 years ago

Version 0.1.2 is on PyPI now and includes the multi-GPU patch along with the reworked solution for including the old scripts directory.

What was scripts/convert-data is now available as bonito convert, and bonito download has been added for handling pretrained models and training data; full details here.

DelphIONe commented 2 years ago

Hi, thanks for your tool :-) I am trying to train an RNA model... but Bonito only ships DNA models, so I used Taiyaki as a starting point with:

prepare_mapped_reads.py ../fast5/ params_per_read.tsv outputTayaki.fast5 ~/softs/taiyaki/models/mGru_flipflop.py refFromSam.fasta

Like @menickname, I have successfully generated the chunkify file. I think:

>>> import h5py
>>> f = h5py.File('outputTayaki.fast5', 'r')
>>> list(f.keys())
['Reads']

This is with Taiyaki v5.0.0 only, because with the latest version (with mLstm_flipflop.py, advised for Guppy compatibility), I got:

>>> list(f.keys())
['Batches', 'read_ids']

So, I'm happy :-) and I hope to be on the right track.

However, when running the bonito convert command I get the following message:

$ bonito -v
bonito 0.6.1

$ bonito convert --chunksize 1000000 outputTayaki.fast5 bonito_convert
/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py:76: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  reads = np.random.permutation(sorted(reads.items()))

preparing training chunks
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/xxx/miniconda3/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py", line 113, in main
    training_chunks = chunk_dataset(training, args.chunksize)
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py", line 70, in chunk_dataset
    chunks, targets = zip(*tqdm(take(all_chunks, num_chunks), total=num_chunks))
ValueError: not enough values to unpack (expected 2, got 0)

I have tried with --chunksize 100000 but I get the same error. Can you help me, please? Thank you in advance :-)

DelphIONe commented 2 years ago

Without the chunksize argument (the default is 3600...) it works! But I have a doubt: what is the impact of this chunksize parameter? Thanks for your answer.