How to train a new model

yipukangda commented 4 years ago

Hi,

I have submitted an issue on HELEN repo ( here is the link ), and been suggested to here, I have tried pepper with bonito_microbial model, the result is not bad (median accuracy is about 98%), but maybe it could be better if I train a new model with data generated in out lab, so I come here for a help.

Thanks

kishwarshafin commented 4 years ago

Thanks for opening this issue @yipukangda.

As you are using microbial model, I guess you have microbial data? You need these files: 1) a FASTA file containing a draft assembly (possibly Shasta) 2) a FASTA file containing a reference assembly (truth for the microbial species) 3) a FASTQ file containing your reads.

These are the steps you need to follow next: Install pomoxis (https://github.com/nanoporetech/pomoxis) and run this command to align your draft assembly to the reference assembly.

mini_align -c 100000 -P -m \
-r assembly.fasta \
-i reference.fasta \
-t 33 \
-p truth_2_assembly.bam

And generate an alignment between reads to the assembly using minimap2:

minimap2 -ax map-ont -t 32 assembly.fasta reads.fastq | samtools view -hb -F 0x904 > unsorted.bam
samtools sort -@32 -o reads_2_assembly.bam unsorted.bam
samtools index -@32 reads_2_assembly.bam

Once done, please generate training images like this:

pepper_train make_train_images \
--bam reads_2_assembly.bam \
--fasta  assembly.fasta \
--region contig:start-end \
--truth_bam truth_2_assembly.bam \
--output_dir /path/to/train_images/ \
--threads 32

pepper_train make_train_images \
--bam reads_2_assembly.bam \
--fasta  assembly.fasta \
--region contig:start-end \
--truth_bam truth_2_assembly.bam \
--output_dir /path/to/test_images/ \
--threads 32

You can use a train_regions.bed and test_regions.bed and provide it via --region_bed instead of using --region depending on your preference of train-test split. If you don't provide --region, it will generate training images from the entire genome.

pepper_train train_model \
--train_image_dir /path/to/train_images/ \
--test_image_dir /path/to/test_images/ \
--output_dir /model_out/ \
--batch_size 64 \
--epoch_size 100 \
--num_workers 4 \
--gpu

I'm not sure if you'll face issues if you don't give --gpu parameter as I haven't trained on CPUs in a while. If you do, please let me know and I'll add some patch for you.

yipukangda commented 4 years ago

@kishwarshafin Thanks so much, we have some microbe genome sequencing date at present how many train genomes should I have? Will a 50X E.Coli genome data be enough?

In addition, the read length now is not very long (median is ~5kb) and I do not get very well assembly result from Shasta (several hundred contigs), so I used flye instead (3 contigs).

I will try follow this instruction and feed back you soon as we have any results.

kishwarshafin commented 4 years ago

Hi @yipukangda ,

Flye wouldn't be a problem in this case. Any assembler would work.

"how many train genomes should I have? Will a 50X E.Coli genome data be enough?" It's up to your experimental design. If you break the E. Coli genome to train-test regions, you can train a model with 50x data. I have mostly worked on human data, so I can't comment on how much that would affect the model, but I don't think it will.

One way to increase robustness is to use subsampled sets of the BAM. So downsample the bam to 40x, 30x, 20x, 15x, put all the training images in one directory and train a model on that. That should give you a robust model.

yipukangda commented 4 years ago

Hi @kishwarshafin ,

I have trained a model with E.Coli data (with downsample 80, 60, 40 %, respectively, split region of train: test as 9:1), it seems will on E.Colli data set(~1% acc increasing), but not as well on Staphylococcus aureus data(nearly no acc increasing), is it nessicery to train the model with more bacterial spices data to get an more general model, or try other training strategies?

Another thing, I set epoch to 100, but the process stopped in 14th epoch, does it mean early stop occur?

Thanks.

kishwarshafin commented 4 years ago

@yipukangda ,

I think the issue is not enough data. If I'm correct then the homopolymer distribution between E. Coli and Staphylococcus is significantly different. For microbial data, the only usable model I generated used 5/6 species from the Zymo dataset. We had to plot homopolymer distribution and a few other things to pick the best set of training data. Also, 14 epochs isn't that much for microbial data. The early stopping is by default turned off, only occurs when we do hyperparameter training which is not available in this codebase. Can you please paste the log here so I can see why it crashed?

yipukangda commented 4 years ago

@kishwarshafin ,

I repeat training and this time it finish after 100 epoch, I will check homopolymer error and prepare to buy more microbes for training, and show you the result as soon.

Thanks.

kishwarshafin commented 4 years ago

@yipukangda ,

Going to close this issue. Please re-open if you have other questions.

yipukangda commented 3 years ago

Hi @kishwarshafin , Is polish model of pepper still valuable? It has been a long time for our lab to optimize the biochemistry system to get enough sequencing data.

kishwarshafin / pepper

How to train a new model #12