Closed Francesco-Carlucci closed 2 months ago
To train a bonito model, running train_original.py with the default values and --window-size 2000 would be the same as what we did in the paper. Do NOT set --warmup-steps 0.
For the evaluation see the README.md; you don't have to concatenate all the fastqs, but you do have to concatenate the fasta files containing the reference for each read.
source venv3/bin/activate
python3 evaluate.py \
--basecalls-path demo/model1 \
--references-path demo/reference.fasta \
--model-name model1
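Concatenating the per-read reference fastas can be done with a short helper like the following. This is a minimal sketch (the function name and file handling are mine, not part of the repo):

```python
from pathlib import Path

def concatenate_fastas(fasta_paths, output_path):
    """Append the contents of each per-read reference fasta into one file."""
    with open(output_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text()
            out.write(text)
            if not text.endswith("\n"):
                out.write("\n")  # keep fasta records on separate lines
```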
Thank you very much for the prompt answer! I actually have a .fna file for each species in the genomes folder; they should be equivalent to fasta, shouldn't they? The basecalling script seems really slow, especially compared to training: it estimates almost 2 days on the wick test (138975 reads). Is that normal? I've used this command, for reference:
python ./scripts/basecall_original.py \
--model bonito \
--fast5-dir ./wick_test \
--checkpoint ./trained_model/checkpoints/checkpoint_182021.pt \
--output-file ./trained_model/wick_basecalls.fastq
Much appreciated, good evening
It is normal for basecalling to be slow because there is no multiprocessing. Making it faster is non-trivial, and since this is meant to be a research tool, not a production basecaller, I did not optimize for speed. One option to make it faster is to increase --batch-size (default: 64); this should help if reads are long.
Regarding the .fna file, that will currently not work, as the evaluation script expects a fasta file with the true reference for each sequence. But some people have also asked for the possibility to just pass the reference genome, so I will be adding that sometime this week.
Thanks again,
I've found the .fasta files, so the .fna support is not needed anymore for me, thanks all the same. I'll try to launch a basecalling process for each folder, with a higher batch size.
Another question: is tombo resquiggle needed for the wick dataset before applying data_prepare_numpy? I'm afraid I've misinterpreted the Data annotation section of the paper.
Best regards, have a good afternoon
Before data_prepare_numpy.py, the data has to be resquiggled using tombo; otherwise it is not possible to match the raw signal with the reference. If my memory serves right, the uploaded Wick dataset was already resquiggled, but I might be wrong here.
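One way to check whether a fast5 file has already been resquiggled is to look for tombo's corrected group under Analyses. A minimal sketch, assuming single-read fast5 files and h5py installed; RawGenomeCorrected_000 is tombo's default corrected-group name:

```python
def has_resquiggle_group(group_names):
    """True if any analysis group looks like a tombo resquiggle result."""
    return any(name.startswith("RawGenomeCorrected") for name in group_names)

def fast5_is_resquiggled(fast5_path):
    """Open a single-read fast5 and look for tombo's corrected group."""
    import h5py  # third-party: pip install h5py
    with h5py.File(fast5_path, "r") as f:
        analyses = f.get("Analyses")
        return analyses is not None and has_resquiggle_group(analyses.keys())
```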
Good afternoon, I've tried to train a bonito model but I failed to reproduce your results. In order, I've done:
epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
4,181000,98,1.3572734911441804,1.3543856143951416,1.3572734911441804,1.3543856143951416,0.05005340824414036,0.048987125811690554,1.2668449560482334e-05,no
4,181500,98,1.3592135450839997,1.3466198444366455,1.3592135450839997,1.3466198444366455,0.05179553941345434,0.06894475484809542,1.2243823054955195e-05,no
4,182000,98,1.3583633615970612,1.3630245923995972,1.3583633615970612,1.3630245923995972,0.041625096266821614,0.039073898001589814,1.1855894933927906e-05,no
5210_N125509_20170130_FN2002036049_MN19067_sequencing_run_flow_cell_022_Klebs_358_170_065_177_7F_014_19425_ch141_read466_strand,,,,,,,failedmapping (all remaining fields empty)
I don't know what the problem could be, so thank you in advance for any help or suggestions you could offer. The evaluate.py command works with the trained model you provided in another issue (inter2000); it doesn't say failedmapping.
Many Thanks!
Some things to check:
I attach the whole log file here: train.log
Yes, we have used data_prepare_numpy with window size 2000 to generate 2 .npz files. If I open the .npz files with numpy, there are 2 arrays, 'x' and 'y', with shapes respectively (2313687, 2000) for data_0.npz and (2329813, 2000) for data_1.npz.
I've also tried to train a bonito model on a smaller dataset, but it doesn't seem to change anything.
I hope this helps, thanks!
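A quick sanity check on the generated .npz files could look like this. A sketch assuming the 'x'/'y' layout described above (and that the number of label rows matches the number of signal chunks); numpy required:

```python
import numpy as np

def summarize_npz(path):
    """Return the name and shape of every array in a data_prepare_numpy output file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}

def check_chunks(path, window_size=2000):
    """Verify that signal chunks ('x') and labels ('y') agree in shape."""
    shapes = summarize_npz(path)
    assert shapes["x"][1] == window_size, "unexpected window size"
    assert shapes["x"][0] == shapes["y"][0], "x/y sample counts differ"
    return shapes
```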
So it looks like your model is not learning. Here are the first few lines of the training log for essentially the same model you are trying to train.
epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,117,1.4382339773292772,1.4179329872131348,1.4382339773292772,1.4179329872131348,0.274332847021557,0.29803819757947747,9.98e-05,no
0,1000,120,1.136046401500702,0.9155606627464294,1.136046401500702,0.9155606627464294,0.5761952086294764,0.6015458366654789,0.0001998,no
0,1500,115,0.8061170952320099,0.6897753477096558,0.8061170952320099,0.6897753477096558,0.6473707248306947,0.6508228569068903,0.0002998,no
0,2000,112,0.6243208237886428,0.6236732602119446,0.6243208237886428,0.6236732602119446,0.7261366617975064,0.722776950226298,0.0003998,no
0,2500,117,0.5819001960158348,0.4776695966720581,0.5819001960158348,0.4776695966720581,0.7595367815283331,0.757856846671066,0.0004998,no
0,3000,117,0.503957610487938,0.448106050491333,0.503957610487938,0.448106050491333,0.7524687553548679,0.7726426229263375,0.0005998,no
0,3500,115,0.4649578145742416,0.5275828242301941,0.4649578145742416,0.5275828242301941,0.7585764921696556,0.7679873361488185,0.0006998,no
0,4000,115,0.4495013102889061,0.4335578382015228,0.4495013102889061,0.4335578382015228,0.7955427081098727,0.774489682259517,0.0007997999999999999,no
The CRF loss starts at similar values for both, but yours does not go down. The learning rates are equivalent, so it's a bit puzzling what is happening.
You could check if the numpy arrays you prepared are correct by trying to basecall them using the trained model. A quick way to do it would be to run train_original.py with the learning rate set to 0 (--starting-lr 0), so the model is not updated, and with the checkpoint of the already trained model (--checkpoint TRAINED.chk).
Good afternoon, I've made this test on a smaller dataset (just one folder of the wick train set) that I converted to npz with the same command as the previous one, and these are the first lines of the training log:
epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,86,1.4717624717819429,1.4037187099456787,1.4717624717819429,1.4037187099456787,0.22720244389637168,0.23271096327291818,1.1106462142156984e-05,no
0,1000,94,1.401106917142868,1.8368217945098877,1.401106917142868,1.8368217945098877,0.14873614312191014,0.2047222034156143,1.1106462142156984e-05,no
0,1500,97,1.4049187166690826,1.4061570167541504,1.4049187166690826,1.4061570167541504,0.1274801069221054,0.11124629518294676,1.1106462142156984e-05,no
0,2000,96,1.399462033033371,1.3520772457122803,1.399462033033371,1.3520772457122803,0.08787595051639996,0.09374311198525079,1.1106462142156984e-05,no
0,2500,95,1.3939985973834992,1.6071970462799072,1.3939985973834992,1.6071970462799072,0.08318666725657548,0.06499962891499195,1.1106462142156984e-05,no
0,3000,97,1.404309557914734,1.3702248334884644,1.404309557914734,1.3702248334884644,0.07229826622585915,0.06453106644041247,1.1106462142156984e-05,no
0,3500,100,1.3913548927307129,1.3831517696380615,1.3913548927307129,1.3831517696380615,0.037177749471255926,0.03497691597106505,1.1106462142156984e-05,no
0,4000,98,1.4039770028591156,1.3660919666290283,1.4039770028591156,1.3660919666290283,0.06742730838160624,0.055966194695776725,1.1106462142156984e-05,no
0,4500,100,1.392589928150177,1.3884905576705933,1.392589928150177,1.3884905576705933,0.05138375549009372,0.04264943436864671,1.1106462142156984e-05,no
0,5000,96,1.3949994075298309,1.3609488010406494,1.3949994075298309,1.3609488010406494,0.047968022968022966,0.031926847570115546,1.1106462142156984e-05,no
I've put --starting-lr 0, but it loaded the lr_scheduler_state from the checkpoint, so it started from its last lr, 0.000011.
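If the restored scheduler state keeps overriding --starting-lr, one workaround is to force the learning rate to zero on the optimizer after the checkpoint is loaded. A minimal sketch, assuming a PyTorch-style optimizer exposing param_groups (the helper name is mine, not part of the repo):

```python
def zero_learning_rate(optimizer):
    """Force lr to 0 in every param group, overriding any value restored
    from a checkpoint's optimizer/scheduler state."""
    for group in optimizer.param_groups:
        group["lr"] = 0.0
    return optimizer
```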
The accuracy here is not always 0, so something changed, but it starts generating strange values after some steps. I don't see any difference between the architectures of the two models (the checkpoint and the one trained from scratch) if I print them.
To be more precise, I'm adding how I obtained the data. I selected the Acinetobacter_baumannii_AYP-A2 folder and applied the following command to all of its fast5 files:
python ../basecalling_architectures/scripts/data_prepare_numpy.py --fast5-dir ./all_fast5 --output-dir ./train_numpy --total-files 21 --window-size 2000 --window-slide 0 --n-cores 22 --verbose
Thanks!
Good afternoon, sorry to bother you again;
In order to reproduce these results, could you send me the .npz files you used for training on the cross-species task? That way I can compare them with my files, make sure the training script works, and go on training my custom model.
Alternatively, could you indicate a precise set of commands, or maybe a notebook, to do exactly what you did? That would really help me understand what went wrong and how to properly use the code.
Thanks very much in advance
I have tried to train a model with only a couple of my .npz files and got this result:
epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,15,1.392632,1.275242,1.392632,1.275242,0.336307,0.339942,0.0001,no
0,1000,22,0.899456,0.646763,0.899456,0.646763,0.690396,0.678832,0.0002,no
0,1500,29,0.560034,0.486229,0.560034,0.486229,0.7664,0.762412,0.0003,no
0,2000,36,0.448481,0.448585,0.448481,0.448585,0.779204,0.772098,0.0004,no
I used the following command:
python train_original.py --data-dir tmp --output-dir tmp2 --model bonito --window-size 2000 --task inter
Here are the 2 npz files that I used: https://surfdrive.surf.nl/files/index.php/s/LD1JkLA9HIhrPGO
Can you check whether this works on your end?
Hello, I've tried with your files and train_original.py is indeed training; here is the first epoch:
epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,79,1.418561204401907,1.3640797138214111,1.418561204401907,1.3640797138214111,0.0,0.0,9.98e-05,no
0,1000,67,1.321640213727951,1.1770823001861572,1.321640213727951,1.1770823001861572,0.4703033055559377,0.45930515386962445,0.0001998,no
0,1500,72,0.81368168592453,0.656623363494873,0.81368168592453,0.656623363494873,0.6658332757249669,0.6912611148688914,0.0002998,no
0,2000,72,0.5636749601364136,0.521649956703186,0.5636749601364136,0.521649956703186,0.7912311252721519,0.7782094949588408,0.0003998,no
0,2500,69,0.5132035675048828,0.4690845310688019,0.5132035675048828,0.4690845310688019,0.8534333877185656,0.7625201221856244,0.0004998,no
0,3000,72,0.46973459190130235,0.41742467880249023,0.46973459190130235,0.41742467880249023,0.7664896481276602,0.8089638573429603,0.0005998,no
0,3500,80,0.43957809001207354,0.3311671316623688,0.43957809001207354,0.3311671316623688,0.8021788776927748,0.8418767667620856,0.0006998,no
0,4000,69,0.41905938744544985,0.3828888535499573,0.41905938744544985,0.3828888535499573,0.8217322258216366,0.819314819986863,0.0007997999999999999,no
0,4500,71,0.4135214169025421,0.383678674697876,0.4135214169025421,0.383678674697876,0.8361298315602466,0.8113390903338029,0.0008998000000000001,no
0,5000,81,0.40206520301103593,0.377856969833374,0.40206520301103593,0.377856969833374,0.8532733647633087,0.8038791078457247,0.0009998000000000001,no
0,5500,71,0.3904891494512558,0.4950909912586212,0.3904891494512558,0.4950909912586212,0.7920281368193258,0.7489867524663388,0.000999559959961458,no
0,6000,69,0.3752949371933937,0.2973739206790924,0.3752949371933937,0.2973739206790924,0.8790456056786631,0.864681271093211,0.0009982335535517004,no
0,6500,75,0.36834737369418147,0.4803057909011841,0.36834737369418147,0.4803057909011841,0.832926062916022,0.7483230317715633,0.0009960231509199102,no
0,7000,71,0.3561938584148884,0.3697338402271271,0.3561938584148884,0.3697338402271271,0.8476541952160295,0.81979966603449,0.000992932713652975,no
Are these .npz files made from data of the wick dataset, in particular the train part downloaded with download_wick.sh? Because if train_original.py works, I've either downloaded the wrong data or made an error in converting with data_prepare_numpy.
Could you write me the data_prepare_numpy command you used? I've used it like this:
python ../basecalling_architectures/scripts/data_prepare_numpy.py --fast5-dir ./all_fast5 --output-dir ./train_numpy --total-files 21 --window-size 2000 --window-slide 0 --n-cores 22 --verbose
Finally, my train_original.py script has no --task option; are the training parameters different between tasks?
Thanks a lot! Good afternoon
These npz files are from the Wick dataset; it's only 2 out of 100 since they are quite big.
I used this to generate them:
python data_prepare_numpy.py \
--fast5-list inter_task_train_reads.txt \
--output-dir output_dir \
--total-files 100 \
--window-size 2000 \
--window-slide 0 \
--n-cores 10
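For anyone reproducing this: the file passed to --fast5-list is just a text file with one fast5 path per line, which can be generated with a few lines of Python (a sketch; the helper name and paths are illustrative):

```python
from pathlib import Path

def write_fast5_list(fast5_dir, list_path):
    """Collect every .fast5 under fast5_dir into a text file, one path per line."""
    paths = sorted(str(p) for p in Path(fast5_dir).rglob("*.fast5"))
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths
```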
Do not worry about the --task option; it's just a parameter I used to rename the output directory, it has no influence on the training at all.
Since you mentioned that you did not resquiggle the data, I would do that; perhaps try it first on one of the species only.
Good afternoon,
I've made some other tests and I can now confirm that the problem was in fact the resquiggle. The wick dataset seems resquiggled (tombo says so), but in reality another resquiggle with the --overwrite option is needed, using the .fasta files for the folders of the train part of wick and the .fna files for the test part. Also, generating multiple .npz files is beneficial, so that the validation split is more accurate.
I've now managed to perform the whole pipeline, and the bonito model I've trained seems comparable with the inter-2000 checkpoint you had provided in another issue.
Thanks a lot for everything, have a nice day!
Hello @Francesco-Carlucci, could you let us know how you ascertained that the data was already resquiggled? We are trying to annotate the train/Acinetobacter_ursingii_MINF_9C genome specifically, and running into dead ends. What order of tombo commands did you have to run to get the fasta information into the fast5 files before chunking them? Did you have to call some tombo preprocess command? We specifically get the tombo error "Reads do not contain basecalls", which would seem to indicate that the fast5 files haven't been resquiggled.
Indeed, that indicates that you first have to add the basecall info into the fast5 files before being able to resquiggle the data. See https://nanoporetech.github.io/tombo/resquiggle.html.
In essence, you first have to run tombo preprocess annotate_raw_with_fastqs
In order to be able to run that command we need the fastq files, which are not available for the wick train data. There are only read_references.fasta files, which cannot be used for annotation. We used the download scripts from the nanopore_benchmark repo to download everything. Could you provide a link to the fastq files you used for the wick train portion? If you don't have any, how did you manage to annotate, resquiggle, and then chunk them?
I believe Wick's data was already resquiggled. Hence no fastq files were necessary. You can try to chunk it, if it's not resquiggled you should get an error.
If it's not resquiggled, you can basecall the data yourself, using https://github.com/nanoporetech/dorado for example.
If I recall correctly, on the wick dataset another resquiggle is needed, using as reference the .fasta files for the folders of the train set and the .fna files for the test set. The command for tombo 1.5.1 is the one described in the other repo, basecalling_architectures:
tombo resquiggle
The --overwrite option is needed.
I can't remember if Acinetobacter ursingii had fastq, fasta, or fna files, but I had made a script to resquiggle the whole dataset with tombo using the correct reference file; it's "src/sbonito/apply_tombo.py" in this repository:
https://github.com/MLinApp-polito/mla-prj-23-mla-prj-11-gu3/tree/main
Hope this helps you, Have a nice day!
@Francesco-Carlucci Thank you for your reply. I tried performing the resquiggling with the fasta files for the wick train set, but it didn't work. The link you sent me is not public. Could you provide the script?
Thank you again, have a nice day!
What the script does is just run the following command:
tombo resquiggle
using as reference_file the read_references.fasta file if it is present in the species directory, otherwise the species .fna file in genomes/. I hope this works. Do you obtain the "Reads do not contain basecalls" error when you chunk it with data_prepare_numpy?
This repository is public and it's the same: https://github.com/Francesco-Carlucci/GU03-bioinspired-basecaller/blob/main/sbonito/apply_tombo.py
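For context, the per-species logic described above can be sketched roughly as follows. This is an approximation based on this thread, not the actual apply_tombo.py; the directory layout and the fallback naming are assumptions:

```python
import subprocess
from pathlib import Path

def choose_reference(species_dir, genomes_dir):
    """Prefer the per-read read_references.fasta inside the species folder,
    otherwise fall back to the species genome (.fna) in genomes/."""
    local_ref = Path(species_dir) / "read_references.fasta"
    if local_ref.exists():
        return local_ref
    return Path(genomes_dir) / (Path(species_dir).name + ".fna")

def resquiggle_species(species_dir, genomes_dir, n_cores=4):
    """Run tombo resquiggle with --overwrite on one species folder."""
    reference = choose_reference(species_dir, genomes_dir)
    subprocess.run(
        ["tombo", "resquiggle", str(species_dir), str(reference),
         "--processes", str(n_cores), "--overwrite"],
        check=True,
    )
```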
Good afternoon, I'm trying to compare a custom basecaller architecture against the bonito results reported in your paper.
I would like to ask how you trained the model properly_trained_bonito: did you use the script train_original.py with the default values? Should I set --warmup-steps to 0 because bonito is a recurrent model?
I'm trying to test this model on the wick dataset, which has a folder for each species. If I launch the basecall_original.py script on all the fast5 files of the test part, how can I use the evaluate command afterwards? Should I concatenate all the fastqs too?
Many Thanks!