marcpaga / nanopore_benchmark


Properly trained bonito evaluation #3

Closed Francesco-Carlucci closed 2 months ago

Francesco-Carlucci commented 1 year ago

Good afternoon, I'm trying to compare a custom basecaller architecture against the bonito results reported in your paper.

I would like to ask how you trained the properly_trained_bonito model: did you use the "train_original.py" script with the default values? Should I set --warmup-steps to 0 because bonito is a recurrent model?

I'm trying to test this model on the Wick dataset, which has a folder for each species. If I launch the basecall_original.py script on all the fast5 files of the test part, how can I use the evaluate command afterwards? Should I concatenate all the fastq files too?

Many Thanks!

marcpaga commented 1 year ago

To train a bonito model, running train_original.py with the default values and --window-size 2000 would be the same as what we did in the paper. Do NOT set --warmup-steps to 0.

For the evaluation, see the README.md; you don't have to concatenate all the fastqs. You do have to concatenate the fastas containing the reference for each read.
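For example, a minimal sketch of how that concatenation could look, assuming one read_references.fasta per species folder (the paths here are placeholders, not fixed repo paths):

# gather the per-species read references into a single fasta for evaluate.py
cat wick_test/*/read_references.fasta > demo/reference.fasta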

source venv3/bin/activate

python3 evaluate.py \
--basecalls-path demo/model1 \
--references-path demo/reference.fasta \
--model-name model1

Francesco-Carlucci commented 1 year ago

Thank you very much for the prompt answer! I actually have a .fna file for each species in the genomes folder; they should be equivalent to fasta, shouldn't they? The basecalling script seems really slow, especially compared to training: it estimates almost 2 days on the Wick test set (138975 reads). Is that normal? I've used this command, for reference:

python ./scripts/basecall_original.py \
--model bonito \
--fast5-dir ./wick_test \
--checkpoint ./trained_model/checkpoints/checkpoint_182021.pt \
--output-file ./trained_model/wick_basecalls.fastq

Much appreciated, good evening

marcpaga commented 1 year ago

It is normal for basecalling to be slow because there is no multiprocessing. Making it faster is non-trivial, and since this is meant to be a research tool, not a production basecaller, I did not optimize for speed. Some options to make it faster are increasing the batch size and splitting the reads so that several basecalling processes run in parallel (for example, one per species folder).

Regarding the .fna file, that will currently not work, as the evaluation script expects a fasta file with the true reference for each sequence. But some people also asked for the possibility to just pass the reference genome, so I will be adding that sometime this week.

Francesco-Carlucci commented 1 year ago

Thanks again,

I've found the .fasta files, so the .fna support is not needed anymore for me, thanks all the same. I'll try to launch a basecalling process for each folder, with a higher batch size.
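Something like this rough sketch is what I have in mind (the loop and the per-folder output name are just an illustration; the other paths are the ones from my earlier command):

# one basecalling process per species folder, run in parallel
for d in ./wick_test/*/ ; do
  name=$(basename "$d")
  python ./scripts/basecall_original.py --model bonito \
    --fast5-dir "$d" \
    --checkpoint ./trained_model/checkpoints/checkpoint_182021.pt \
    --output-file ./trained_model/wick_basecalls_"$name".fastq &
done
wait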

Another question: is tombo resquiggle needed for the Wick dataset before applying data_prepare_numpy? I'm afraid I've misinterpreted the Data annotation section of the paper.

Best regards, have a good afternoon

marcpaga commented 1 year ago

Before data_prepare_numpy.py the data has to be resquiggled using tombo. Otherwise it is not possible to match the raw signal with the reference. If my memory serves right, I think the uploaded Wick dataset was already resquiggled, but I might be wrong here.

Francesco-Carlucci commented 1 year ago

Good afternoon, I've tried to train a bonito model but I failed to reproduce your results. In order, I've done the following:

  1. I've downloaded the Wick dataset, using the download_wick.sh script.
  2. I haven't performed a resquiggle.
  3. I have converted the training set (as in links_wick_data_train.txt) to 2 numpy format files using data_prepare_numpy.
  4. I have used train_original to train a bonito model with the default parameters and --window-size 2000, as you said above.
  5. Unfortunately the train log has strange values for the accuracies; I report the last lines below:

epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
4,181000,98,1.3572734911441804,1.3543856143951416,1.3572734911441804,1.3543856143951416,0.05005340824414036,0.048987125811690554,1.2668449560482334e-05,no
4,181500,98,1.3592135450839997,1.3466198444366455,1.3592135450839997,1.3466198444366455,0.05179553941345434,0.06894475484809542,1.2243823054955195e-05,no
4,182000,98,1.3583633615970612,1.3630245923995972,1.3583633615970612,1.3630245923995972,0.041625096266821614,0.039073898001589814,1.1855894933927906e-05,no

  6. I've performed the basecalling using basecall_original and the trained model.
  7. I've used evaluate.py on the resulting fastq, but in the csv almost all reads show "failedmapping", for example (all remaining columns of the row are empty):

5210_N125509_20170130_FN2002036049_MN19067_sequencing_run_flow_cell_022_Klebs_358_170_065_177_7F_014_19425_ch141_read466_strand,,,,,,,failedmapping

I don't know what the problem could be, so thank you in advance for any help or suggestions you could offer. The evaluate.py command works with the trained model you provided in another issue (inter2000); with that one it doesn't say failedmapping.

Many Thanks!

marcpaga commented 1 year ago

Some things to check: could you attach the full training log? Did you prepare the data with data_prepare_numpy using the same window size (2000)? How many .npz files did you generate?

Francesco-Carlucci commented 1 year ago

I attach the whole log file here: train.log

Yes, we have used data_prepare_numpy with window size 2000 to generate 2 .npz files. If I open the .npz files with numpy there are 2 arrays, 'x' and 'y', with shapes (2313687, 2000) for data_0.npz and (2329813, 2000) for data_1.npz respectively.
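(For reference, this is roughly how I checked the shapes; just a one-off command, not a script from the repo:)

python -c "import numpy as np; d = np.load('data_0.npz'); print(d['x'].shape, d['y'].shape)"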

I've also tried to train a bonito model on a smaller dataset, but it doesn't seem to change anything.

I hope this helps, thanks!

marcpaga commented 1 year ago

So it looks like your model is not learning. Here are the first few lines of the training log of the model I trained, which is essentially the same model you are trying to train.

epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,117,1.4382339773292772,1.4179329872131348,1.4382339773292772,1.4179329872131348,0.274332847021557,0.29803819757947747,9.98e-05,no
0,1000,120,1.136046401500702,0.9155606627464294,1.136046401500702,0.9155606627464294,0.5761952086294764,0.6015458366654789,0.0001998,no
0,1500,115,0.8061170952320099,0.6897753477096558,0.8061170952320099,0.6897753477096558,0.6473707248306947,0.6508228569068903,0.0002998,no
0,2000,112,0.6243208237886428,0.6236732602119446,0.6243208237886428,0.6236732602119446,0.7261366617975064,0.722776950226298,0.0003998,no
0,2500,117,0.5819001960158348,0.4776695966720581,0.5819001960158348,0.4776695966720581,0.7595367815283331,0.757856846671066,0.0004998,no
0,3000,117,0.503957610487938,0.448106050491333,0.503957610487938,0.448106050491333,0.7524687553548679,0.7726426229263375,0.0005998,no
0,3500,115,0.4649578145742416,0.5275828242301941,0.4649578145742416,0.5275828242301941,0.7585764921696556,0.7679873361488185,0.0006998,no
0,4000,115,0.4495013102889061,0.4335578382015228,0.4495013102889061,0.4335578382015228,0.7955427081098727,0.774489682259517,0.0007997999999999999,no

The CRF loss starts at similar values for both, but yours does not go down. The learning rates are equivalent, so it's a bit puzzling what is happening.

You could check whether the numpy arrays you prepared are correct by trying to basecall them using the trained model. A quick way to do it would be to run train_original.py with the learning rate set to 0 (--starting-lr 0), so the model is not updated, and load the checkpoint of the already trained model (--checkpoint TRAINED.chk).
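Roughly something like this (the directory names are just placeholders for wherever your npz files and output should go):

python train_original.py \
--data-dir your_npz_dir \
--output-dir sanity_check_output \
--model bonito \
--window-size 2000 \
--starting-lr 0 \
--checkpoint TRAINED.chk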

Francesco-Carlucci commented 1 year ago

Good afternoon, I've made this test on a smaller dataset (just one folder of the Wick train set) that I converted to npz with the same command as before, and these are the first lines of the training log:

epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,86,1.4717624717819429,1.4037187099456787,1.4717624717819429,1.4037187099456787,0.22720244389637168,0.23271096327291818,1.1106462142156984e-05,no
0,1000,94,1.401106917142868,1.8368217945098877,1.401106917142868,1.8368217945098877,0.14873614312191014,0.2047222034156143,1.1106462142156984e-05,no
0,1500,97,1.4049187166690826,1.4061570167541504,1.4049187166690826,1.4061570167541504,0.1274801069221054,0.11124629518294676,1.1106462142156984e-05,no
0,2000,96,1.399462033033371,1.3520772457122803,1.399462033033371,1.3520772457122803,0.08787595051639996,0.09374311198525079,1.1106462142156984e-05,no
0,2500,95,1.3939985973834992,1.6071970462799072,1.3939985973834992,1.6071970462799072,0.08318666725657548,0.06499962891499195,1.1106462142156984e-05,no
0,3000,97,1.404309557914734,1.3702248334884644,1.404309557914734,1.3702248334884644,0.07229826622585915,0.06453106644041247,1.1106462142156984e-05,no
0,3500,100,1.3913548927307129,1.3831517696380615,1.3913548927307129,1.3831517696380615,0.037177749471255926,0.03497691597106505,1.1106462142156984e-05,no
0,4000,98,1.4039770028591156,1.3660919666290283,1.4039770028591156,1.3660919666290283,0.06742730838160624,0.055966194695776725,1.1106462142156984e-05,no
0,4500,100,1.392589928150177,1.3884905576705933,1.392589928150177,1.3884905576705933,0.05138375549009372,0.04264943436864671,1.1106462142156984e-05,no
0,5000,96,1.3949994075298309,1.3609488010406494,1.3949994075298309,1.3609488010406494,0.047968022968022966,0.031926847570115546,1.1106462142156984e-05,no

I've set --starting-lr 0, but it loaded the lr_scheduler_state from the checkpoint, so it started from its last lr, 0.000011.

The accuracy here is not always 0, so something changed, but it starts generating strange values after some steps. I don't see any difference between the architectures of the two models (the checkpoint and the one trained from scratch) if I print them.

To be more precise, I'm adding how I obtained the data. I selected the Acinetobacter_baumannii_AYP-A2 folder and applied the following command to all its fast5 files:

python ../basecalling_architectures/scripts/data_prepare_numpy.py --fast5-dir ./all_fast5 --output-dir ./train_numpy --total-files 21 --window-size 2000 --window-slide 0 --n-cores 22 --verbose

Thanks!

Francesco-Carlucci commented 1 year ago

Good afternoon, sorry to bother you again.

In order to reproduce these results, could you send me the .npz files you used for training on the cross-species task? That way I can compare them with my files, make sure the training script works, and go on training my custom model.

Alternatively, could you indicate a precise set of commands, or maybe a notebook, to do exactly what you did? That would really help me understand what went wrong and how to properly use the code.

Thanks very much in advance

marcpaga commented 1 year ago

I have tried to train a model with only a couple of my .npz files and got this result:

epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,15,1.392632,1.275242,1.392632,1.275242,0.336307,0.339942,0.0001,no
0,1000,22,0.899456,0.646763,0.899456,0.646763,0.690396,0.678832,0.0002,no
0,1500,29,0.560034,0.486229,0.560034,0.486229,0.7664,0.762412,0.0003,no
0,2000,36,0.448481,0.448585,0.448481,0.448585,0.779204,0.772098,0.0004,no

I used the following command:

python train_original.py --data-dir tmp --output-dir tmp2 --model bonito --window-size 2000 --task inter

Here are the 2 npz files that I used: https://surfdrive.surf.nl/files/index.php/s/LD1JkLA9HIhrPGO

Can you check whether this works on your end?

Francesco-Carlucci commented 1 year ago

Hello, I've tried with your files and train_original.py is indeed training; here is the first epoch:

epoch,step,time,loss.global.train,loss.global.val,loss.crf.train,loss.crf.val,metric.accuracy.train,metric.accuracy.val,lr,checkpoint
0,500,79,1.418561204401907,1.3640797138214111,1.418561204401907,1.3640797138214111,0.0,0.0,9.98e-05,no
0,1000,67,1.321640213727951,1.1770823001861572,1.321640213727951,1.1770823001861572,0.4703033055559377,0.45930515386962445,0.0001998,no
0,1500,72,0.81368168592453,0.656623363494873,0.81368168592453,0.656623363494873,0.6658332757249669,0.6912611148688914,0.0002998,no
0,2000,72,0.5636749601364136,0.521649956703186,0.5636749601364136,0.521649956703186,0.7912311252721519,0.7782094949588408,0.0003998,no
0,2500,69,0.5132035675048828,0.4690845310688019,0.5132035675048828,0.4690845310688019,0.8534333877185656,0.7625201221856244,0.0004998,no
0,3000,72,0.46973459190130235,0.41742467880249023,0.46973459190130235,0.41742467880249023,0.7664896481276602,0.8089638573429603,0.0005998,no
0,3500,80,0.43957809001207354,0.3311671316623688,0.43957809001207354,0.3311671316623688,0.8021788776927748,0.8418767667620856,0.0006998,no
0,4000,69,0.41905938744544985,0.3828888535499573,0.41905938744544985,0.3828888535499573,0.8217322258216366,0.819314819986863,0.0007997999999999999,no
0,4500,71,0.4135214169025421,0.383678674697876,0.4135214169025421,0.383678674697876,0.8361298315602466,0.8113390903338029,0.0008998000000000001,no
0,5000,81,0.40206520301103593,0.377856969833374,0.40206520301103593,0.377856969833374,0.8532733647633087,0.8038791078457247,0.0009998000000000001,no
0,5500,71,0.3904891494512558,0.4950909912586212,0.3904891494512558,0.4950909912586212,0.7920281368193258,0.7489867524663388,0.000999559959961458,no
0,6000,69,0.3752949371933937,0.2973739206790924,0.3752949371933937,0.2973739206790924,0.8790456056786631,0.864681271093211,0.0009982335535517004,no
0,6500,75,0.36834737369418147,0.4803057909011841,0.36834737369418147,0.4803057909011841,0.832926062916022,0.7483230317715633,0.0009960231509199102,no
0,7000,71,0.3561938584148884,0.3697338402271271,0.3561938584148884,0.3697338402271271,0.8476541952160295,0.81979966603449,0.000992932713652975,no

Are these .npz files made from data of the Wick dataset, in particular the train part downloaded with download_wick.sh? Because if train_original works, I've either downloaded the wrong data or made an error in converting with data_prepare_numpy.

Could you write me the data_prepare_numpy command you used? I've used it like this:

python ../basecalling_architectures/scripts/data_prepare_numpy.py --fast5-dir ./all_fast5 --output-dir ./train_numpy --total-files 21 --window-size 2000 --window-slide 0 --n-cores 22 --verbose

Finally, my train_original.py script has no --task option; are the training parameters different between tasks?

Thanks a lot! Good afternoon

marcpaga commented 1 year ago

These npz files are from the Wick dataset; they are only 2 of the 100 since they are quite big.

I used this to generate them:

python data_prepare_numpy.py \
--fast5-list  inter_task_train_reads.txt \
--output-dir  output_dir \
--total-files  100 \
--window-size 2000 \
--window-slide 0 \
--n-cores 10

Do not worry about the --task option; it's just a parameter I used to rename the output directory, and it has no influence on the training at all.

Since you mentioned that you did not resquiggle the data, I would do that; perhaps try it first on one of the species only.

Francesco-Carlucci commented 1 year ago

Good afternoon,

I've made some other tests and I can now confirm that the problem was in fact the resquiggle. The Wick dataset seems resquiggled (tombo says so), but in reality another resquiggle with the --overwrite option is needed, using the .fasta files for the folders of the train part of Wick and the .fna files for the test part. Generating multiple .npz files is also beneficial, so the validation split is more accurate.

I've now managed to run the whole pipeline, and the bonito model I've trained seems comparable with the inter-2000 checkpoint you provided in another issue.

Thanks a lot for everything, have a nice day!

riseldakodra commented 5 months ago

Hello @Francesco-Carlucci, could you let us know how you ascertained that the data was already resquiggled? We are trying to annotate the train/Acinetobacter_ursingii_MINF_9C genome specifically, and running into dead ends. What order of tombo commands did you have to run to get the fasta information into the fast5 files before chunking them? Did you have to call some tombo preprocess command? We specifically get the tombo error "Reads do not contain basecalls", which would seem to indicate that the fast5 files haven't been resquiggled.

marcpaga commented 5 months ago

Indeed, that indicates that you first have to add the basecall info into the fast5 files before being able to resquiggle the data. See https://nanoporetech.github.io/tombo/resquiggle.html.

In essence, you first have to run tombo preprocess annotate_raw_with_fastqs.
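Roughly like this (the paths are placeholders; check the tombo docs linked above for the exact options):

tombo preprocess annotate_raw_with_fastqs \
--fast5-basedir <fast5_dir> \
--fastq-filenames <reads.fastq> \
--processes <#cores>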

riseldakodra commented 5 months ago

In order to run that command we need the fastq files, which are not available for the Wick train data. There are only read_references.fasta files, which cannot be used for annotation. We used the download scripts from the nanopore_benchmark repo to download everything. Could you provide us with a link to the fastq files you used for the Wick train portion? If you don't have any, how did you manage to annotate, resquiggle and then chunk them?

marcpaga commented 5 months ago

I believe Wick's data was already resquiggled, hence no fastq files were necessary. You can try to chunk it; if it's not resquiggled you should get an error.

If it's not resquiggled, you can basecall the data yourself, using https://github.com/nanoporetech/dorado for example.

Francesco-Carlucci commented 5 months ago

If I recall correctly, on the Wick dataset another resquiggle is needed, using as reference the .fasta files for the folders of the train set and the .fna files for the test set. The command for tombo 1.5.1 is the one described in the other repo, basecalling_architectures:

tombo resquiggle --processes <#cores> --dna --num-most-common-errors 5 --ignore-read-locks --overwrite

The --overwrite option is needed.
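Written out fully it would look roughly like this, with the fast5 directory and the reference file (chosen as described above) as positional arguments; the paths here are placeholders:

tombo resquiggle <fast5_dir> <reference.fasta> \
--processes <#cores> --dna --num-most-common-errors 5 --ignore-read-locks --overwrite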

I can't remember if Acinetobacter ursingii had fastq, fasta or fna files, but I had made a script to resquiggle the whole dataset with tombo using the correct reference file; it's "src/sbonito/apply_tombo.py" in this repository:

https://github.com/MLinApp-polito/mla-prj-23-mla-prj-11-gu3/tree/main

Hope this helps you, have a nice day!

riseldakodra commented 5 months ago

@Francesco-Carlucci Thank you for your reply. I tried performing the resquiggling with the fasta files for the Wick train set, but it didn't work. The link you sent is not public. Could you provide the script?

Thank you again, have a nice day!

Francesco-Carlucci commented 5 months ago

What the script does is just the following command:

tombo resquiggle --processes <#cores> --dna --num-most-common-errors 5 --ignore-read-locks --overwrite

using as the reference file the read_references.fasta file if it is in the species directory, otherwise the species' .fna file in genomes/. I hope this works. Do you get the "Reads do not contain basecalls" error when you chunk it with data_prepare_numpy?

This repository is public and it's the same: https://github.com/Francesco-Carlucci/GU03-bioinspired-basecaller/blob/main/sbonito/apply_tombo.py