rrwick / Deepbinner

a signal-level demultiplexer for Oxford Nanopore reads
GNU General Public License v3.0

balancing training data #2

Closed by osilander 6 years ago

osilander commented 6 years ago

As you note, balancing gives each barcode the same number of samples, which is necessarily limited to the number of samples for the least abundant barcode. I have >250K reads for most rapid barcodes but only 60K for one, and I would like to exclude that barcode from the balancing. This means I would train Deepbinner on only 11 (RBK) barcodes. I guess that should be possible? However, I don't see an option to specify which barcodes to include in the training (I could be missing it). I could exclude the fast5s that Albacore bins into that barcode and give Porechop only the fast5s binned into the other 11, but Porechop can still bin some reads into the excluded barcode, so the problem is not (necessarily) solved. Hope that makes sense, unless I'm totally off base.

rrwick commented 6 years ago

Nope, you're not off base, and this is a good point! I've just added a new --barcodes option to the deepbinner balance command. Now you should be able to do this:

deepbinner balance --barcodes 1,2,3,4,5,6,7,8,9,10,12 unbalanced_training_data balanced_training_data

I've also tweaked a few other things in Deepbinner's training and classification modules so they are better able to deal with barcode counts other than 12.

Thanks for the suggestion. Grab the current version (v0.1.1) and let me know if you have any issues with the new functionality!

Ryan
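
For readers following the thread, the effect of excluding a low-count barcode before balancing can be sketched in a few lines of Python. This is only an illustration of the idea discussed above (downsample every selected barcode to the size of the smallest selected one); it is not Deepbinner's actual code. The function names `balance` and `parse_barcodes`, the toy read counts, and the choice of barcode 11 as the low-count barcode (taken from the example command) are all assumptions made for the example.

```python
import random


def parse_barcodes(text):
    """Parse a comma-separated barcode list such as '1,2,3' (hypothetical helper)."""
    return [int(x) for x in text.split(",")]


def balance(samples_by_barcode, barcodes=None, seed=0):
    """Downsample each selected barcode to the size of the smallest selected one.

    samples_by_barcode: dict mapping barcode number -> list of samples.
    barcodes: barcode numbers to keep; None keeps every barcode.
    """
    if barcodes is None:
        barcodes = sorted(samples_by_barcode)
    selected = {b: samples_by_barcode[b] for b in barcodes}
    # The least abundant selected barcode sets the per-barcode sample count.
    target = min(len(samples) for samples in selected.values())
    rng = random.Random(seed)
    return {b: rng.sample(samples, target) for b, samples in selected.items()}


# Toy data (assumed): 250 reads per barcode, except barcode 11 with only 60.
counts = {b: 250 for b in range(1, 13)}
counts[11] = 60
data = {b: [f"bc{b}_read{i}" for i in range(n)] for b, n in counts.items()}

# Balancing all 12 barcodes is capped by barcode 11.
balanced_all = balance(data)
total_all = sum(len(v) for v in balanced_all.values())

# Balancing only the other 11 barcodes keeps far more training samples.
keep = parse_barcodes("1,2,3,4,5,6,7,8,9,10,12")
balanced_subset = balance(data, keep)
total_subset = sum(len(v) for v in balanced_subset.values())

print(total_all)     # 12 barcodes x 60 samples = 720
print(total_subset)  # 11 barcodes x 250 samples = 2750
```

With these toy numbers, dropping the one sparse barcode raises the balanced training set from 720 to 2750 samples, which is the trade-off raised in the original question: one fewer class in exchange for several times more data per remaining class.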