haotianteng / Chiron

A basecaller for Oxford Nanopore Technologies' sequencers

training problem #35

Closed huangnengCSU closed 6 years ago

huangnengCSU commented 6 years ago

I am using the dataset provided on the website, "pass.tar.gz", and I ran file_batch.py to generate the batch files. Then I ran chiron_train.py to train the model. Apart from the file directory parameters, I used the default settings for everything. It always outputs "No valid path found" and the loss is Inf. Can you help me solve this problem? Thanks.

haotianteng commented 6 years ago

The batch size in file_batch.py has to be set to a larger number to get valid batches, for example: python file_batch.py --input --output --batch 100000 --max None

The larger the batch size, the larger each batch file: a batch file with a batch size of 100000 and a segment length of 512 is typically ~250 MB.
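The ~250 MB figure can be sanity-checked from the record layout quoted later in this thread (<1H512f1H512b). This is only a rough sketch: it assumes a batch file is nothing more than batch_size such records back to back, and the helper name batch_file_bytes is made up for illustration.

```python
import struct

# Record layout quoted later in this thread; 512 is the assumed segment length.
RECORD_FMT = "<1H512f1H512b"

def batch_file_bytes(batch_size, fmt=RECORD_FMT):
    """Approximate size in bytes of a file holding batch_size records (hypothetical helper)."""
    return batch_size * struct.calcsize(fmt)

record_bytes = struct.calcsize(RECORD_FMT)
print(record_bytes)                    # 2564 bytes per record
print(batch_file_bytes(100000) / 1e6)  # 256.4 (MB)
```

Under those assumptions a batch of 100000 comes out at about 256 MB, which is consistent with the estimate above.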

The --max argument is the maximum number of batch files; set it to None to convert all the reads into batch files.

I have changed the default batch_size in file_batch.py to 10000.

huangnengCSU commented 6 years ago

But if --batch is set too large, no binary file is written, because the condition "while len(event) > FLAGS.batch: " is never met.
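To illustrate the behaviour described here, the following is a hypothetical sketch of a loop guarded by the quoted condition. It is not the code in file_batch.py: the function name, the accumulation of segments across reads, and the use of None as a stand-in for signal data are all assumptions.

```python
def count_batch_files(segments_per_read, batch):
    """Count files a loop guarded by `while len(event) > batch` would write.

    segments_per_read: number of segments contributed by each read (assumed input).
    Returns (files_written, segments_left_unwritten).
    """
    event = []            # segments accumulated so far (placeholders only)
    files_written = 0
    for n_segments in segments_per_read:
        event.extend([None] * n_segments)
        # Flush one full batch at a time; note the strict '>' comparison.
        while len(event) > batch:
            del event[:batch]
            files_written += 1
    return files_written, len(event)

print(count_batch_files([300, 300, 300], 400))   # (2, 100)
print(count_batch_files([300, 300, 300], 1000))  # (0, 900)
```

In this sketch, if the total number of segments never exceeds the batch size, nothing is ever written, which matches the symptom reported above.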

haotianteng commented 6 years ago

Is 100000 too large? I think I ran it with this setting and it gave me several batch files. Typically 10000-100000 should work as long as pass.tar.gz has enough reads. Which pass.tar.gz file did you use, the one from Lambda or E. coli?

huangnengCSU commented 6 years ago

I found that if --batch is larger than --length, the ctc_decoder works. So maybe you just need to increase --batch a little.

huangnengCSU commented 6 years ago

I used the dataset you gave me. There are two datasets, one named "train_mix_hel.tar.gz" and the other named "pass.tar.gz". I used "pass.tar.gz".

haotianteng commented 6 years ago

Okay, as long as it can train, it should be fine. But note that with --batch X and --max Y you will only have X*Y events to train on, and X*Y should be around 100K to get a valid model. Many small batch files also make the input queue less efficient, so I still suggest a batch > 10K.
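As a quick illustration of that arithmetic, here is a small sketch. The helper names and the 100K default are taken from the rule of thumb in this comment, not from Chiron itself.

```python
import math

def total_events(batch, max_files):
    """Events available for training with --batch `batch` and --max `max_files`."""
    return batch * max_files

def files_needed(batch, target_events=100_000):
    """Smallest number of batch files whose total reaches target_events."""
    return math.ceil(target_events / batch)

print(total_events(10_000, 10))  # 100000
print(files_needed(10_000))      # 10
print(files_needed(600))         # 167
```

A batch size just above the segment length (600 in this example) needs well over a hundred small files to reach the same total, which is the queue-efficiency concern raised above.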

huangnengCSU commented 6 years ago

ok. Thanks.

haotianteng commented 6 years ago

I have prepared file batches here if you want to use them: https://data.genomicsresearch.org/Projects/basecall/Ecoli-S10/file_batch

wget -r -A 'data*' https://data.genomicsresearch.org/Projects/basecall/Ecoli-S10/file_batch/ should do the job

AEDWIP commented 6 years ago

Hi Haotian Teng

I am interested in trying to reproduce your published results. I noticed the data files you made available have the file suffix ".bin". I am not sure how to use them. Is this some sort of encoded binary file format?

I think the wget command may not be correct: it winds up downloading a lot of unexpected stuff. I am not sure what I need to reproduce your original results. I assume I only need Ecoli and Lambda?

$ wget --no-check-certificate -r -A 'data*' https://data.genomicsresearch.org/Projects/basecall/Ecoli-S10/file_batch/
$ ls data.genomicsresearch.org/Projects/basecall/
Basecallers_Benchmark/ Ecoli-S10/             GN003_R9/              Labelling/             TBR9.4/
CTC-BNS/               Ecoli-S18/             Human_CHR19/           Lambda_R9.4/
$ 

Kind regards

Andy

haotianteng commented 6 years ago

Yes, they are binary files, and chiron_train.py can read them. Each file contains blocks with the following format: <1H512f1H512b. The first unsigned short is the actual length of the signal, the following 512 floats are the normalized signal (the 512 depends on the segment length setting), the next unsigned short is the actual length of the label, and it is followed by 512 bytes of label (A-0, C-1, G-2, T-3). The next block then follows in the same format.

The format can also be found in the data.meta file.
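For anyone who wants to inspect a .bin file outside chiron_train.py, the layout described above can be unpacked with Python's struct module. This is a minimal sketch rather than Chiron's own reader: it assumes a segment length of 512, that the file holds nothing but consecutive records, and that label values are always 0-3; iter_records and BASES are illustrative names.

```python
import struct

SEGMENT_LEN = 512                         # assumed segment length
RECORD = struct.Struct("<1H512f1H512b")   # format string quoted above
BASES = "ACGT"                            # A-0, C-1, G-2, T-3

def iter_records(path):
    """Yield (signal, label) pairs from a file of back-to-back records."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a truncated trailing record)
            fields = RECORD.unpack(chunk)
            sig_len = fields[0]
            signal = list(fields[1:1 + sig_len])          # drop padding
            label_len = fields[1 + SEGMENT_LEN]
            label_start = 2 + SEGMENT_LEN
            label = fields[label_start:label_start + label_len]
            yield signal, "".join(BASES[b] for b in label)
```

Iterating over iter_records("some_file.bin") (the path is a placeholder) would then give the trimmed signal and the label as a base string for each block.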
