haotianteng / Chiron

A basecaller for Oxford Nanopore Technologies' sequencers

Confusion about labeled samples from the paper #88

Closed by drewstone 5 years ago

drewstone commented 5 years ago

Hello, I'm confused about the training procedure described in the paper. You were clear on how you partitioned the input signal data (sliding windows of length 300 with a step size of 30), but it was not clear how you partitioned the labels of these signals for the outputs.

Can you elaborate on how you gave each length-300 signal segment a label for training? Do you k-mer expand the base reading in some fashion? The uppercase K used in the paper does not seem to be documented anywhere afterwards. I'm also having a hard time finding it in this repo.

haotianteng commented 5 years ago

When one has a labeled dataset, it typically looks like this:

1-15 A
15-20 C
20-27 C
...

The first column gives the signal range covered by the base in the second column. To transform this into the format that Chiron uses, accumulate the signal base by base (here, signal points 1-15, 15-20, 20-27, ...) until adding the next base would make the signal longer than 300, and then pad the signal to length 300.
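A minimal sketch of the accumulation described above may make this concrete. It is not the code from chiron_input.py: the function name `segment_labeled_signal`, the `(start, end, base)` tuple layout, the half-open `[start, end)` indexing, zero padding, and the handling of a single base longer than the window are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# (padded signal, number of real signal points, bases covered by the segment)
Segment = Tuple[List[float], int, List[str]]


def _finish(signal: List[float], bases: List[str],
            max_len: int, pad_value: float) -> Segment:
    """Pad an accumulated signal up to max_len and keep its true length."""
    padded = signal + [pad_value] * (max_len - len(signal))
    return padded, len(signal), list(bases)


def segment_labeled_signal(signal: Sequence[float],
                           events: Sequence[Tuple[int, int, str]],
                           max_len: int = 300,
                           pad_value: float = 0.0) -> List[Segment]:
    """Group consecutive labeled bases into segments of at most max_len points.

    Assumption: each event is (start, end, base) and covers signal[start:end].
    """
    segments: List[Segment] = []
    cur_signal: List[float] = []
    cur_bases: List[str] = []
    for start, end, base in events:
        length = end - start
        # The next base would push the segment past max_len: close the segment.
        if cur_bases and len(cur_signal) + length > max_len:
            segments.append(_finish(cur_signal, cur_bases, max_len, pad_value))
            cur_signal, cur_bases = [], []
        if length > max_len:
            # Assumption: a base that cannot fit in one window is skipped.
            continue
        cur_signal.extend(signal[start:end])
        cur_bases.append(base)
    if cur_bases:
        segments.append(_finish(cur_signal, cur_bases, max_len, pad_value))
    return segments


# Usage with the example labels from this thread and a made-up signal.
example_signal = [float(i) for i in range(30)]
example_events = [(1, 15, "A"), (15, 20, "C"), (20, 27, "C")]
segments = segment_labeled_signal(example_signal, example_events)
```

With the default window of 300, the three example bases fit into a single segment whose label is the base list A, C, C, and whose signal is padded from 26 points up to 300.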

You can find this part of the code in chiron_input.py, in the function read_tfrecord. I hope this answers your question.