Public raw train dataset availability #62

marcpaga commented 1 year ago

Thanks for this very interesting piece of work.

I was wondering if there is a raw sequencing dataset available for training that also contains the true labels.

From your publication I found:

- Raw fastq: the fastq file, which as far as I understand contains the raw output from PacBio, that is, the very long non-consensus reads.
- Data folder: contains tf.records with the pre-processed alignment matrices, ready to use for training. I assume these also contain the true target sequence.

Basically, I am asking whether there is a file that gives the true complete sequence for each entry in the raw fastq. If not, should I take the DeepConsensus predictions, align them against the HG002 genome, and take the aligned reference sequence as truth?
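To make the fallback idea concrete, here is a minimal sketch of how a truth label could be pulled from the reference once a read has been aligned. It is not the DeepConsensus label pipeline; the function name `reference_span` and the toy inputs are hypothetical, and it assumes the alignment is given as a 0-based reference start plus a standard SAM CIGAR string.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")

def reference_span(ref_seq: str, ref_start: int, cigar: str) -> str:
    """Return the reference substring covered by one alignment.

    ref_seq   : reference (or assembly) sequence of the aligned contig
    ref_start : 0-based leftmost aligned reference position
    cigar     : SAM CIGAR string of the alignment
    """
    # Sum the lengths of all operations that advance along the reference.
    ref_len = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )
    return ref_seq[ref_start:ref_start + ref_len]

# Toy example: soft clips and insertions do not consume reference bases,
# so only 3M + 2M + 1D + 1M = 7 reference bases are covered.
label = reference_span("ACGTACGTAC", 2, "2S3M1I2M1D1M")
print(label)  # GTACGTA
```

With real data the start position and CIGAR would come from the BAM record of each aligned prediction, and the returned span would serve as the candidate truth sequence for that read.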

Best, Marc

pichuan commented 1 year ago

Hi @marcpaga, can you take a look at https://github.com/google/deepconsensus/blob/r1.1/docs/generate_examples.md first and see if that helps explain how we create training data? Let us know if anything there is unclear.

pichuan commented 1 year ago

Hi @marcpaga, I'll close this issue, but feel free to reopen it if you still have questions.
