google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

Public raw train dataset availability #62

Closed marcpaga closed 1 year ago

marcpaga commented 1 year ago

Thanks for this very interesting piece of work.

I was wondering whether a raw sequencing dataset that also contains the true labels is available for training.

From your publication I found:

Basically, I am asking whether there is a file that gives the true complete sequence for each entry in the raw FASTQ. If not, should I take the DeepConsensus predictions, align them against the HG002 genome, and take the reference genome as truth?
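To make the second option concrete, here is a minimal sketch of what "take the reference genome as truth" could look like for a single aligned read. It is not the DeepConsensus pipeline: the function names, the toy reference string, and the example CIGAR are all hypothetical, and it assumes an aligner has already produced a 0-based reference start and a CIGAR string for the read. One caveat with this approach is that any position where the sequenced sample genuinely differs from the chosen reference would be labeled as an error.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")


def reference_span_length(cigar: str) -> int:
    """Return how many reference bases an alignment with this CIGAR covers."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )


def truth_label(reference: str, ref_start: int, cigar: str) -> str:
    """Return the reference bases spanned by the alignment.

    ref_start is 0-based (SAM POS minus 1). The returned string is used
    as a stand-in truth label for the aligned read.
    """
    return reference[ref_start:ref_start + reference_span_length(cigar)]


# Toy data for illustration only, not real HG002 sequence.
reference = "ACGTACGTTAGC"
label = truth_label(reference, ref_start=2, cigar="2S4M1I3M1D2M")
print(label)
```

Soft clips and insertions do not consume reference bases, so they do not extend the label; deletions do, so the label can be longer than the aligned part of the read.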

Best, Marc

pichuan commented 1 year ago

Hi @marcpaga, can you take a look at https://github.com/google/deepconsensus/blob/r1.1/docs/generate_examples.md first and see whether it explains how we create training data? Let us know if anything there is unclear.

pichuan commented 1 year ago

Hi @marcpaga, I'll close this issue, but feel free to reopen it if you still have questions.