google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

About making ground truth. #74

Crispy13 closed this issue 10 months ago

Crispy13 commented 10 months ago

Hi.

According to your video, the labels come from an assembly, i.e. a reference sequence FASTA, right?

I'm a noob here, so I'm not sure, but is it okay to use the reference bases as they are? The reads may contain real variant bases.

If a position has a germline variant C (ref is A) with an allele frequency (AF) of 100%, shouldn't the correct label for that position be C instead of the reference base A?

kishwarshafin commented 10 months ago

Hi @Crispy13,

We use the "reference" and "reads" from the same sample to reduce this confusion. For example, when DeepConsensus was first trained, we used the HG002 assembly (or reference) as the truth for the reads. Haploid samples like CHM13 reduce this issue further. So, usually, when we use the assembly of the same sample, we don't expect all of the variation to be present in the assembly.
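To make the same-sample point concrete, here is a toy sketch. It is not DeepConsensus's actual labeling code: the sequences and the `mismatches` helper are made up for illustration, and it assumes the read is already aligned to the truth with no indels. It compares one read against an external reference and against an assembly of the same sample, which is assumed to carry the sample's homozygous C.

```python
def mismatches(read: str, truth: str) -> list:
    """Return (position, truth_base, read_base) for each differing position.

    Hypothetical helper: assumes read and truth are pre-aligned, same length.
    """
    if len(read) != len(truth):
        raise ValueError("read and truth must have the same length")
    return [(i, t, r) for i, (t, r) in enumerate(zip(truth, read)) if t != r]


# Made-up sequences for illustration only.
external_reference = "ACGTAACGT"  # unrelated reference: A at position 4
sample_assembly = "ACGTCACGT"     # same-sample assembly: homozygous C at position 4
read = "ACGTCACGG"                # real variant C at position 4, sequencing error at position 8

vs_external = mismatches(read, external_reference)
vs_assembly = mismatches(read, sample_assembly)

print("vs external reference:", vs_external)
print("vs same-sample assembly:", vs_assembly)
```

Against the external reference, the real variant at position 4 looks like an error alongside the true sequencing error at position 8. Against the same-sample assembly, only the sequencing error remains, so the label at position 4 is already C. A heterozygous site in a diploid sample is a harder case, which is where haploid samples like CHM13 help.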

I hope this helps. You can read more about genome assembly here

Crispy13 commented 10 months ago

Thank you for replying