marbl / CHM13

The complete sequence of a human genome

Blacklist for the T2T Reference Genome #33

Closed karaquaid closed 3 years ago

karaquaid commented 3 years ago

Is there any equivalent of an ENCODE blacklist available for the T2T reference genome?

arangrhie commented 3 years ago

Hello @karaquaid,

The short answer is no, we don't have a 1:1 matching list available that is equivalent to the ENCODE blacklist. I assume you are looking for a list of regions which frequently show anomalous signals in next-generation sequencing experiments?

I can imagine that signal anomalies could arise from misassemblies, such as collapses or falsely duplicated sequences, or from regions with copy number differences private to CHM13.

For the former two, we have extensively validated the consensus and listed possible issues here; however, most are associated with lower consensus quality (Low_Qual and Error_Kmer). We didn't find any possibly duplicated sequences, only two chimeric haplotype joins and one ~10 kb collapse.

Regarding the private variants in CHM13, this paper extensively compares mappability and variant calling with CHM13 versus GRCh38 as the reference, and shows that using CHM13 improves the mapping rate for both short and long reads: Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, et al. A complete reference genome improves analysis of human genetic variation. bioRxiv, 2021. This suggests that variants private to CHM13 are found less frequently compared to GRCh38. In Fig. 1 and Fig. S1.1, the ENCODE blacklist track overlaps substantially with the non-syntenic regions or segmental duplications (SDs), which may not produce signal abnormalities when CHM13 is the reference because those sequences are better resolved.

There may still be undetected errors, or regions prone to false mappings because of lower consensus quality or the shorter read length of Illumina data. On this note, the authors of the above paper are discussing producing an accessibility track, similar to the one provided by the 1000 Genomes Project for reliable variant calling with short reads, which may be relevant. It will be posted here once it becomes available.

We are also open to hosting such a track if someone has already generated one and is willing to share it with the community.
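
As an illustration only, the listed issue regions could be applied as an ad hoc exclusion list, much as a blacklist would be. The sketch below is a rough outline and not a validated blacklist: it assumes the issues are available as a BED file (0-based, half-open intervals), and the function names `parse_bed`, `build_index`, `overlaps`, and `drop_flagged` are hypothetical helpers written for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples; skip headers and blanks."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def build_index(regions):
    """Merge overlapping or adjacent regions per chromosome.

    Returns {chrom: (starts, ends)} with both lists sorted ascending.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open interval [start, end) hits a flagged region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged region that begins before the query ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def drop_flagged(index, intervals):
    """Keep only the intervals that do not overlap any flagged region."""
    return [iv for iv in intervals if not overlaps(index, *iv)]


# Toy coordinates, made up for the example (not real CHM13 issue regions).
issues = parse_bed(["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t50"])
index = build_index(issues)
peaks = [("chr1", 50, 100), ("chr1", 250, 400), ("chr2", 60, 80)]
kept = drop_flagged(index, peaks)
```

In practice the same filtering is usually done with an existing interval tool; the point here is only that any BED-like list of regions can serve as an exclusion list.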

karaquaid commented 3 years ago

Thanks for your response!