Telomere-to-telomere consortium CHM13 project

We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. Because the DNA is native, modified bases are preserved. The data includes 30x PacBio HiFi, 120x Oxford Nanopore, 70x PacBio CLR, and 50x 10X Genomics coverage, as well as BioNano DLS and Arima Genomics Hi-C. Most raw data is available from this site, with the exception of the PacBio data, which was generated by the University of Washington/PacBio and is available from NCBI SRA.

A UCSC browser hub is available for CHM13 and T2T-Primates. Track updates will be made to this hub until they are integrated into the UCSC Genome Browser for hs1. Legacy UCSC browsers are available for the v2.0, v1.0, and v1.1 versions.

An interactive dotplot visualization of all genomic repeats is also available from resgen.io. Known issues identified in the assembly are tracked at CHM13 issues.

Latest assembly release

T2T-CHM13v2.0 (T2T-CHM13+Y)

Complete T2T reconstruction of a human genome with Y. The change from v1.1 is the addition of a finished chromosome Y from the GIAB HG002/NA24385 sample, sequenced by both GIAB and HPRC. This genome is also available at NCBI (GCA_009914755.4) and at UCSC. Note that although the UCSC browser shows the GenBank accessions as sequence names in the browser itself, it can load annotations in BED/bigBed/BAM/CRAM/bigWig and other formats, or run searches, using the "chr1/2/etc" names.

Previous assembly releases are available below:

T2T-CHM13: v0.7-v1.1
T2T-HG002XY: v0.7-2.7

Downloads

Sequencing data

The sequencing dataset generated for CHM13 is available on this page.

Analysis set

An analysis set for using T2T-CHM13v2.0 (T2T-CHM13+Y) as a reference in mapping-based research is available on AWS, with a README.

chm13v2.0.fa.gz: T2T-CHM13v2.0 assembly with sequences soft-masked using the repeat models discovered by the T2T team. The original sequence accession numbers are shown in the FASTA header.
chm13v2.0_noY.fa.gz: excluding the Y chromosome. This file only contains sequences derived from the CHM13 cell line and is identical to T2T-CHM13v1.1. Use this file for benchmarking assemblies of CHM13.
chm13v2.0_PAR.bed: pseudoautosomal regions (PARs)
chm13v2.0_maskedY.fa.gz: PARs on chrY hard masked to "N"
chm13v2.0_maskedY.rCRS.fa.gz: PARs on chrY hard masked to "N" and mitochondrion replaced with rCRS (AC:NC_012920.1)

Sep. 28 2022 update: all analysis-set fa.gz files have been re-compressed with bgzip. Index files are available on AWS, with updated md5s in the README.
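Because the FASTA headers carry the original accession numbers while UCSC displays GenBank accessions as sequence names, a name-to-accession table can be handy. The following is a minimal sketch only. It assumes each header has the layout ">name accession ..." (sequence name first, accession second), which should be confirmed against a real header before use. The function name fasta_accession_map is hypothetical.

```shell
# Print "sequence-name<TAB>accession" for every record in a gzipped FASTA.
# ASSUMPTION: headers look like ">chr1 <accession> ..." (name, then accession).
# Inspect a real header first:  gzip -dc chm13v2.0.fa.gz | head -n 1
fasta_accession_map() {
    # bgzip output is gzip-compatible, so plain gzip can stream it.
    gzip -dc -- "$1" | awk '/^>/ { print substr($1, 2) "\t" $2 }'
}
```

Under that assumption, running fasta_accession_map chm13v2.0.fa.gz > chm13v2.0.names.tsv would write one line per sequence.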

Gene annotation

JHU RefSeqv110 + Liftoff v5.2: This contains curated annotations of the ampliconic genes on the Y chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and RefSeqv110 annotation. Additional copies found in T2T-Y were annotated to the closest available gene in RefSeq, allowing multiple genes to have the same common name. This file has been modified to correct special character issues from the original file. A fuller description is available here. The update log from v5 to v5.1 is available here.
UCSC GENCODEv35 CAT/Liftoff v2
- CAT/Liftoff v1 annotation for VEP in Sorted GFF and TABIX index
- Protein coding translated transcripts from CAT/Liftoff v1 annotation. Note that these are transcripts, not genes, and are searchable only by transcript ID (IDs like LOFF_T, not LOFF_G).
NCBI RefSeqv110 from FTP
EBI GENCODEv38 r2 from HPRC Projects

Repeat annotation

Cytobands
Segmental Duplications, v2022-03-11 in simple and full bed format
Cen/Sat v2.1: A more comprehensive centromere/satellite repeat annotation. (Recolored to be consistent with the primates Cen/Sat tracks.)
RepeatMasker v4.1.2p1.2022Apr14 in bed or native out. Here is a great resource for building a custom RepeatMasker library with new repeat models from the T2T genomes, and a walkthrough for running RepeatMasker.
Composite Repeats, 2022DEC
New Satellites, 2022DEC
chrXY sequence class, v1
Telomere
Y specific annotation

Epigenetic profile

ENCODE, recalled on T2T-CHM13v2.0
HG002 and CHM13 5mC CpG and other methylation from ONT and HiFi

Variant calls

1000 Genomes Project, recalled on T2T-CHM13v2.0. Now available for all chromosomes, for all 3,202 samples or the 2,504 unrelated samples. Reference sets, BAM, and VCF files are also available on AnVIL_T2T_CHRY.
1000 Genomes Project - Allele Frequency by Population, of the unrelated samples, further excluding 14 individuals discovered to be first- and second-degree relatives (more details here).
1000 Genomes Project - Phased with SHAPEIT5, using the above variant calls.
Simons Genome Diversity Project, recalled on T2T-CHM13v2.0. Reference sets, BAM, and VCF files are also available on AnVIL_T2T_CHRY.
gnomAD v3.1.2 from FTP: This is a lifted over version from GRCh38, annotated with predicted molecular consequence and transcript-specific variant deleteriousness scores from PolyPhen-2 and SIFT using Ensembl Variant Effect Predictor.
Short-Read Accessibility Mask. The three masks used to make the combined_mask are available here. See description
ClinVar 20220313, lifted over from GRCh38. See description
GWAS v1.0, lifted over from GRCh38. See description
dbSNP build 155, lifted over from GRCh38. See description
Variants in GRCh38-Y coordinates that disappear when T2T-Y is used as a reference, v0.005. More details are here.

Liftover resources

1:1 Liftover GRCh38 <-> T2T-CHM13v2.0, see description
- GRCh38/hg38 -> T2T-CHM13v2.0: grch38-chm13v2.chain
- GRCh38/hg38 <- T2T-CHM13v2.0: chm13v2-grch38.chain
- Alignment grch38-chm13v2.paf
1:1 Liftover hg19 <-> T2T-CHM13v2.0
- GRCh37/hg19 -> T2T-CHM13v2.0: hg19-chm13v2.chain
- GRCh37/hg19 <- T2T-CHM13v2.0: chm13v2-hg19.chain
- Alignment hg19-chm13v2.paf
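As a hedged illustration of how these chain files are typically used, the sketch below wraps the UCSC liftOver tool to move a BED file from GRCh38 onto T2T-CHM13v2.0. The function name lift_to_chm13 and the assumption that the chain file sits in the current directory are illustrative. liftOver is not part of this dataset and must be installed separately.

```shell
# Lift a BED file onto T2T-CHM13v2.0 with the UCSC liftOver tool.
# Usage: lift_to_chm13 <in.bed> <out.bed> <unmapped.bed> [chain]
# ASSUMPTION: liftOver is on PATH and the chain file has already been
# downloaded; the default chain name is the GRCh38 -> CHM13v2.0 file above.
lift_to_chm13() {
    # liftOver argument order: oldFile map.chain newFile unMapped
    liftOver "$1" "${4:-grch38-chm13v2.chain}" "$2" "$3"
}
```

Passing hg19-chm13v2.chain as the fourth argument would lift hg19 coordinates instead. Intervals that cannot be mapped are written to the unmapped file rather than silently dropped.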

Non-syntenic region

Regions non-syntenic (unique) compared to GRCh38 and GRCh37 from above chains
- GRCh38/hg38: chm13v2-unique_to_hg38.bed
- GRCh37/hg19: chm13v2-unique_to_hg19.bed
Regions non-syntenic from T2T-CHM13v1.0 and T2T-CHM13v1.1 plus hg38Y by Aganezov et al. Science, 2022
- T2T-CHM13v1.0: chm13.draft_v1.0_plus38Y.no_snyteny_1Mbp.bed
- T2T-CHM13v1.1: chm13_v1.1_plus38Y.no_snyteny_1Mbp.bed

Notes on downloading files

Files are generously hosted by Amazon Web Services under s3://human-pangenomics/T2T/CHM13 and through this web interface.

Although files are available through straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. References should be amended to use the s3:// addressing scheme, i.e. replace https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/ with s3://human-pangenomics/T2T/ to download. For example, to download CHM13_prep5_S13_L002_I1_001.fastq.gz to the current working directory, use the following command:

aws s3 --no-sign-request cp s3://human-pangenomics/T2T/CHM13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz .

To download the full dataset, use the following command:

aws s3 --no-sign-request sync s3://human-pangenomics/T2T/CHM13/ .
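The HTTPS-to-s3:// rewrite described above can be scripted when many links need converting. This is a minimal sketch. The helper names to_s3_uri and fetch_from_s3 are hypothetical, and the prefix handled is the one quoted in the text.

```shell
# Rewrite an HTTPS download link into the equivalent s3:// URI.
# Input that does not start with the expected prefix is printed unchanged.
to_s3_uri() {
    printf '%s\n' "$1" |
        sed 's#^https://s3-us-west-2\.amazonaws\.com/human-pangenomics/#s3://human-pangenomics/#'
}

# Download one file given either form of link.
# Usage: fetch_from_s3 <link> [destination, default: current directory]
fetch_from_s3() {
    aws s3 --no-sign-request cp "$(to_s3_uri "$1")" "${2:-.}"
}
```

For example, fetch_from_s3 with the HTTPS link to CHM13_prep5_S13_L002_I1_001.fastq.gz runs the same cp command shown earlier.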

The s3 command can also be used to get information on the dataset, for example reporting the size of every file in human-readable format.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/

To obtain technology-specific sizes:

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/fast5
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/rel2
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/assemblies

Adjusting settings such as max_concurrent_requests, as described in this guide, will improve download performance further.
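One way to change that setting is with aws configure set, which writes to the AWS CLI configuration file. The sketch below is illustrative. The function name set_s3_concurrency is hypothetical, and the value 20 is an example rather than a recommendation from the dataset maintainers.

```shell
# Raise the AWS CLI's S3 transfer parallelism for the default profile.
# Usage: set_s3_concurrency [requests]
# ASSUMPTION: 20 is only an example value; tune it to your bandwidth and CPU.
set_s3_concurrency() {
    aws configure set default.s3.max_concurrent_requests "${1:-20}"
}
```

The change persists across sessions, so subsequent cp and sync commands pick it up automatically.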

Contact

Please raise issues concerning this dataset on this GitHub repository.

Data reuse and license

All data is released to the public domain (CC0) and we encourage its reuse. We would appreciate it if you would acknowledge and cite the "Telomere-to-Telomere" (T2T) Consortium for the creation of this data. More information about our consortium can be found on the T2T homepage, and a list of related citations is available below:

T2T-CHM13v2.0 and datasets released along with v2.0 and the T2T-Y chromosome:

Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, et al. The complete sequence of a human Y chromosome. bioRxiv, 2022.

The complete sequence of a human genome and companion papers (T2T-CHM13v0.9-v1.1):

Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. Science, 2022.
Vollger MR, et al. Segmental duplications and their variation in a complete human genome. Science, 2022.
Gershman A, et al. Epigenetic Patterns in a Complete Human Genome. Science, 2022.
Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, et al. A complete reference genome improves analysis of human genetic variation. Science, 2022.
Hoyt SJ, et al. From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science, 2022.
Altemose N, et al. Complete genomic and epigenetic maps of human centromeres. Science, 2022.
Wagner J, et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol, 2022.
McCartney AM, Shafin K, Alonge M, et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods, 2022.
Formenti G, Rhie A, et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods, 2022.
Jain C, et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods, 2022.
Altemose N, Maslan A, Smith OK, et al. DiMeLo-seq: a long-read, single-molecule method for mapping protein–DNA interactions genome wide. Nat Methods, 2022.

Earlier citations:

Vollger MR, et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Annals of Human Genetics, 2019.
Miga KH, Koren S, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature, 2020.
Nurk S, Walenz BP, et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research, 2020.
Logsdon GA, et al. The structure, function, and evolution of a complete human chromosome 8. Nature, 2021.

History

* rel1 and 2: 2nd March 2019. Initial release.
* asm v0.6 and canu rel2 assembly: 28th May 2019. Assembly update.
* Hi-C data added: 25th July 2019. Data update.
* asm v0.6 alignments of rel2 added: 30th Aug 2019. Data update.
* rel3: 16th Sept 2019. Data update.
* chrX v0.7, canu 1.9 and flye 2.5 rel3 assembly: 24th Oct 2019. Assembly update.
* shasta rel3 assembly: 20th Dec 2019. Assembly update.
* chr8 v3, rel4 data: 21 Feb 2020. Data and assembly update.
* update rel3 partition names since some tars included more than a single partition. 16 Apr 2020.
* add CLR/HiFi mappings to chrX v0.7. 8 May 2020.
* update partitions 23,28,30,53,55 and add 227-231 (data was missing from upload). 13 May 2020. Data update.
* add rel5 guppy 3.6.0 data: 4 Jun 2020. Data update.
* add chr8 v9. Aug 26 2020. Assembly update.
* add v0.9/v1.0 genome releases. Sept 22 2020. Assembly update.
* add v0.9/v1.0 alignment files. Sept 29 2020. Assembly update.
* add new UW data. Oct 6 2020. Data update.
* add rna-seq data. Dec 4 2020. Data update.
* add repeat and telomere annotations for v1.0. Dec 17 2020. Assembly annotation update.
* v1.1 assembly and related files. May 7 2021. Assembly update.
* v2.0 assembly and related files. Dec 2 2022. Assembly and annotation update.
* 1KGP variant calls for all chromosomes. Jan. 3 2023. Annotation update.
* 1KGP and SGDP bam / vcf released publicly on [AnVIL_T2T_CHRY](https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_T2T_CHRY). May 23, 2023. Data update.
* 1KGP AF release. Jul 6 2023. Annotation update.
* Curated RefSeq/Liftoff v5.1 release. Jul 6 2023. Annotation update.
* Curated RefSeq/Liftoff v5.2 release. Aug 23 2024. Protein coding gene annotation update.
* Link page for custom RepeatMasker library with T2T repeats. Nov 19 2024.
