marbl / CHM13

The complete sequence of a human genome

Telomere-to-telomere consortium CHM13 project

We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases are preserved. The data include 30x PacBio HiFi, 120x Oxford Nanopore, 70x PacBio CLR, and 50x 10X Genomics coverage, as well as BioNano DLS and Arima Genomics HiC. Most raw data is available from this site, with the exception of the PacBio data, which was generated by the University of Washington/PacBio and is available from NCBI SRA.

A UCSC browser hub is available for CHM13 and T2T-Primates. Track updates will be made to this hub until integrated into the UCSC Genome Browser for hs1. Legacy UCSC browsers are available for v2.0, v1.0 and v1.1 versions.

An interactive dotplot visualization of all genomic repeats is also available from resgen.io. Known issues identified in the assembly are tracked at CHM13 issues.

Latest assembly release

T2T-CHM13v2.0 (T2T-CHM13+Y)

Complete T2T reconstruction of a human genome with Y. The change from v1.1 is the addition of a finished chromosome Y from the GIAB HG002/NA24385 sample, sequenced by both GIAB and HPRC. This genome is also available at NCBI (GCA_009914755.4) and at UCSC. Note that although the UCSC browser shows the GenBank accessions as sequence names, it can load annotations in BED/bigBed/BAM/CRAM/bigWig and other formats, and can search, using the "chr1/2/etc" names.

Previous assembly releases are available below:

Downloads

Sequencing data

The sequencing dataset generated for CHM13 is available on this page.

Analysis set

An analysis set for using T2T-CHM13v2.0 (T2T-CHM13+Y) as a reference for mapping-based research is available on AWS with a README.

Sep. 28 2022 update: all analysis-set fa.gz files have been re-compressed with bgzip. Index files are available at aws with updated md5s in the README.
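Downloaded files can be checked against the md5s listed in the README. The following is a minimal sketch, assuming GNU `md5sum` is installed (macOS ships `md5` instead); `verify_md5` is a hypothetical helper name, not part of any T2T tooling.

```shell
# Compare a file's MD5 digest with an expected value.
# verify_md5 is a hypothetical helper; the expected digest would be
# copied by hand from the README.
verify_md5() {
    local file="$1"
    local expected="$2"
    local actual
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    # If the file cannot be read, the digest is empty and the check fails.
    actual="$(md5sum "$file" 2>/dev/null | awk '{print $1}')"
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}
```

Usage would look like `verify_md5 <downloaded-file> <md5-from-README>`, where both arguments are placeholders to be filled in.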

Gene annotation

Repeat annotation

Epigenetic profile

Variant calls

Liftover resources

Non-syntenic region

Notes on downloading files

Files are generously hosted by Amazon Web Services under s3://human-pangenomics/T2T/CHM13 and through this web interface.

Although the files are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. References should be amended to use the s3:// addressing scheme, i.e. replace https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/ with s3://human-pangenomics/T2T/ to download. For example, to download CHM13_prep5_S13_L002_I1_001.fastq.gz to the current working directory, use the following command:

aws s3 --no-sign-request cp s3://human-pangenomics/T2T/CHM13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz .

To download the full dataset instead, use the following command:

aws s3 --no-sign-request sync s3://human-pangenomics/T2T/CHM13/ .
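The HTTPS-to-s3 rewrite described above can be scripted when many links need converting. This is a minimal sketch; `https_to_s3` is a hypothetical helper name, and the two prefixes are the ones given in the text.

```shell
# Rewrite an HTTPS link for the bucket into its s3:// equivalent.
# https_to_s3 is a hypothetical helper name used for illustration.
https_to_s3() {
    local url="$1"
    local http_prefix="https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/"
    local s3_prefix="s3://human-pangenomics/T2T/"
    case "$url" in
        # Strip the HTTPS prefix and prepend the s3:// one.
        "$http_prefix"*) printf '%s\n' "${s3_prefix}${url#"$http_prefix"}" ;;
        # Anything else (including existing s3:// URIs) is passed through unchanged.
        *) printf '%s\n' "$url" ;;
    esac
}
```

The result can then be handed to the CLI, for example `aws s3 --no-sign-request cp "$(https_to_s3 "$url")" .`, where `$url` holds one of the HTTPS links.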

The s3 command can also be used to get information on the dataset, for example reporting the size of every file in human-readable format.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/ 

To obtain technology-specific sizes, list the relevant prefixes:

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/fast5
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/rel2
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/assemblies

Amending the max_concurrent_requests etc. settings as per this guide will improve download performance further.
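As an illustration of the kind of change meant here, the AWS CLI exposes these S3 transfer settings through `aws configure set`. The sketch below wraps two of them; `tune_s3_transfers` is a hypothetical helper name, and the values 20 and 10000 are illustrative assumptions rather than recommendations from this project (the CLI defaults are 10 and 1000).

```shell
# Raise S3 transfer parallelism for the default AWS CLI profile.
# tune_s3_transfers is a hypothetical helper; the default values
# below are illustrative and should be adjusted to the local network.
tune_s3_transfers() {
    local requests="${1:-20}"
    local queue="${2:-10000}"
    # Number of S3 requests the CLI may have in flight at once.
    aws configure set default.s3.max_concurrent_requests "$requests"
    # Number of tasks that may wait in the transfer queue.
    aws configure set default.s3.max_queue_size "$queue"
}
```

Calling `tune_s3_transfers` with no arguments applies the illustrative values; `tune_s3_transfers 32 5000` would set both explicitly. The settings are written to the AWS CLI configuration file and persist across sessions.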

Contact

Please raise issues concerning this dataset on this GitHub repository.

Data reuse and license

All data is released to the public domain (CC0) and we encourage its reuse. We would appreciate it if you would acknowledge and cite the "Telomere-to-Telomere" (T2T) Consortium for the creation of this data. More information about our consortium can be found on the T2T homepage, and a list of related citations is available below:

T2T-CHM13v2.0, datasets released along with v2.0 and the T2T-Y chromosome

The complete sequence of a human genome and companion papers (T2T-CHM13v0.9-v1.1):

  1. Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. Science, 2022.
  2. Vollger MR, et al. Segmental duplications and their variation in a complete human genome. Science, 2022.
  3. Gershman A, et al. Epigenetic Patterns in a Complete Human Genome. Science, 2022.
  4. Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, et al. A complete reference genome improves analysis of human genetic variation. Science, 2022.
  5. Hoyt SJ, et al. From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science, 2022.
  6. Altemose N, et al. Complete genomic and epigenetic maps of human centromeres. Science, 2022.
  7. Wagner J, et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol, 2022.
  8. McCartney AM, Shafin K, Alonge M, et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods, 2022.
  9. Formenti G, Rhie A, et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods, 2022.
  10. Jain C, et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods, 2022.
  11. Altemose N, Maslan A, Smith OK, et al. DiMeLo-seq: a long-read, single-molecule method for mapping protein–DNA interactions genome wide. Nat Methods, 2022.

Earlier citations:

  1. Vollger MR, et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Annals of Human Genetics, 2019.
  2. Miga KH, Koren S, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature, 2020.
  3. Nurk S, Walenz BP, et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research, 2020.
  4. Logsdon GA, et al. The structure, function, and evolution of a complete human chromosome 8. Nature, 2021.

History

* rel1 and 2: 2nd March 2019. Initial release.
* asm v0.6 and canu rel2 assembly: 28th May 2019. Assembly update.
* Hi-C data added: 25th July 2019. Data update.
* asm v0.6 alignments of rel2 added: 30th Aug 2019. Data update.
* rel3: 16th Sept 2019. Data update.
* chrX v0.7, canu 1.9 and flye 2.5 rel3 assembly: 24th Oct 2019. Assembly update.
* shasta rel3 assembly: 20th Dec 2019. Assembly update.
* chr8 v3, rel4 data: 21 Feb 2020. Data and assembly update.
* update rel3 partition names since some tars included more than a single partition. 16 Apr 2020.
* add CLR/HiFi mappings to chrX v0.7. 8 May 2020.
* update partitions 23,28,30,53,55 and add 227-231 (data was missing from upload). 13 May 2020. Data update.
* add rel5 guppy 3.6.0 data: 4 Jun 2020. Data update.
* add chr8 v9. Aug 26 2020. Assembly update.
* add v0.9/v1.0 genome releases. Sept 22 2020. Assembly update.
* add v0.9/v1.0 alignment files. Sept 29 2020. Assembly update.
* add new UW data. Oct 6 2020. Data update.
* add rna-seq data. Dec 4 2020. Data update.
* add repeat and telomere annotations for v1.0. Dec 17 2020. Assembly annotation update.
* v1.1 assembly and related files. May 7 2021. Assembly update.
* v2.0 assembly and related files. Dec 2 2022. Assembly and annotation update.
* 1KGP variant calls for all chromosomes. Jan. 3 2023. Annotation update.
* 1KGP and SGDP bam / vcf released publicly on [AnVIL_T2T_CHRY](https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_T2T_CHRY). May 23 2023. Data update.
* 1KGP AF release. Jul 6 2023. Annotation update.
* Curated RefSeq/Liftoff v5.1 release. Jul 6 2023. Annotation update.
* Curated RefSeq/Liftoff v5.2 release. Aug 23 2024. Protein coding gene annotation update.
* Link page for custom RepeatMasker library with T2T repeats. Nov 19 2024.