human-pangenomics / HPP_Year1_Assemblies

Assemblies from HPP Year 1 production

Year 1 Assemblies

This repo describes assemblies produced by the Human Pangenome Reference Consortium (HPRC) from year 1 data. Assemblies for 47 samples are available. For information about data reuse and publishing with HPRC data, please see the HPRC's Data Use Protocol.


Obtaining assemblies

All assemblies are accessioned at GenBank. They can be downloaded from the public HPRC S3 bucket with no egress fee. Individual S3 and GCP URLs for each assembly can be found in the index file in this repo.

Assemblies are also available as a single 1.5 GB file in the AGC format. Users need to download the AGC tool to extract individual assemblies. For example:

# download all assemblies
# (-L follows redirects; the URL is quoted so the shell does not interpret the "?")
curl -L -o HPRC-yr1.agc "https://zenodo.org/record/5826274/files/HPRC-yr1.agc?download=1"

# download precompiled AGC binary for Linux
curl -L https://github.com/refresh-bio/agc/releases/download/v1.1/agc-1.1_x64-linux.tar.gz|tar -zxvf - agc-1.1_x64-linux/agc

# list all samples
agc-1.1_x64-linux/agc listset HPRC-yr1.agc

# extract sample NA18906.1
agc-1.1_x64-linux/agc getset HPRC-yr1.agc NA18906.1 > NA18906.1.fa
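
To extract every sample rather than a single one, the listing and extraction commands above can be combined in a loop. The following is a minimal sketch, assuming that `agc listset` prints one sample name per line and that sample names contain no whitespace or glob characters. The function name `extract_all_samples` and its arguments are hypothetical and not part of AGC.

```shell
# Extract every sample in an AGC archive into its own FASTA file.
# Usage: extract_all_samples <path-to-agc-binary> <archive.agc> <output-dir>
# Assumption: `agc listset` prints one sample name per line.
extract_all_samples() {
    local agc_bin="$1" archive="$2" outdir="$3"
    local samples sample

    mkdir -p "$outdir" || return 1

    # Collect the sample names first so a failed listing is caught early.
    samples=$("$agc_bin" listset "$archive") || return 1

    # Unquoted on purpose: split the listing into one name per iteration.
    for sample in $samples; do
        "$agc_bin" getset "$archive" "$sample" > "$outdir/$sample.fa" || return 1
    done
}

# Example call (not run here):
#   extract_all_samples agc-1.1_x64-linux/agc HPRC-yr1.agc extracted
```

Each output file is named after its sample, matching the single-sample example (for instance `extracted/NA18906.1.fa`).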

GenBank Version of Assemblies

Freeze 1 (v2) assemblies were uploaded to GenBank/NCBI and are now available. These assemblies are expected to be the final release of the year 1 assemblies and should be used for all analyses and pangenome work. A list of the current assemblies and their download links can be found in the index file assembly_index/Year1_assemblies_v2_genbank.index.

As part of the upload to GenBank, contaminated contigs were identified and dropped. However, some contigs that are almost certainly contamination were not identified. In addition, leading/trailing hard-masked nucleotides were trimmed from the contigs, which changes the coordinates for those contigs. Lastly, contigs were renamed by GenBank with their accession IDs. Files have been provided in the genbank_changes folder to document these changes.
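
The coordinate change caused by trimming is a simple offset: only bases removed from the start of a contig shift the positions that remain, while bases removed from the end just shorten it. The sketch below illustrates that arithmetic for 1-based positions. It does not parse the files in genbank_changes (their format is not assumed here), and the function name `lift_position` and its arguments are hypothetical.

```shell
# Map a 1-based position on an untrimmed contig to the trimmed contig.
# Usage: lift_position <position> <leading-bases-trimmed> <trailing-bases-trimmed> <original-length>
# Prints the new position, or returns 1 if the position was inside a trimmed region.
lift_position() {
    local pos="$1" lead="$2" trail="$3" len="$4"

    # Positions inside the leading or trailing trimmed stretch no longer exist.
    if [ "$pos" -le "$lead" ] || [ "$pos" -gt $((len - trail)) ]; then
        return 1
    fi

    # Every surviving base moves left by the number of leading bases removed.
    echo $((pos - lead))
}

# Example with made-up numbers: 1000 leading and 200 trailing bases trimmed
# from a 10000 bp contig.
#   lift_position 1500 1000 200 10000
```

With those made-up numbers, position 1500 on the original contig becomes position 500 on the trimmed one.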

Files in the genbank_changes folder:


Assembly Process

Assemblies were processed in AnVIL using publicly available workflows in the Human Pangenomics hpp_production_workflows repo.

A summary of the process is below:


Automated QC

After assembly masking, decontamination, and MT correction, assemblies underwent automated QC in AnVIL with the following tools:

Select metrics were extracted and placed into the automated_qc_results/ directory of this repo: raw values are in a CSV alongside charts for N50, hamming/switch error rates, QV values, and contig counts. Full results from automated QC are included next to the assemblies in both the AWS and GCP HPRC buckets.

For more information about the automated QC, please see the QC workflows in the HPP GitHub Repo.


Data Layout

Assemblies are available in the working directory of the HPRC S3 and GCP buckets and are organized by each trio's child sample ID:

 ── working/
    └── HPRC/
        └── HG00438/
            └── raw_data/ 
            └── assemblies/
                └── year1_f1_assembly_v2_genbank/
                    └── HG00438.maternal.f1_assembly_v2_genbank.fa.gz
                    └── HG00438.paternal.f1_assembly_v2_genbank.fa.gz              
                └── year1_freeze_assembly_v2/
                    └── HG00438.maternal.f1_assembly_v2.fa.gz
                    └── HG00438.paternal.f1_assembly_v2.fa.gz
                    └── assembly_qc/  
                        └── asmgene/
                        └── dipcall_v0.1/
                        └── dipcall_v0.2/
                        └── merqury/
                        └── quast/
                        └── yak/

Note that the current version of the assemblies is under year1_f1_assembly_v2_genbank/, but prior versions are also listed. Automated QC for the assemblies is under year1_freeze_assembly_v2/ since the QC was run on that version of the assemblies.

Raw hifiasm output (including GFAs) is included for the current assembly release in each sample's working area under assemblies/hifiasm_v0.14_raw/.

A complete list of the paths to the assembly fastas in S3/GCP can be found in the assembly_index/ directory of this repo.
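
Because the layout is regular, the path to a current assembly can also be built from the sample ID and haplotype. The sketch below follows the directory tree shown above. The function name `hprc_assembly_path` is hypothetical, and the bucket root is left as a caller-supplied argument because the bucket URL is not stated in this README. The index files remain the authoritative source for paths.

```shell
# Build the path to one haplotype of the current (GenBank) assembly release.
# Usage: hprc_assembly_path <bucket-root> <sample-id> <maternal|paternal>
# Assumption: <bucket-root> is the S3 or GCP URL of the HPRC bucket,
# supplied by the caller.
hprc_assembly_path() {
    local bucket_root="$1" sample="$2" haplotype="$3"
    local release="year1_f1_assembly_v2_genbank"

    case "$haplotype" in
        maternal|paternal) ;;
        *) echo "haplotype must be 'maternal' or 'paternal'" >&2; return 1 ;;
    esac

    # Strip one trailing slash from the root so the result has no "//".
    printf '%s/working/HPRC/%s/assemblies/%s/%s.%s.f1_assembly_v2_genbank.fa.gz\n' \
        "${bucket_root%/}" "$sample" "$release" "$sample" "$haplotype"
}

# Example download with the AWS CLI (not run here). --no-sign-request skips
# credentials and assumes the bucket allows anonymous reads:
#   aws s3 cp --no-sign-request "$(hprc_assembly_path "$HPRC_BUCKET" HG00438 maternal)" .
```

Here `HPRC_BUCKET` is a placeholder variable for the bucket URL taken from the index file.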

AnVIL users can access the data stored in GCP through the public AnVIL_HPRC workspace. Alternatively, data can be accessed directly from AWS or GCP, and data stored in the HPRC S3 Bucket can be accessed without egress fees.


Known Issues

Assembly Change Log

* Jan 15, 2021: internal release of v1 assemblies
* Mar 08, 2021: internal release of v2 assemblies
  * Added two new samples
  * Added HiFiAdapterFilt preprocessing of HiFi data
  * Switched to Hifiasm v0.14
  * Added masking based on minimap2 alignments of SMRTBell adapter dimer
* Mar 18, 2021: fixed misjoin and false duplications in paternal assemblies for HG01123 & HG01358
* Apr 06, 2021: fixed misjoin and false duplication in HG002 maternal contig.
* Jun 23, 2021: internal release of v2 assemblies (GenBank version)
  * GenBank identified and dropped around 3k contigs (almost all EBV)
  * Assemblies were renamed to reflect their accessions in GenBank
  * Some assemblies were trimmed to remove leading/trailing Ns from adapter hard masking