Download datasets from 1000G project

hackseq / 2016_project_6

Inferring sex chromosome and autosomal ploidy in NGS data

Download datasets from 1000G project #4

Closed BrunoGrandePhD closed 7 years ago

BrunoGrandePhD commented 7 years ago

http://www.internationalgenome.org/data

BrunoGrandePhD commented 7 years ago

Criteria:

Two populations (YRI and CEU)
Phase 3
GRCh38
5 males and 5 females from each population
Download only X and Y chromosomes
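
The criteria above amount to a small filter over a sample table. A minimal sketch in Python, assuming a list of records with hypothetical `sample`, `population` and `sex` fields (the sample IDs below are placeholders, not real 1000G IDs):

```python
from collections import defaultdict

def pick_samples(records, populations=("YRI", "CEU"), per_group=5):
    """Pick up to `per_group` males and females from each population.

    `records` is an iterable of dicts with hypothetical keys
    'sample', 'population' and 'sex' ('male' or 'female').
    """
    picked = defaultdict(list)
    for rec in records:
        key = (rec["population"], rec["sex"])
        # Skip other populations, and stop adding once a group is full.
        if rec["population"] in populations and len(picked[key]) < per_group:
            picked[key].append(rec["sample"])
    return dict(picked)


# Placeholder records, for illustration only.
records = [
    {"sample": f"{pop}_{sex}_{i}", "population": pop, "sex": sex}
    for pop in ("YRI", "CEU", "TSI")
    for sex in ("male", "female")
    for i in range(7)
]
selection = pick_samples(records)
```

Records are taken in input order, so any preference (for example, coverage) would need to be applied by sorting the records first.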

ekarlins commented 7 years ago

/hackathon/Hackathon_Project_6/DOWNLOADS/code/extractX.sh
/hackathon/Hackathon_Project_6/DOWNLOADS/code/extractY.sh

These are scripts to extract the X and Y chromosomes from alignment files on the 1000G FTP site. Usage:

./extractX.sh ftp://ftp.1000genomes.ebi.ac.uk/in.bam out.bam
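
The scripts themselves are not shown in this thread. As a rough sketch of what such an extraction might look like, assuming `samtools view` is used to pull one region from a remote, indexed alignment file (the function name and defaults below are hypothetical; `in.bam` and `out.bam` are the placeholders from the usage line above):

```python
import shlex

def extract_region_cmd(url, out_bam, region="chrX"):
    """Build a samtools command that writes one region to a BAM file.

    Assumes the remote file is indexed and that contigs use the
    'chr' prefix (as in GRCh38); this is not the actual extractX.sh.
    """
    # -b: output BAM; -o: output file; the region follows the input file.
    return ["samtools", "view", "-b", "-o", out_bam, url, region]


cmd = extract_region_cmd("ftp://ftp.1000genomes.ebi.ac.uk/in.bam", "out.bam")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True), then `samtools index out.bam`.
```

Building the command as a list rather than a single string avoids shell-quoting problems when it is later passed to `subprocess.run`.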

mathbionerd commented 7 years ago

README about 1000 genomes GRCh38:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/README.1000genomes.GRCh38DH.alignment

mathbionerd commented 7 years ago

For high-coverage PCR-free reads:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.high_coverage.GRCh38DH.alignment.index

mathbionerd commented 7 years ago

I created a file that lists the sex of each of the high coverage files that we are accessing from the ftp server.

1000genomes_GRCh38_HighCoverage_ftp_samples.txt

ekarlins commented 7 years ago

BAM files from two samples are being extracted now: HG00419 and NA20845. That is 4 BAM files in total, 1 for X and 1 for Y per sample, in /hackathon/Hackathon_Project_6/DOWNLOADS/BAMS

There are 48 scripts, 1 for X and 1 for Y for each of 24 samples, in: /hackathon/Hackathon_Project_6/DOWNLOADS/code/extractAll

ekarlins commented 7 years ago

This script makes one file per sample per chromosome for downloading the 1000G BAMs:

extractBam1000G.py.txt
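
The attached script is not reproduced here. A minimal sketch of the idea, assuming each generated file is a small shell script that calls `samtools view` for one sample and one chromosome (the function name, file naming scheme, and example URL are hypothetical):

```python
import os
import tempfile

def write_extract_scripts(samples, out_dir, chroms=("chrX", "chrY")):
    """Write one shell script per sample per chromosome.

    `samples` maps a sample ID to the URL of its alignment file.
    Returns the list of script paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for sample, url in sorted(samples.items()):
        for chrom in chroms:
            out_bam = f"{sample}.{chrom}.bam"
            path = os.path.join(out_dir, f"extract_{sample}_{chrom}.sh")
            with open(path, "w") as fh:
                fh.write("#!/bin/bash\n")
                fh.write(f"samtools view -b -o {out_bam} {url} {chrom}\n")
                fh.write(f"samtools index {out_bam}\n")
            paths.append(path)
    return paths


# Demo in a temporary directory; the URL is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    scripts = write_extract_scripts({"HG00419": "ftp://example.org/HG00419.cram"}, tmp)
    names = [os.path.basename(p) for p in scripts]
```

With 24 samples and two chromosomes, this layout yields 48 scripts.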

mathbionerd commented 7 years ago

Poznik et al. (2013) generated a mask for callable Y chromosome sequence.

Here is the mask file for hg19: chrY.callable.hg19.bed.txt

I ran it through the UCSC Genome Browser LiftOver tool to get the mask file for hg38: chrY.callable.hg38liftover.bed.txt
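
A mask like this is usually consumed as a set of intervals. A minimal sketch, assuming the file is standard three-column BED (0-based, half-open intervals); the coordinates in the example are made up, not taken from the actual mask:

```python
def read_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping headers."""
    intervals = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals

def callable_bases(intervals):
    """Total number of bases covered, assuming non-overlapping intervals."""
    return sum(end - start for _, start, end in intervals)

def is_callable(intervals, chrom, pos):
    """True if the 1-based position `pos` falls inside any interval."""
    # BED start is 0-based and end is exclusive, hence start < pos <= end.
    return any(c == chrom and s < pos <= e for c, s, e in intervals)


example = ["chrY\t100\t200\n", "chrY\t500\t650\n"]  # made-up coordinates
mask = read_bed(example)
```

The linear scan in `is_callable` is fine for a handful of lookups; for per-read filtering, an interval tree or a tool such as `bedtools intersect` would be the more usual choice.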

mathbionerd commented 7 years ago

All 48 files (24 X chromosomes and 24 Y chromosomes) are now downloaded or in the process of downloading, and can be found in the directory:

/hackathon/Hackathon_Project_6/DOWNLOADS/BAMS

It took several attempts to get some of the files to download, but I think this was all due to memory availability. The common errors seemed to come from I/O memory limitations that prevented the .cram files from being completely downloaded. The two errors when trying to extract the X and Y sequences were:

[E::hts_open_format] fail to open file 'ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/TSI/NA20502/high_cov_alignment/NA20502.alt_bwamem_GRCh38DH.20150917.TSI.high_coverage.cram'

And:

Aborted (core dumped)
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
[E::bgzf_read] bgzf_read_block error -1 after 718072 of 760536 bytes
[E::bam_hdr_read] error reading BGZF stream
samtools index: "NA19017.chrX.bam" is corrupted or unsorted
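
The "EOF marker is absent" warning points to a truncated BAM. One way to catch this before indexing, sketched here as an assumption rather than what was done in this thread, is to check for the 28-byte empty BGZF block that the SAM/BAM specification defines as the end-of-file marker. `samtools quickcheck` performs a similar check. The file names in the demo are placeholders.

```python
import os
import tempfile

# 28-byte empty BGZF block that terminates a complete BAM file
# (as defined in the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200" "1b00" "0300"
    "00000000" "00000000"
)

def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF EOF marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as fh:
        fh.seek(-len(BGZF_EOF), os.SEEK_END)
        return fh.read() == BGZF_EOF


# Demo with two fake files: one ending in the marker, one cut short.
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.bam")
    truncated = os.path.join(tmp, "truncated.bam")
    with open(complete, "wb") as fh:
        fh.write(b"payload" + BGZF_EOF)
    with open(truncated, "wb") as fh:
        fh.write(b"payload" + BGZF_EOF[:10])
    results = (has_bgzf_eof(complete), has_bgzf_eof(truncated))
```

A file that fails this check can be deleted and re-extracted before `samtools index` is attempted. Passing the check only shows that the file was not cut off at the end, not that its contents are valid.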