hackseq / 2016_project_6

Inferring sex chromosome and autosomal ploidy in NGS data
2 stars 1 forks source link

Run LASTZ on all possible combinations of chromosomes X and Y #7

Open BrunoGrandePhD opened 7 years ago

BrunoGrandePhD commented 7 years ago

http://www.bx.psu.edu/~rsharris/lastz/

tanyaphung commented 7 years ago

Reference genome is downloaded from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

The file is saved in the directory /hackathon/Hackathon_Project_6/DOWNLOADS/GRCh38

tanyaphung commented 7 years ago

Download the newest version of lastZ: http://www.bx.psu.edu/~rsharris/lastz/newer/lastz-1.03.73.tar.gz

Documentation: http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#fmt_rdotplot

ekarlins commented 7 years ago

cd /hackathon/Hackathon_Project_6/DOWNLOADS/Software wget http://www.bx.psu.edu/~rsharris/lastz/newer/lastz-1.03.73.tar.gz tar -xzf lastz-1.03.73.tar.gz

tanyaphung commented 7 years ago

I just did a quick check on running lastZ on chrY downloaded from the UCSC browser http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz, and the plot looks like something I would expect: ucscgrch38_chry_chry

So my guess is that there is something weird going on with the reference Y chromosome file from the 1000 G website? ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

Thoughts?

thw17 commented 7 years ago

It's really odd that the 1000 genomes Y is so different. I really don't know what to make of the 1000 genomes version, but completely agree that the UCSC Y looks like what we would expect.

ekarlins commented 7 years ago

Tanya, I think you told me you used sed to extract chrY from the larger fasta. Is it possible you grabbed some other contigs as well. If you had those HLA contigs, for example, I would imagine you'd get some strange results.

Can you point us to your chrY fasta files and send the code you used to generate them?

Thanks, Eric

mathbionerd commented 7 years ago

Hi Tanya,

I did the same thing last night (running lastZ self alignment on the chrY from hg38 from UCSC) and got the same results. I agree with Eric, perhaps the 1000 genomes version has extra contigs in it that is messing it up.

Also, for everyone else, how cool is it that you can see the total lack of information we have about that huge block of heterochromatin on the distal arm of the Y chromosome?!

As a thought can we check the positions of the hg38 chrY versus the 1000genomes Y sequence? It may be that for producing a "mask", the hg38 might be good for our purposes.

Best, Melissa

mathbionerd commented 7 years ago

Testing out lastZ on my local server.

First, grabbing hg38 versions of chromosome X and Y (perhaps there is something odd about the other versions we're using):

 wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz' -O chrY.fa.gz
 wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrX.fa.gz' -O chrX.fa.gz

And unzip: gunzip chrX.fa.gz gunzip chrY.fa.gz

tanyaphung commented 7 years ago

So I first did: grep ">" GRCh38_full_analysis_set_plus_decoy_hla.fa > description_GRCh38_full_analysis_set_plus_decoy_hla.fa

to grep out the description lines.

Then, to grep out chrY sequences, which is between the lines:

chrY AC:CM000686.2 gi:568336000 LN:57227415 rl:Chromosome M5:ce3e31103314a704255f3cd90369ecce AS:GRCh38 hm:10001-2781479,56887903-57217415

and:

chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion M5:c68f52674c9fb33aef52dcf399755519 AS:GRCh38 tp:circular

I ran a simple sed command:

sed -n '/>chrY AC:CM000686.2 gi:568336000 LN:57227415 rl:Chromosome M5:ce3e31103314a704255f3cd90369ecce AS:GRCh38 hm:10001-2781479,56887903-57217415/,/>chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion M5:c68f52674c9fb33aef52dcf399755519 AS:GRCh38 tp:circular/p' GRCh38_full_analysis_set_plus_decoy_hla.fa > chrY_GRCh38_full_analysis_set_plus_decoy_hla.fa

That should give me a file that looks like:

chrY AGTGCTGGAT...... chrM

grep -v ">chrM" chrY_GRCh38_full_analysis_set_plus_decoy_hla.fa > chrYClean_chrY_GRCh38_full_analysis_set_plus_decoy_hla.fa

gives me a chrY sequences only.

I have uploaded all these files to /home/msayres on orca.

I think perhaps I downloaded the wrong reference sequence from 1000 G? The name plus_decoy_hla is suspicious...

mathbionerd commented 7 years ago

It looks like the 1000genomes Y chromosome reference has the PARs masked out (not sure about the XTR), and may have some extra characters (non-A|T|G|C|N). Now the team is plotting A|T|G|C (0), N (1), non-A|T|G|C|N (2) across the 1000genomes Y reference and the hg38 reference, and re-running lastZ.

tanyaphung commented 7 years ago

Here is the alignment plot when I ran lastZ on the UCSC GRch38 chrY and the 1000G chrY: ucscgrch38chry_1000gchry

tanyaphung commented 7 years ago

Running lastZ on 1000G chrX (or chrY) with the standard flag results in an error:

FAILURE: bad fasta character in /u/scratch/p/phung428/hackseq/data/thousandG/chrXClean_GRCh38_full_analysis_set_plus_decoy_hla.fa, >chrX (uppercase Y) remove or replace non-ACGTN characters or consider using --ambiguous=iupac

ekarlins commented 7 years ago

Here are the base counts in the two versions of chrY fasta files.

UCSC: {'a': 4892104, 'A': 2994088, 'C': 1876822, 'g': 3397589, 'G': 1889305, 'c': 3408967, 'N': 30812232, 't': 4953284, 'n': 140, 'T': 3002884}

1000G: {'A': 7155845, 'C': 4632232, 'T': 7217789, 'G': 4630489, 'N': 33591025}

plots across chrom are taking forever. I need to rethink how I'm plotting.

Eric

mathbionerd commented 7 years ago

Just summing up, so we can compare the references easier:

Base 1000G UCSC
A/a 7155845 7886192
C/c 4632232 5285789
T/t 7217789 7956168
G/g 4630489 5286894
N/n 33591025 30812372
Total_nonN 23636355 26415043
Total 57227380 57227415
ekarlins commented 7 years ago

chrY_fasta_N.pdf

ekarlins commented 7 years ago
chry_n
tanyaphung commented 7 years ago

Thanks, Eric, for making this figure. This confirmed what Melissa was saying that the 1000G reference masks the PAR regions.

So I think if I removed the PAR regions first on the 1000G chrY, and then run lastZ, it should work.

tanyaphung commented 7 years ago

lastZ alignment between UCSC chrX and UCSC chrX: ucscgrch38chrx_ucscgrch38chrx

lastZ alignment between UCSC chrX and UCSC chrY: ucscgrch38chrx_ucscgrch38chry