Run LASTZ on all possible combinations of chromosomes X and Y

BrunoGrandePhD commented 7 years ago

http://www.bx.psu.edu/~rsharris/lastz/

tanyaphung commented 7 years ago

Reference genome is downloaded from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

The file is saved in the directory /hackathon/Hackathon_Project_6/DOWNLOADS/GRCh38

tanyaphung commented 7 years ago

Download the newest version of lastZ: http://www.bx.psu.edu/~rsharris/lastz/newer/lastz-1.03.73.tar.gz

Documentation: http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#fmt_rdotplot

ekarlins commented 7 years ago

cd /hackathon/Hackathon_Project_6/DOWNLOADS/Software wget http://www.bx.psu.edu/~rsharris/lastz/newer/lastz-1.03.73.tar.gz tar -xzf lastz-1.03.73.tar.gz

tanyaphung commented 7 years ago

I just did a quick check on running lastZ on chrY downloaded from the UCSC browser http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz, and the plot looks like something I would expect: ucscgrch38_chry_chry

So my guess is that there is something weird going on with the reference Y chromosome file from the 1000 G website? ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

Thoughts?

thw17 commented 7 years ago

It's really odd that the 1000 genomes Y is so different. I really don't know what to make of the 1000 genomes version, but completely agree that the UCSC Y looks like what we would expect.

ekarlins commented 7 years ago

Tanya, I think you told me you used sed to extract chrY from the larger fasta. Is it possible you grabbed some other contigs as well. If you had those HLA contigs, for example, I would imagine you'd get some strange results.

Can you point us to your chrY fasta files and send the code you used to generate them?

Thanks, Eric

mathbionerd commented 7 years ago

Hi Tanya,

I did the same thing last night (running lastZ self alignment on the chrY from hg38 from UCSC) and got the same results. I agree with Eric, perhaps the 1000 genomes version has extra contigs in it that is messing it up.

Also, for everyone else, how cool is it that you can see the total lack of information we have about that huge block of heterochromatin on the distal arm of the Y chromosome?!

As a thought can we check the positions of the hg38 chrY versus the 1000genomes Y sequence? It may be that for producing a "mask", the hg38 might be good for our purposes.

Best, Melissa

mathbionerd commented 7 years ago

Testing out lastZ on my local server.

First, grabbing hg38 versions of chromosome X and Y (perhaps there is something odd about the other versions we're using):

 wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz' -O chrY.fa.gz
 wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrX.fa.gz' -O chrX.fa.gz

And unzip: gunzip chrX.fa.gz gunzip chrY.fa.gz

tanyaphung commented 7 years ago

So I first did: grep ">" GRCh38_full_analysis_set_plus_decoy_hla.fa > description_GRCh38_full_analysis_set_plus_decoy_hla.fa

to grep out the description lines.

Then, to grep out chrY sequences, which is between the lines:

chrY AC:CM000686.2 gi:568336000 LN:57227415 rl:Chromosome M5:ce3e31103314a704255f3cd90369ecce AS:GRCh38 hm:10001-2781479,56887903-57217415

and:

chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion M5:c68f52674c9fb33aef52dcf399755519 AS:GRCh38 tp:circular

I ran a simple sed command:

sed -n '/>chrY AC:CM000686.2 gi:568336000 LN:57227415 rl:Chromosome M5:ce3e31103314a704255f3cd90369ecce AS:GRCh38 hm:10001-2781479,56887903-57217415/,/>chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion M5:c68f52674c9fb33aef52dcf399755519 AS:GRCh38 tp:circular/p' GRCh38_full_analysis_set_plus_decoy_hla.fa > chrY_GRCh38_full_analysis_set_plus_decoy_hla.fa

That should give me a file that looks like:

chrY AGTGCTGGAT...... chrM

grep -v ">chrM" chrY_GRCh38_full_analysis_set_plus_decoy_hla.fa > chrYClean_chrY_GRCh38_full_analysis_set_plus_decoy_hla.fa

gives me a chrY sequences only.

I have uploaded all these files to /home/msayres on orca.

I think perhaps I downloaded the wrong reference sequence from 1000 G? The name plus_decoy_hla is suspicious...

mathbionerd commented 7 years ago

It looks like the 1000genomes Y chromosome reference has the PARs masked out (not sure about the XTR), and may have some extra characters (non-A|T|G|C|N). Now the team is plotting A|T|G|C (0), N (1), non-A|T|G|C|N (2) across the 1000genomes Y reference and the hg38 reference, and re-running lastZ.

tanyaphung commented 7 years ago

Here is the alignment plot when I ran lastZ on the UCSC GRch38 chrY and the 1000G chrY: ucscgrch38chry_1000gchry

tanyaphung commented 7 years ago

Running lastZ on 1000G chrX (or chrY) with the standard flag results in an error:

FAILURE: bad fasta character in /u/scratch/p/phung428/hackseq/data/thousandG/chrXClean_GRCh38_full_analysis_set_plus_decoy_hla.fa, >chrX (uppercase Y) remove or replace non-ACGTN characters or consider using --ambiguous=iupac

ekarlins commented 7 years ago

Here are the base counts in the two versions of chrY fasta files.

UCSC: {'a': 4892104, 'A': 2994088, 'C': 1876822, 'g': 3397589, 'G': 1889305, 'c': 3408967, 'N': 30812232, 't': 4953284, 'n': 140, 'T': 3002884}

1000G: {'A': 7155845, 'C': 4632232, 'T': 7217789, 'G': 4630489, 'N': 33591025}

plots across chrom are taking forever. I need to rethink how I'm plotting.

Eric

mathbionerd commented 7 years ago

Just summing up, so we can compare the references easier:

Base	1000G	UCSC
A/a	7155845	7886192
C/c	4632232	5285789
T/t	7217789	7956168
G/g	4630489	5286894
N/n	33591025	30812372
Total_nonN	23636355	26415043
Total	57227380	57227415

ekarlins commented 7 years ago

chrY_fasta_N.pdf

ekarlins commented 7 years ago

tanyaphung commented 7 years ago

Thanks, Eric, for making this figure. This confirmed what Melissa was saying that the 1000G reference masks the PAR regions.

So I think if I removed the PAR regions first on the 1000G chrY, and then run lastZ, it should work.

tanyaphung commented 7 years ago

lastZ alignment between UCSC chrX and UCSC chrX: ucscgrch38chrx_ucscgrch38chrx

lastZ alignment between UCSC chrX and UCSC chrY: ucscgrch38chrx_ucscgrch38chry

hackseq / 2016_project_6

Run LASTZ on all possible combinations of chromosomes X and Y #7