Reading current methods

mathbionerd commented 8 years ago

Howdy all,

The broad goal of this hackseq group is, "Inferring sex chromosome and autosomal ploidy in NGS data".

There are a lot of people thinking about this, and in preparation for our hack-a-thon, please read and share current similar methods.

One to read through was just posted on BioRxiv:

GenomeScope: Fast reference-free genome profiling from short reads http://biorxiv.org/content/early/2016/09/19/075978

Implementation: http://qb.cshl.edu/genomescope/, https://github.com/schatzlab/genomescope.git

Best, Melissa

mathbionerd commented 8 years ago

Also, for our reference, here is a great source of data for us to be able to download and play with: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/

thw17 commented 8 years ago

Hi all,

Here's another paper/tool:

ConPADE: Genome Assembly Ploidy Estimation from Next-Generation Sequencing Data http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004229

Cheers, Tim

ekarlins commented 8 years ago

There are also tools like ABSOLUTE: http://archive.broadinstitute.org/cancer/cga/ABSOLUTE that look at ploidy in tumor DNA.

I guess I'm a little unclear on the goals here. Some of these papers are about how to do this for organisms without reference genomes, but the 1000 genome test data seems to be human germline DNA. Would our approach be different depending on if the organism has a reference genome? If there's a reference genome does that mean that ploidy is known, or is this a type of variation that we see within species?

I should note that I've really only analyzed mammalian DNA and mostly human, so if there's a good resource on this across other species please send it my way.

Thanks, Eric

mathbionerd commented 8 years ago

The original application was general to allow us, as a group, to decide on what is the most attainable in the time we have. I'm hoping we can decide, based on literature available, whether to focus on a method for reference-based or de novo.

I'm leaning towards reference-based. But potentially without making pre-existing assumptions of ploidy (so it could work for sex chromosomes where sex is not known - or an individual may be chromosomally intersex, or for high ploidy tumor DNA).

ekarlins commented 8 years ago

Thanks Melissa! If you have a good reference for the biology please pass that along as well. Maybe a review article on the topic? This isn't something I've thought much about in the past, so I don't really know how much ploidy varies between species, within species, between chromosomes of an individual.

mathbionerd commented 8 years ago

I don't know of a good introduction to ploidy across the tree of life.

My motivation is to better characterize sex chromosome copy number variation. In particular, it has been estimated that 1/2500 AFAB (assigned female at birth) people have a single X chromosome, while between 1/500 and 1/1000 AMAB (assigned male at birth) people have two X chromosomes and a Y chromosome. But we don't know the extent of this (whole chromosome or partial), and as genomic datasets continue to grow (e.g. 90,000 exomes in ExAC), we now have the opportunity to address this question.

But... the sex chromosomes have an extra challenge that the non-sex chromosomes (autosomes) don't have, which is that while they were once indistinguishable, they have now diverged from each other - we call them gametologous (and call individual homologous X-Y gene pairs, "gametologs"). These gametologs, although they are now distinct genes, can share significant sequence similarity that can confound alignment and assembly programs (less so for the more anciently diverged, more so for the very recently diverged X-Y pairs).

So, let's coalesce on the idea of working on a program to quickly infer copy number from exome data (because it is smaller and we can work with it - it should be extendable to WGS) on the sex chromosomes.

This can then be extended also to non-sex linked regions, but especially to those with paralogous sequence across the genome.

We only have one AWS code allocated to us - that's my fault. So working with exome data will also be better on that front.

We'll start by using data from the 1000 genomes project (at least one genetic male and one genetic female exome sample): http://www.internationalgenome.org/data

ekarlins commented 8 years ago

Thanks Melissa! That makes our objectives here a bit clearer. I wasn't sure where the line was between "ploidy" and "copy number variation".

Working on methods for CNV discovery from exome seq is a very useful task. There are certainly a lot of groups with exome data that could benefit. This is not the best platform for CNV discovery, however. Because exome capture relies so heavily on PCR, the different amplification efficiencies across the exome make depth-based methods less accurate. The "probe" spacing for exomeSeq is uneven across the genome, which makes it challenging to use insert-size methods, since the CNV break points are more likely to fall outside of exonic regions.

CNV discovery from exomeSeq has its own challenges, so I don't think it would be appropriate to extend exome CNV methods to WGS. PCR-free WGS is much more reliable than exomeSeq for CNV discovery because it lacks the issues I mentioned previously. So WGS CNV discovery shouldn't be penalized by the steps that are necessary for exome CNV discovery.

For the sex chromosomes it sounds like there are unique challenges that might make it hard to extend these methods to autosomes as well. Not that I know of an appropriate dataset, but it seems like longer reads, like PacBio could be really useful here.

I guess I'm arguing for choosing one specific aim without the expectation of being able to extend it to data from other platforms or even to extend sex chromosome methods to autosomes. I also think we need to decide if our main objective is using ExomeSeq for CNV discovery, or if we want to use the best possible data type to get at your questions about CNVs/ploidy on the sex chromosomes. I guess access to data from more samples could sway us towards exome, though as far as I know we only can get the variant calls from ExAC, not the sequence data.

Best, Eric

ekarlins commented 8 years ago

Also, as far as resources go, are we restricted to just using the AWS nodes provided? I have access to a couple compute clusters at the NIH. Am I allowed to run jobs there for this project? Or would that be unfair to other groups?

tanyaphung commented 8 years ago

I'm new to this topic, and I found this review pretty helpful, even though it focuses on CNV detection in cancer genomes: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3875755/.

Melissa, from what you said with regard to how genes on the X and Y, even though they are now distinct genes, can have high sequence similarity and can therefore confound assembly and alignment, I guess we should think about how to accurately assign whether a read comes from the X chromosome or from the Y chromosome, right? Or did I completely misinterpret what you meant?

Thanks, Tanya

thw17 commented 8 years ago

Hi all,

I agree with Eric on exome sequencing providing unique challenges to infer copy number variation (probe spacing, capture efficiency, PCR amplification bias, etc.). In addition, there are many tools already available for CNV detection in exome sequencing data (they're of mixed quality; some do a fairly good job overcoming challenges of exome data, while others don't perform as well).

However, if our goal is to quickly infer sex chromosome state/ploidy, I think that this is a slightly different question. Here, rather than looking for local changes in copy number, we're simply trying to count the number of X and Y chromosomes, right? If this is the case, looking across the entire chromosome might be robust to the exome challenges discussed above (e.g., by averaging out effects), and our main issue might instead be accurately assigning reads to the X and Y chromosome (as Tanya and Melissa mentioned).
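As a rough illustration of the whole-chromosome counting idea, here is a minimal sketch. It assumes we already have mean read depths over chrX, chrY, and the autosomes (the function name, argument names, and example numbers are all hypothetical), and that autosomes are diploid. Capture efficiency can differ between X, Y, and autosomal targets, so the raw ratios would probably need calibration against samples of known karyotype.

```python
def infer_sex_chromosome_counts(depth_x, depth_y, depth_autosome, autosome_ploidy=2):
    """Estimate whole-chromosome X and Y counts from mean depths.

    Assumption: the three depths are mean read depths over the targeted
    sites of chrX, chrY, and the autosomes, ideally with PAR and
    ampliconic regions excluded before averaging.
    """
    if depth_autosome <= 0:
        raise ValueError("autosomal depth must be positive")
    # Depth contributed by a single chromosome copy.
    per_copy = depth_autosome / autosome_ploidy
    x_copies = round(depth_x / per_copy)
    y_copies = round(depth_y / per_copy)
    return x_copies, y_copies


# Made-up depths for illustration only: (chrX, chrY, autosomes).
examples = {
    "XX-like": (30.2, 0.3, 30.0),
    "XY-like": (15.1, 14.6, 30.0),
    "X-like": (14.8, 0.2, 30.0),
    "XXY-like": (29.5, 15.3, 30.0),
}
for label, depths in examples.items():
    print(label, infer_sex_chromosome_counts(*depths))
```

Rounding to the nearest integer hides partial gains or losses, so this only addresses the "count whole chromosomes" version of the question.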

Exome sequencing does have the benefit of being much, much quicker to run through (something like 1-2 million sites on the X and Y chromosomes vs. something like 200 million sites on the X and Y with WGS data), as I'm assuming time will be a pretty major issue for us.

Cheers, Tim

mathbionerd commented 8 years ago

Counting X and Y chromosomes, full and partial (if we expect that in some relatively high proportion of people, they will have partial Y-linked sequence, or even partial X).

In terms of how much one code is worth in computing hours - The price per hour depends on the size of the machine. An m4.4xlarge machine with 16 cores and 64 GB of RAM is roughly $1 / hour. You can get more details if you take a look at the pricing schedule: https://aws.amazon.com/ec2/pricing/on-demand/

From the organizers: With regards to how many people can use AWS - The project leader will start up the machine(s) and then any number of participants can log into that machine. You will pay per machine (not per user logged in).

Yes, given we aren't working with controlled access data, we can each run things at our own local institutional HPC.

mathbionerd commented 8 years ago

Also, everyone, just FYI, from the hackseq organizers:

In addition to these final presentations, we would like to encourage teams to use the F1000Research Hackathon channel to publish the results of their projects. We will be giving each team up to $200 to cover article processing fees if teams submit their manuscripts before November 15th.

Phillip-a-richmond commented 8 years ago

I finally have some time to look into this project, and I'm a bit unclear about what exactly our goals are.

My experience in inferring ploidy states from reference-mapped data utilizes depth-of-coverage and allelic frequencies to identify a chromosomal copy number. This is all based on aneuploid/polyploid Saccharomyces cerevisiae whole genome sequencing (WGS) data.

I've attached a few plots: Coverage-approach where you can see deviations from the baseline on a log-scaled Y-axis, using a pure haploid background to compare against. Each chromosome is colored differently, and each point corresponds to a window in the yeast genome. You can see some evidence for segmental duplications/alterations as well.
[attached plot: 1056averyscatter]
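The coverage comparison described above can be sketched roughly as follows. This is not the code behind the attached plot; it assumes two hypothetical lists of per-window depths (sample and haploid baseline, same windows in the same order) and normalizes each by its own median so that library size cancels out.

```python
import math
from statistics import median


def normalized_log2_ratios(sample_depths, baseline_depths):
    """Per-window log2(sample / baseline) after median normalization.

    Windows with zero depth in either track are reported as None
    rather than as an infinite or undefined ratio.
    """
    if len(sample_depths) != len(baseline_depths):
        raise ValueError("both tracks must cover the same windows")
    sample_median = median(sample_depths)
    baseline_median = median(baseline_depths)
    if sample_median <= 0 or baseline_median <= 0:
        raise ValueError("median depth must be positive")
    ratios = []
    for s, b in zip(sample_depths, baseline_depths):
        if s <= 0 or b <= 0:
            ratios.append(None)
        else:
            ratios.append(math.log2((s / sample_median) / (b / baseline_median)))
    return ratios


# Toy example: the last two windows have twice the typical sample depth.
print(normalized_log2_ratios([20, 20, 20, 40, 40], [10, 10, 10, 10, 10]))
```

A window sitting near 0 matches the sample's typical copy number, and a window near 1 has roughly double that, which is the kind of deviation from baseline the scatter plot shows.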

Allele frequency plotted as a histogram across the entire chromosome. You can infer the different ploidies based on the number of peaks:

Diploid = 1 peak, normally distributed around 50% minor allele freq. (coincidentally enough I didn't have diploid yeast data)

Triploid = 2 peaks, 33% and 66% minor allele freq [attached screenshot]

Tetraploid = 3 peaks, 25%, 50%, 75% [attached screenshot]

Pentaploid = 4 peaks [attached screenshot]

Hexaploid = 5 peaks!!! [attached screenshot]

I'm sure that these techniques can be expanded to work in larger genomes if the project is intended for polyploid analysis. If we are exploring human data, then we could leverage the allele frequencies and depth of coverage from reference-mapped data to infer ploidy state.
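The peak-counting rule above generalizes to heterozygous peaks at k/n for a ploidy of n. A minimal sketch of matching observed histogram peaks to that expectation is below; the function names and the simple distance score are assumptions for illustration, not an established method, and peak calling from the histogram itself is left out.

```python
def expected_peaks(ploidy):
    """Expected heterozygous allele-frequency peaks: k/ploidy for k = 1..ploidy-1."""
    if ploidy < 2:
        raise ValueError("ploidy must be at least 2 to see heterozygous peaks")
    return [k / ploidy for k in range(1, ploidy)]


def _mean_nearest_distance(points, targets):
    # Average distance from each point to its closest target.
    return sum(min(abs(p - t) for t in targets) for p in points) / len(points)


def best_matching_ploidy(observed_peaks, candidates=range(2, 7)):
    """Pick the candidate ploidy whose expected peaks best match the observed ones.

    The score is symmetric: observed peaks should sit near expected peaks,
    and expected peaks should not be missing from the observations.
    """
    if not observed_peaks:
        raise ValueError("need at least one observed peak")
    scores = {}
    for ploidy in candidates:
        expected = expected_peaks(ploidy)
        scores[ploidy] = (_mean_nearest_distance(observed_peaks, expected)
                          + _mean_nearest_distance(expected, observed_peaks))
    return min(scores, key=scores.get)


print(expected_peaks(4))
print(best_matching_ploidy([0.33, 0.66]))
```

The symmetric score matters because a hexaploid also has peaks near 33% and 66%; penalizing the expected peaks that were never observed keeps the triploid call from being confused with it.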

ekarlins commented 8 years ago

Good discussion everyone! Melissa, when you say partial chromosomes, are there predefined regions that make sense? PAR vs not? Or is there a (large) size cutoff that seems reasonable to you? I agree with Tim's point that if we want to use exomeSeq we'd benefit from taking into account the whole chromosome, or at least a very large chunk of it.

While I definitely agree that using exome vs WGS will be a lot quicker, I'd prefer to have more discussion on this before making this decision. I don't know what the probe spacing really looks like on the sex chromosomes, for example. And we're still refining our goals for what size CNV we want to detect. Plus, without planning methods it's hard to know what we need for resources, and if each one of us has an HPC we can use, maybe that's not the rate-limiting step here.

I also think Tanya's point about alignment of X and Y could use further discussion. Is using an alignment method designed for the autosomes appropriate for the sex chromosomes? Or is there a better way? I know we have limited time here, so doing a new alignment likely won't be practical. But I'd like to understand what the perfect world scenario is to appreciate how our approach is limited.

When do people get into town? I get in tomorrow afternoon and am staying pretty much on campus.

Eric

Madelinehazel commented 8 years ago

Re: sequence similarity between the X and Y chromosomes, here's a review that covers how various alignment tools and de novo assemblers deal with genomic repeats (i.e. sequences that appear twice or more in the genome): https://www.ncbi.nlm.nih.gov/pubmed/22124482 It clarifies the benefits and limitations of an alignment-based approach vs de novo assembly with respect to repetitive sequences, and also lists the repeat-relevant parameters for the tools discussed.

I'm a grad student at UBC, so I'm already in Vancouver :+1:

mathbionerd commented 8 years ago

Howdy all,

Yes, exome vs WGS is a good decision to make, and I agree, now that I think about all the local resources we have access to, that we shouldn't be worried about space. The probes are actually not terrible for X and Y, but we could do WGS from already aligned BAMs, and then just extract X and Y.

The problem space here is quite large, and I was hoping, based on expertise in the group, to narrow it down. That said, what I know about is sex chromosomes, and so I would like us to focus on those.

No, I didn't mean PAR vs not, although that is a biologically meaningful pair of regions to take into account. I'm particularly interested to see how often we observe both complete and partial X/Y chromosome variations in human populations (but this is also affected by our ability to not be confounded by the ampliconic/palindromic regions on the sex chromosomes).

I can appreciate that exome data does not have as high a density of coverage as one might like. There are, as was mentioned, many challenges with alignment that I don't think we have the time to dedicate to during these three days. So, what do you all think of extracting X and Y from the already aligned BAMs, and then working with these, to see if we can assess X-Y misalignment and (for argument's sake) depth across these two chromosomes?
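For the extraction step, one option is `samtools view` with region arguments. The sketch below only builds and optionally runs that command; the file names are placeholders, and the contig names `X` and `Y` are an assumption (some references use `chrX` and `chrY`, so check the BAM header first). The input BAM needs an index for region queries to work.

```python
import subprocess


def xy_extract_command(in_bam, out_bam, contigs=("X", "Y")):
    """Build a samtools command that writes only the given contigs to out_bam.

    -b asks for BAM output and -o names the output file; the contigs are
    passed as region arguments after the input file.
    """
    return ["samtools", "view", "-b", "-o", out_bam, in_bam, *contigs]


def extract_xy(in_bam, out_bam, contigs=("X", "Y")):
    # Requires samtools on PATH and an index (.bai) next to in_bam.
    subprocess.run(xy_extract_command(in_bam, out_bam, contigs), check=True)


# Placeholder file names, shown without running samtools.
print(" ".join(xy_extract_command("sample.bam", "sample.XY.bam")))
```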

Then we can have three increasing goals:

1. Assess proportion of X-Y mis-alignment (and try to correct for this)
2. Infer total X and Y ploidy
3. Infer copy number variation across the X and Y (because I think we have to address this if we want to get a really good handle on #2, given the extremely high copy number variable regions on X and Y - the ampliconic regions. Likely we will be masking them out to infer #2, which will be easiest, but then we can have an extended goal to characterize variation in these regions).
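To make the first two goals a bit more concrete, here is a minimal sketch under stated assumptions. It treats the fraction of low-MAPQ reads as a rough proxy for ambiguous X-Y placement (many aligners give low mapping quality to reads that fit more than one location almost equally well), and it averages depth only outside masked intervals. The threshold of 10, the input layouts, and the function names are all hypothetical choices, not a settled method.

```python
def low_mapq_fraction(mapqs, threshold=10):
    """Fraction of reads whose mapping quality is below the threshold."""
    if not mapqs:
        raise ValueError("no reads supplied")
    return sum(q < threshold for q in mapqs) / len(mapqs)


def mean_depth_outside_mask(depths, masked_intervals):
    """Mean depth over positions not covered by any masked interval.

    depths: per-position depths, indexed from 0.
    masked_intervals: (start, end) pairs, 0-based and half-open,
    e.g. hypothetical coordinates of ampliconic regions.
    """
    masked = set()
    for start, end in masked_intervals:
        masked.update(range(start, end))
    kept = [d for i, d in enumerate(depths) if i not in masked]
    if not kept:
        raise ValueError("every position is masked")
    return sum(kept) / len(kept)


# Toy data: half the reads are ambiguous; positions 2-3 are a high-copy region.
print(low_mapq_fraction([0, 0, 60, 60]))
print(mean_depth_outside_mask([10, 10, 100, 100, 10], [(2, 4)]))
```

Building a set of masked positions is fine for a toy example but would be too memory-hungry for whole chromosomes; an interval-based approach would be the obvious replacement there.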

If you are unfamiliar with the X and Y chromosomes, here are two classic papers:

chromosome X: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665286/

chromosome Y: http://www.nature.com/nature/journal/v423/n6942/full/nature01722.html

ekarlins commented 8 years ago

Poznik2013.pdf
