Open mlin opened 10 years ago
Quick answers first:
Specifically which GRCh38 sequences were used for mapping? (There is a file called "no_alt_analysis_set", which masks centromeres, PAR alleles, adds an EBV, does not include alternative haplotypes, etc...is it that set, or something else?)
I was using the no_alt_analysis_set. In the manuscript, I said "primary assembly", which by definition excludes alternative haplotypes.
Can you provide the LCR annotations for 38?
Just added.
(Fig 6) how does this look with all filters applied?
I will try later. I would expect most hets to be filtered out by the max depth filter.
(Fig 6) re: the breakdown of variants lifted over to 37, it would be useful to specify the fraction of autosomal 38 positions that don't lift over to 37 and that lift non-syntenically. The relative proportion could be compared to what's seen here.
If I understand your question correctly, the answer is in ref.els: 5967 GRCh38 hets are lifted to other chromosomes or unlocalized contigs in GRCh37; 66994 cannot be lifted. 73.0k in the figure equals 5967+66994.
In Fig 1 the axis scale of the bottom left panel is easy to miss and hugely affects the interpretation. At minimum please put more visual cues of the scale (e.g. a 'broken axis' icon) or perhaps reconsider zooming in at all.
Perhaps I may add a note in the figure caption. Thanks.
As you know, there's WGS available for numerous relatives of NA12878. It would be great to see the extent to which NA12878's variants are detected in her parents, comparing those that pass filters to those that fail. This would provide a nice refinement of the use of NA12878 variants as a notion of "sensitivity."
I am not sure this improves the measurement of sensitivity. When calling the trio, we will also miss variants in the parents, which in turn affects the sensitivity measurement. Analyzing the parents also raises many other questions and potential issues (e.g. more data analysis, genotype calling and trio consistency) that we cannot discuss at length due to the page limit.
We're curious to see how GATK's VQSR affects the usefulness of the QU filter for CHM1.
I have seen a few VQSR call sets for NA12878. Most of them are conservative: high specificity but low sensitivity. I guess it will be the same for CHM1. For a few high-coverage samples, hand filtering is probably sufficient and might be better. Because it is trained on good sites, VQSR might be biased against variants in hard regions. I am also not sure whether the low heterozygosity of CHM1 will affect the performance of VQSR.
For the DP filter, we're curious about the origin of the square-root-based formula - Poisson standard deviation?
Yes.
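To illustrate the Poisson reasoning: if read depth at a site is roughly Poisson with mean d, its standard deviation is sqrt(d), so a maximum-depth cutoff can be written as d plus some number of standard deviations. The sketch below is only an illustration of that idea. The function names and the multiplier `k` are assumptions made here, and `k` is left as an explicit argument because the exact value used in the manuscript is not stated in this thread.

```python
import math


def max_depth_threshold(mean_depth: float, k: float) -> float:
    """Upper depth cutoff: mean depth plus k Poisson standard deviations.

    `k` is a hypothetical tuning parameter, not a value taken from the manuscript.
    """
    if mean_depth < 0:
        raise ValueError("mean_depth must be non-negative")
    # For a Poisson variable the variance equals the mean, so sd = sqrt(mean).
    return mean_depth + k * math.sqrt(mean_depth)


def passes_depth_filter(site_depth: float, mean_depth: float, k: float) -> bool:
    """Return True if the site's depth does not exceed the cutoff."""
    return site_depth <= max_depth_threshold(mean_depth, k)
```

For example, with a mean depth of 36 and `k = 4`, the cutoff is 36 + 4 * 6 = 60, so a site covered by 61 reads would be filtered out.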
(3.3.2) Following your discussion comparing the aligners, do you have any comments comparing and contrasting the callers and/or the complete pipelines? Tell us what you really think :)
I am still a SAMtools developer and I am sitting close to the GATK developers. It is hard for me to give an unbiased view. Check the figures and draw your own conclusions. :)
(3.3.3) Please briefly mention the theoretical principles you had in mind which connect inbreeding coefficient and H-W p-value with CNVs and reference artifacts - excess of heterozygotes?
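Yes, a negative inbreeding coefficient implies an excess of heterozygotes. Because 1000g consists of samples from many populations, the Hardy-Weinberg violations we expect from population structure show an excess of homozygotes. An excess of heterozygotes is therefore more likely to indicate an artifact.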
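As a rough illustration of the sign convention, the inbreeding coefficient at a biallelic site can be estimated from genotype counts with the textbook formula F = 1 - H_obs / H_exp, where H_exp = 2p(1-p) under Hardy-Weinberg. This is a minimal sketch, not the filter used in the manuscript, and the function name is made up for the example.

```python
from typing import Optional


def inbreeding_coefficient(n_hom_ref: int, n_het: int, n_hom_alt: int) -> Optional[float]:
    """Estimate F = 1 - H_obs / H_exp from genotype counts at one biallelic site.

    Returns None for a monomorphic site, where F is undefined.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes at this site")
    # Reference allele frequency from the genotype counts.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    expected_het = 2 * p * (1 - p)  # Hardy-Weinberg expectation
    if expected_het == 0:
        return None
    observed_het = n_het / n
    return 1 - observed_het / expected_het
```

Genotype counts of (25, 50, 25) match Hardy-Weinberg proportions and give F = 0. A site where every sample is called heterozygous, e.g. (0, 100, 0), gives F = -1, the pattern one might expect when reads from a collapsed duplication pile up on one locus.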
Typo at the bottom of pg 7, 'sort' => 'short'
I have fixed that. Thanks.
We're curious to see how GATK's VQSR affects the usefulness of the QU filter for CHM1.
In BGI's recent preprint, Table 4 shows that UG+VQSR called 3.16M SNPs from the same NA12878 data set. On this data set, we should call 3.4-3.6M SNPs instead. VQSR is too conservative for high-coverage calling, which is in line with my experience.
Dear Heng,
We had the pleasure of discussing your important preprint at our journal club last week. Here's some feedback we collected.
First and foremost, we deeply implore you to provide more-specific conclusions & guidance on the best use of GRCh38 in WGS pipelines. This is an immediate issue in the field, and every bit of preliminary information provides a valuable touchstone at this point. A few specific questions:
In Fig 1 the axis scale of the bottom left panel is easy to miss and hugely affects the interpretation. At minimum please put more visual cues of the scale (e.g. a 'broken axis' icon) or perhaps reconsider zooming in at all.
As you know, there's WGS available for numerous relatives of NA12878. It would be great to see the extent to which NA12878's variants are detected in her parents, comparing those that pass filters to those that fail. This would provide a nice refinement of the use of NA12878 variants as a notion of "sensitivity."
We're curious to see how GATK's VQSR affects the usefulness of the QU filter for CHM1.
For the DP filter, we're curious about the origin of the square-root-based formula - Poisson standard deviation?
(3.3.2) Following your discussion comparing the aligners, do you have any comments comparing and contrasting the callers and/or the complete pipelines? Tell us what you really think :)
(3.3.3) Please briefly mention the theoretical principles you had in mind which connect inbreeding coefficient and H-W p-value with CNVs and reference artifacts - excess of heterozygotes?
Typo at the bottom of pg 7, 'sort' => 'short'
Best, Mike Lin on behalf of the DNAnexus science team