Update yeast analysis - Githubissues

eldariont commented 5 years ago

Hi, it took me longer than expected due to multiple reasons but here are the new yeast results.

Method: I updated the evaluation method so that only SVs (minimum size 50bp) in the callsets are used to construct the sample graph. I also slightly adapted Jean's method of zooming in on the SV regions. Besides the two sample graphs for the VCF and cactus graphs, I generated an "empty sample graph" that contains simply the linear reference genome. I mapped the short reads to all three graphs and for each read compared percent identity and mapq on the three graphs. I filter out those reads that have the same id across all three graphs and the same mapq. These are reads that map with exactly the same characteristics to the linear graph as to the sample graphs which indicates that they come from a region of the genome that is identical to the reference. The remaining reads should be from SV regions.

Results: Panel 4 (SV genotyping results for graphs from 5 selected strains): panel4

In the results, we see a few good trends:

Cactus graph reaches overall higher percent identity (mean improvement in percent identity = 8.5%)
Improvement in percent identity between both graphs was larger for strains in the S. paradoxus clade (between 7% and 20%) than for strains in the S. cerevisiae clade (between 1% and 12%)
Generally lower percent identity for strains in the S. paradoxus clade on both sample graphs
Higher average mapping quality of the short reads for all strains on the cactus sample graph compared to the VCF sample graph (mean improvement in mapping quality = 5.7)

One thing that I'm a bit worried about is the comparison with the graphs from all of the strains.

Panel 6 (SV genotyping results for graphs from all 12 strains): panel6

Some strains like SK1 reach worse percent identity on the all-sample graphs than on the five-sample graphs. I think that this is due to the new evaluation method which computes percent identity and mapq on a different subset of reads for each strain and each strain subset (i.e. five-sample or all-sample). So different subsets of the SK1 short reads are selected using the filtering criteria described above (non-equal perc. id and mapq on the three sample graphs) for the five-sample graph and the all-sample graph, respectively. That means that the results from panels 4 and 6 are not comparable.

I'm also wondering whether it makes sense at all to plot the average perc. identity (and mapq) after applying the filtering criteria because we are only looking at averages over a subset of the reads. Plotting them suggests that the averages have a meaning which they not really have. Instead we could plot the average difference in perc. identity and mapq between both graphs which, in my opinion, makes more sense. What do you think about this, @jmonlong and @glennhickey?

ekg commented 5 years ago

I filter out those reads that have the same id across all three graphs and the same mapq. These are reads that map with exactly the same characteristics to the linear graph as to the sample graphs which indicates that they come from a region of the genome that is identical to the reference.

I'm concerned that this will distort your averaged results for identity and mapping quality. Including these would tend to pull the estimates closer to the y=x line on the plots. That does make it harder to see things, but it's easier to defend. The fact that this data is subset in this way needs to be made very clear. Is it the case that leaving these in has the effect I'm anticipating?

eldariont commented 5 years ago

Yes, that is indeed the case. At first, we included all reads but the improvement of the cactus graph over the VCF graph was relatively minor because most reads come from uninteresting regions. Therefore, we explored ways to "zoom into" the SV regions (see #95 for our discussion).

ekg commented 5 years ago

OK, that makes sense for me. I didn't understand the concept of "zooming in".

I think that this aspect needs to be very clear though. Maybe the non-zoomed plots should be in the supplement and mentioned here.

glennhickey commented 5 years ago

I think this is defensible. Looking at the deltas instead of the averages, like you say David, may be one way to make it simpler though. Plotting all the deltas in a histogram with a log-scale, may let you include all the reads but still show the enrichment for cactus.

On Mon, May 27, 2019 at 8:42 AM Erik Garrison notifications@github.com wrote:

OK, that makes sense for me. I didn't understand the concept of "zooming in".

I think that this aspect needs to be very clear though. Maybe the non-zoomed plots should be in the supplement and mentioned here.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jmonlong/manu-vgsv/pull/101?email_source=notifications&email_token=AAG373TQZ4OGXJKJJZ2NNYDPXPJNDA5CNFSM4HP3I4F2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWJWUBI#issuecomment-496200197, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG373RXFIB6LFYJW2Y2SGLPXPJNDANCNFSM4HP3I4FQ .

ekg commented 5 years ago

Looking at the deltas instead of the averages

Here, we're showing the average delta for reads which change in mapping identity and quality between graphs. It's the average change for the things that change.

I agree that it's defensible! I'm just stressing caution so a reviewer or eventual reader doesn't feel misled.

glennhickey commented 5 years ago

On second thought, it may be tricky to put all the strains in one histogram. Anyway, I agree we can merge this in with the clear explanation, and worry about further clarifications in review or at least after it's up on bioarxiv.

On Mon, May 27, 2019 at 9:51 AM Erik Garrison notifications@github.com wrote:

Looking at the deltas instead of the averages

Here, we're showing the average delta for reads which change in mapping identity and quality between graphs. It's the average change for the things that change.

I agree that it's defensible! I'm just stressing caution so a reviewer or eventual reader doesn't feel misled.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jmonlong/manu-vgsv/pull/101?email_source=notifications&email_token=AAG373XIWUDVQ5TDFC25IHDPXPRM7A5CNFSM4HP3I4F2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWJ3NPI#issuecomment-496219837, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG373XTIGGAP7PPTQIQWPDPXPRM7ANCNFSM4HP3I4FQ .

eldariont commented 5 years ago

Okay great, I changed panels 4 and 6 so that they now show the average delta of mapping identity and quality instead of absolute values. I use barplots for this which allows me to plot results for the five-sample graphs and the all-sample graphs side by side.

Panel 4 (SV genotyping results using reads from SV regions only): panel4

Panel 6 (SV genotyping results using all reads): panel6

As Erik suggested, Panel 6 shows the results using all the reads (non-zoomed in) and of course the improvement of cactus is slighter there.

I tried to adapt the text in Results, Methods and the Supplementary according to my changes but it was pretty hard to make it easy-to-understand because the evaluation and the different graphs and strain sets have become quite complicated by now. I would appreciate any suggestions/commits for improvement.

jmonlong commented 5 years ago

I slightly polished the figure and made some small edits. It can be merged as far as I'm concerned.

One thing I just though about @eldariont, is it the "average of the delta" or the "delta of the averages"? We say the former but might be doing the latter.

eldariont commented 5 years ago

One thing I just though about @eldariont, is it the "average of the delta" or the "delta of the averages"? We say the former but might be doing the latter.

It is both actually because they are identical (if I'm not mistaken). Let's say c_i and v_i is the mapping identity of read i on cactus and vcf, respectively, and n is the number of reads. Then the delta of averages is sum(c)/n - sum(v)/n = (sum(c) - sum(v)) / n and the average of deltas is sum(c_i - v_i) / n = (sum(c) - sum(v)) / n.

eldariont commented 5 years ago

Thanks for the edits and polishing! Panels 4 and 6 look much better now. If you agree, we could merge.

jmonlong / manu-vgsv

Update yeast analysis #101