googlegenomics / dataflow-java

Google Cloud Dataflow pipelines such as Identity-By-State as well as useful utility classes.
Apache License 2.0

Enforce window within merge logic. #200

Closed deflaux closed 8 years ago

deflaux commented 8 years ago

Corresponding changes per https://github.com/googlegenomics/utils-java/pull/95

Sending this to a branch for now. Travis CI will turn green when the next version of utils-java is released.

dionloy commented 8 years ago

LGTM

(I assume the build breakage is due to the utils-java version not being ready yet?)

deflaux commented 8 years ago

Yes, that's right. This won't get merged to master until Travis CI is green.

pgrosu commented 8 years ago

Hi Nicole,

Just to be sure, after reading through the utils-java and dataflow-java codebases: how and why are you merging non-variant segments if the IBS similarity calculation is focused on coordinate-based SNP comparisons? Basically, how would the non-variant segments, without additional alternate alleles besides <NON_REF>, provide any significant information without diluting the IBS score, which is reflected in the scores in the 1000 Genomes analysis for chromosome 22? The genotypes for those regions are usually 0/0, which is basically your reference, and the regions are blocked by quality, with the non-variant regions differing between cohorts. If you want to say that most of the reference stayed the same, that can be tricky to calculate if the non-variant blocks are not relatively stationary.

If you are getting the call list from MergeNonVariantSegmentsWithSnps.java via updatedRecord.addAllCalls(blockRecord.getCallsList()); to then perform addAllCalls(java.lang.Iterable<? extends com.google.genomics.v1.VariantCall> values) from v1/Variant.java, which calls from a non-variant segment would be used for comparison, and how are they compared to SNP calls in the same regions for the IBS score calculation?
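To make the dilution concern concrete, here is a minimal sketch. It assumes a deliberately simplified score (the fraction of sites where two samples have the same genotype string), which is only an illustration and not the IBS method implemented in this repository. The class name IbsDilutionSketch and both helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a toy similarity score, not the
// repository's IBS implementation.
public class IbsDilutionSketch {

  /** Fraction of sites at which both samples carry the same genotype string. */
  static double sharedGenotypeFraction(List<String> a, List<String> b) {
    if (a.size() != b.size() || a.isEmpty()) {
      throw new IllegalArgumentException("Genotype lists must be non-empty and equal in length");
    }
    int same = 0;
    for (int i = 0; i < a.size(); i++) {
      if (a.get(i).equals(b.get(i))) {
        same++;
      }
    }
    return (double) same / a.size();
  }

  /** Returns a copy of the genotypes with n extra homozygous-reference (0/0) sites appended. */
  static List<String> withReferenceSites(List<String> genotypes, int n) {
    List<String> out = new ArrayList<>(genotypes);
    for (int i = 0; i < n; i++) {
      out.add("0/0");
    }
    return out;
  }

  public static void main(String[] args) {
    // Four SNP sites; the two samples agree at only one of them.
    List<String> sampleA = Arrays.asList("0/1", "1/1", "0/0", "0/1");
    List<String> sampleB = Arrays.asList("0/1", "0/1", "1/1", "0/0");
    System.out.println(sharedGenotypeFraction(sampleA, sampleB)); // 0.25

    // Add 96 sites from non-variant blocks where both samples are 0/0.
    System.out.println(sharedGenotypeFraction(
        withReferenceSites(sampleA, 96), withReferenceSites(sampleB, 96))); // 0.97
  }
}
```

Under this toy score, the two samples look 25% similar on SNP sites alone but 97% similar once the shared 0/0 sites are counted, so the signal from the variant sites is largely washed out. Whether the actual calculation behaves this way depends on how it treats reference calls.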

Thanks, ~p

deflaux commented 8 years ago

Hi Paul! There are two methods so far for calculating IBS:

deflaux commented 8 years ago

@pgrosu I deleted your comment because it contained code. Due to the rules of this project, you must sign the CLA or refrain from including code in any comments. We really value your participation and help. I know you put a lot of effort into your comments, and so it wastes your effort when we must delete them. Please help us and sign the CLA or stop including code in your comments.

pgrosu commented 8 years ago

@deflaux It was all your code, which is already part of this repository, and I was not contributing any code. I just wanted to make you aware that there are issues with your approach to calculating IBS, and that the calculation is not always specific to the calls. I thought that GoogleGenomics wanted contributed code to be as scientifically and mathematically correct as possible. Isn't that the goal?