googlegenomics / bigquery-examples

Advanced BigQuery examples on genomic data.
Apache License 2.0

Unique keys in 1k genomes #5

Closed: maxbox51 closed this 10 years ago

maxbox51 commented 10 years ago

Regarding https://github.com/googlegenomics/bigquery-examples/tree/master/1000genomes/data-stories/understanding-alternate-alleles:

The 1000 Genomes data never has more than one reference_bases value per <contig, position> pair, so reference_bases need not be part of the unique key definition you give there.

maxbox51 commented 10 years ago

Actually, according to this query,

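# Count the records that carry an alt value for each candidate key.
SELECT
  contig,
  position,
  vt,
  end,
  COUNT(alt) AS n
FROM (
  # Flatten each record's repeated alternate_bases into one comma-separated string.
  SELECT
    contig,
    position,
    vt,
    end,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alt
  FROM
    [google.com:biggene:1000genomes.variants1kG]
    #[google.com:biggene:test.variants1kG_tiny]
)
GROUP EACH BY
  contig, position, vt, end
# Keep only the keys that appear in more than one record.
HAVING
  n > 1;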

which returns no results, the alternate bases aren't necessary for a unique key, either. This makes sense: you only need one list of alternate values relative to the single default (reference) value.
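For anyone who wants to try the same uniqueness check outside BigQuery, here is a minimal Python sketch of the idea. It assumes the records are already loaded as dictionaries. The three rows are toy data invented for illustration, not values from the 1000 Genomes table, and `duplicate_keys` is a hypothetical helper name.

```python
from collections import Counter

# Toy rows, invented for illustration; NOT taken from the 1000 Genomes table.
rows = [
    {"contig": "1", "position": 100, "end": 100, "vt": "SNP",
     "reference_bases": "A", "alternate_bases": ["G"]},
    {"contig": "1", "position": 200, "end": 203, "vt": "INDEL",
     "reference_bases": "ACGT", "alternate_bases": ["A", "AC"]},
    {"contig": "2", "position": 100, "end": 100, "vt": "SNP",
     "reference_bases": "C", "alternate_bases": ["T"]},
]


def duplicate_keys(rows, key_fields):
    """Return {key tuple: count} for key tuples that occur in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Candidate key from the query above: no duplicates means it is unique here.
candidate_key = ("contig", "position", "vt", "end")
print(duplicate_keys(rows, candidate_key))   # {}

# A key that is too narrow shows up as a duplicate.
print(duplicate_keys(rows, ("position",)))   # {(100,): 2}
```

An empty result plays the same role as the query returning no rows: on the data checked, the candidate columns already identify each record without reference_bases or alternate_bases.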