googlegenomics / bigquery-examples

Advanced BigQuery examples on genomic data.
Apache License 2.0
89 stars 31 forks source link

merge "the same" variant in gvcf_variants and gvcf_variants_expanded #15

Open deflaux opened 10 years ago

deflaux commented 10 years ago

Since "the same" variant can occur on multiple records in our tables (see example records shown by query below), it breaks the experiment we are trying to do with table gvcf_variants_expanded.

In VCF format, the same variant can be encoded in multiple ways and we're not trying to solve that deeper issue here. Instead its just a simplistic grouping of data -> when the location in the genome, reference bases, and alternate bases match exactly, group those together in the same single record in the table.

SELECT
  contig_name,
  start_pos,
  reference_bases,
  alt,
  numsamples,
  dataset
FROM (
  SELECT
    contig_name,
    start_pos,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alt,
    COUNT(call.callset_name) WITHIN RECORD AS numsamples,
    'gvcf' AS dataset,
  FROM
    [google.com:biggene:test.pgp_gvcf_variants]
  WHERE
    contig_name = '2'
    AND start_pos=222324034
    AND reference_bases='CTA'),
  (
  SELECT
    contig_name,
    start_pos,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alt,
    COUNT(call.callset_name) WITHIN RECORD AS numsamples,
    'exp' AS dataset,
  FROM
    [google.com:biggene:test.pgp_gvcf_variants_expanded]
  WHERE
    contig_name = '2'
    AND start_pos=222324034
    AND reference_bases='CTA')
ORDER BY
  dataset,
  alt