googlegenomics / bigquery-examples

Advanced BigQuery examples on genomic data.
Apache License 2.0

contig in variants1kG versus chrom in knownGene #7

Open maxbox51 opened 10 years ago

maxbox51 commented 10 years ago

There are two forms of sequence identifier being used in different tables for the same sequence. The variants1kG table refers to chromosome 1 in field "contig" as "1", while the knownGene table refers to chromosome 1 in field "chrom" as "chr1". We have this issue with the data as it comes from the UCSC site, too, but it's a bigger deal if you're trying to join across tables. It is possible, of course, to shorten "chr1" to "1" in a subquery, but I suspect it is more efficient to avoid the subquery when possible.
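To make the mismatch concrete, here is a minimal Python sketch of the normalization a join currently needs. It is only an illustration under stated assumptions: the helper names `normalize_seq_name` and `join_on_sequence` are hypothetical, rows are modeled as plain dicts, and only the "contig" and "chrom" field names come from the tables described above. It strips the "chr" prefix and nothing else, so other naming differences (such as chrM versus MT for the mitochondrial sequence) would still need their own handling.

```python
def normalize_seq_name(name: str) -> str:
    """Strip a leading 'chr' so UCSC-style names ('chr1') match bare names ('1')."""
    return name[3:] if name.startswith("chr") else name


def join_on_sequence(variants, genes):
    """Inner-join variant rows to gene rows on the normalized sequence name.

    `variants` rows are dicts with a 'contig' key; `genes` rows are dicts
    with a 'chrom' key. Both layouts are assumptions made for illustration.
    """
    # Index gene rows by normalized sequence name for a simple hash join.
    genes_by_seq = {}
    for gene in genes:
        genes_by_seq.setdefault(normalize_seq_name(gene["chrom"]), []).append(gene)

    # Pair every variant with every gene on the same sequence.
    return [
        (variant, gene)
        for variant in variants
        for gene in genes_by_seq.get(normalize_seq_name(variant["contig"]), [])
    ]
```

If both tables used the same identifier form, the normalization step would disappear and the join key could be compared directly.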

Also, this is a much more minor issue, but it would be nice if fields containing the same values had the same name across tables, to indicate that they are appropriate fields to join on. Both chromosomes and contigs ("contig" is the more general term, but suggests something less than a chromosome) are sequences, and I personally would prefer that both field names be replaced by "sequence" or "seq".