googlegenomics / bigquery-examples

Advanced BigQuery examples on genomic data.
Apache License 2.0

Update the examples for recent bigrquery changes. #20

Closed craigcitro closed 10 years ago

craigcitro commented 10 years ago

PTAL @deflaux (if you'd rather just make this update yourself, I can drop this PR.)

I recently changed the signature of the query_exec function in the bigrquery R package, which meant that all calls in the examples needed updating. I've done that here, with a few notes/caveats:

deflaux commented 10 years ago

@craigcitro thanks so much for this and your continued help with bigrquery and httr!

It has been merged via 78df32fc71779a126a220ce30f6eba125fa7feb1 into a fork where I am making a bunch of other unrelated updates, and I'll regenerate the md files all in one go.

pgrosu commented 10 years ago

Thanks @craigcitro and Nicole for all the hard work! So is this ready to be tested now?

Thanks, Paul

deflaux commented 10 years ago

Hi @pgrosu,

If you want to give the new version of bigrquery a spin:

install.packages("httr")
devtools::install_github("bigrquery")

Thanks a bunch! Nicole

pgrosu commented 10 years ago

Hi Nicole,

I would be glad to try it out. I'll let you know how it goes.

Many thanks for all the great work! Paul

pgrosu commented 10 years ago

Hi Nicole,

So I tested these using @craigcitro's changes from this pull request, which are not yet merged.

The one change I made is to replace sql with sql.query in the DisplayAndDispatchQuery() function, since sql is already the name of a function in the dplyr package.
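
A minimal sketch of the name clash, for illustration only. The `sql` function below is a locally defined stand-in (an assumption, so the snippet runs without dplyr installed), and both helper functions are hypothetical names that do not appear in the examples:

```r
# Stand-in for a package-level function named `sql` (hypothetical; defined
# here so the example runs without dplyr, and not dplyr's implementation).
sql <- function(x) structure(x, class = "sql")

read_query_shadowing <- function(text) {
  sql <- text        # a local variable that hides the function when used as a value
  is.function(sql)   # FALSE: inside this function `sql` is the character string
}

read_query_renamed <- function(text) {
  sql.query <- text  # a distinct name, so `sql` still refers to the function
  is.function(sql)   # TRUE
}

print(read_query_shadowing("SELECT 1"))
print(read_query_renamed("SELECT 1"))
```

Using a distinct name such as sql.query keeps it obvious which object is meant when both are in scope.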

I was unable to test everything, since I got a couple of Exceeded quota: error messages. I would be happy to test more, but how would you recommend I go about having this quota limit lifted?

I will show the two examples that gave me that error, and then I'll show a complete end-to-end example that worked - which is consolidated from several Readme.Rmd files.

Below is the error message:

Error: Exceeded quota: too many query byte volume for this project

 quotaExceeded. Exceeded quota: too many query byte volume for this project

Below are the R commands that returned this error message:

 result <- DisplayAndDispatchQuery("./1000genomes/sql/sample-level-data-for-brca1.sql")

 result <- DisplayAndDispatchQuery("./1000genomes/sql/shared-variant-counts-by-ethnicity.sql")

Below is a complete example that worked:

install.packages("httr")
install.packages("devtools")
devtools::install_github("assertthat")
devtools::install_github("bigrquery")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("xtable")
install.packages("testthat")

require(bigrquery)
require(ggplot2)
require(dplyr)
require(xtable)
require(testthat)

project <- "......" # My project ID which I anonymized :)

DisplayAndDispatchQuery <- function(queryUri) {
  sql.query <- readChar(queryUri, nchars=1e6)
  cat(sql.query)
  query_exec(sql.query, project)
}

result <- DisplayAndDispatchQuery("./1000genomes/sql/ratio-of-dbsnp-variants-by-chromosome.sql")

print(xtable(result, digits=6), type="html", include.rownames=FALSE)

qplot(num_variants, num_dbsnp_variants, color=num_variants, data=result) +
  scale_colour_gradient("Number of Variants", labels=function(x)round(x)) +
  ylab("Number of dbSNP Variants") +
  xlab("Number of Variants") +
  ggtitle("dbSNP Variant Count vs. Total Variant Count by Chromosome") +
  geom_text(aes(label=contig), hjust=-1, vjust=0)

Thanks, Paul

deflaux commented 10 years ago

Hi Paul,

That's helpful feedback - thank you! I believe that error message occurs when billing has not been enabled for the Google Cloud Platform project and queries have exceeded the amount of data that is free of charge per month. (It's easy to exceed the threshold with the 1,000 Genomes data since it's quite large.) Does that sound correct to you?

Take-aways:

  1. Will rename sql throughout.
  2. The error message needs to be clearer.
  3. The getting started information should be more prominent and/or the first few samples presented should be on a smaller data set.
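
For take-away 2, here is a rough sketch of what a clearer message could look like on the caller's side. It assumes the explanation above (no billing enabled, free quota used up) is the actual cause. `with_clear_quota_error` and `fake_query` are hypothetical names, and the query function is passed in so the snippet runs without bigrquery or a project ID:

```r
# Hypothetical helper: re-raise quota errors with a more actionable message.
# `run_query` is any zero-argument function that executes the query, for
# example a closure around query_exec.
with_clear_quota_error <- function(run_query) {
  tryCatch(
    run_query(),
    error = function(e) {
      msg <- conditionMessage(e)
      if (grepl("Exceeded quota", msg, fixed = TRUE)) {
        # Assumed cause, taken from the discussion above; not confirmed by the API.
        stop(
          "The free query quota for this project appears to be exhausted. ",
          "Enable billing for the project or query a smaller data set.\n",
          "Original error: ", msg,
          call. = FALSE
        )
      }
      stop(e)  # any other error is passed through unchanged
    }
  )
}

# Usage with a fake query function that fails with the message reported above.
fake_query <- function() {
  stop("Exceeded quota: too many query byte volume for this project")
}
result <- tryCatch(with_clear_quota_error(fake_query),
                   error = function(e) conditionMessage(e))
cat(result, "\n")
```

In the examples, the same wrapper could be applied inside DisplayAndDispatchQuery, though the better long-term fix is probably a clearer message from the library itself.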

Thanks again, Nicole

pgrosu commented 10 years ago

Hi Nicole,

Glad to help out and yes, you are correct :) I didn't want to enable billing unless it was specific to a particular analysis project. Something that might be helpful is to have temporary project IDs that don't have this limit, so that all the tests can be run properly.

Feel free to let me know anything else you'd like me to do regarding this.

Thanks, Paul

craigcitro commented 10 years ago

Also, thanks for spotting the sql thing -- I'll update that in docs for bigrquery, too.

pgrosu commented 10 years ago

Sure thing :)