googlegenomics / bigquery-examples

Advanced BigQuery examples on genomic data.
Apache License 2.0

Query the BigGene variant data and create some basic plots of the data. #2

Closed chrisroat closed 10 years ago

chrisroat commented 10 years ago

Create an R Markdown file with code that uses the bigrquery package to pull data from the BigGene variants table and make some exploratory plots (by someone completely new to genomics data). Also include the md and html versions created by RStudio.

deflaux commented 10 years ago

Very cool plots - thank you!

pgrosu commented 10 years ago

Hi Chris and Nicole,

It looks very intuitive and nice, and the BigQuery Browser Tool examples are very responsive. It would be nice not to have billing enabled while the project is still in development; otherwise it could deter scientists/analysts from trying it out.

One thing: when I ran query_exec after substituting my own project ID, I got the following error in R (the SQL query was minimum-allelic-frequency-by-ethnicity.sql):

Error: Exceeded quota: too many query byte volume for this project

 quotaExceeded. Exceeded quota: too many query byte volume for this project

This was following the instructions on the following page through R:

https://github.com/googlegenomics/bigquery-examples/tree/master/1000genomes/data-stories/literate-programming-demo

It would also be nice to have the option to run a more general query in the Browser Tool and then drill down on the results via another query (i.e. filter on the query results) without saving a view. Most scientists are not experts in efficient SQL query design and nomenclature, and they usually drill down in small steps. Yes, views can be saved, but I am not seeing them in the Browser Tool, and general users probably do not want to work out each time which saved view to go to, especially if they have many of them.

One nice option, if possible, would be to select columns and make some simple plots in the browser. That would create a nice interactive environment as well. It might be a little difficult now, but it would go a long way, since not everyone is an R/iPython/Pandas expert.

Thanks, Paul

siddharthab commented 10 years ago

Very cool plots indeed! Can we also add titles or captions to the plots so that they are readable on their own? Also, please add units for Likelihood, and ideally tick marks (without labels) when we have chromosome numbers on the x-axis.
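A minimal base R sketch of the plot changes requested above: a title, an axis label with room for a unit, and unlabeled tick marks at each chromosome. The data, the function name `draw_chromosome_plot`, and the title and label strings are placeholders invented for illustration; they are not taken from the BigGene query results or from the script under review.

```r
# Placeholder data only: 22 autosomes with made-up values, not BigQuery results.
chromosomes <- 1:22
values <- seq(0.1, 2.2, length.out = 22)

# Hypothetical helper: draws one plot with a title and unlabeled x-axis ticks.
draw_chromosome_plot <- function(chrom, y, title, ylab) {
  # xaxt = "n" suppresses the default x-axis so ticks can be drawn separately.
  plot(chrom, y, type = "b", xaxt = "n", main = title,
       xlab = "Chromosome", ylab = ylab)
  # labels = FALSE draws one tick mark per chromosome without any text.
  axis(side = 1, at = chrom, labels = FALSE)
  invisible(length(chrom))  # number of tick positions requested
}

pdf(NULL)  # discard graphics output so the sketch also runs non-interactively
n_ticks <- draw_chromosome_plot(chromosomes, values,
                                title = "Example title (placeholder)",
                                ylab = "Value (unit goes here)")
invisible(dev.off())
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so the plot appears on screen.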

chrisroat commented 10 years ago

Hi Siddhartha,

Thanks for the feedback. I'm relatively new to R; this was my first "real" R script. If you have suggestions for how to improve the code, I'd love to learn from you. Feel free to submit updates to the script via a pull request. You can see how to do that here:

https://help.github.com/articles/creating-a-pull-request

Cheers, C

(Sent by email in reply to Siddhartha Bagaria's comment of Wed, May 7, 2014 at 2:43 PM.)

siddharthab commented 10 years ago

Sure, I can do that. I remember that getting Greek symbols on the axis labels was the toughest part of plotting, but luckily we don't have to do that here. :) If you don't see a pull request this week, expect one by the end of the month.

pgrosu commented 10 years ago

@siddharthab, not always :) Check this out - it has gotten easier over the years :)

plot(x=seq(-3, 3, 0.001), sin(seq(-3, 3, 0.001)), xlab=expression(phi), ylab=expression(Sin(phi)))

[Image: the resulting plot of sin(phi) against phi, with Greek-letter axis labels]

pgrosu commented 10 years ago

Hi Chris,

If this was your first attempt at R, I am very impressed!!! It took me years to become comfortable with it.

Keep giving us more of these awesome results :) Paul

deflaux commented 10 years ago

Hi Paul,

Thank you for your feedback!

The BigQuery Browser Tool has a very nice query history feature (https://developers.google.com/bigquery/bigquery-browser-tool#queryhistory). I often use it to write queries iteratively: I view prior queries, click on "edit query", and modify the query for the next small step I would like to take.

Thanks! Nicole

(Sent by email in reply to Paul Grosu's comment of Wed, May 7, 2014 at 12:28 PM.)

pgrosu commented 10 years ago

Hi Nicole,

Thank you, and yes I noticed the query history the other day. The thing is that users probably do not want to rerun queries – especially if they are costly or lengthy time-wise.

The idea is to have a temporary view of the current results saved, which users can then filter by running another simple query on that view. Both you and I have the specialized domain expertise to adapt our queries easily when drilling down to uncover interesting results. I would not expect a significant subset of users (e.g. bench biologists) to know how to perform left outer joins with ease, yet their input is very valuable because of their extensive knowledge of a disease subtype and of the impact of molecular mechanisms based on the presence of specific sequence variations.
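As a rough illustration of that drill-down idea on the R side (a local workaround, not a Browser Tool feature): once a query's results are in a data frame, further filtering can be done in memory without running, or paying for, the query again. The data frame below is a made-up stand-in with hypothetical column names; in practice it would be whatever query_exec returned.

```r
# Stand-in for the data frame a single query would return (made-up rows,
# hypothetical column names); in practice this would come from query_exec().
results <- data.frame(
  population = c("A", "A", "B", "B", "C"),
  chromosome = c("1", "2", "1", "X", "2"),
  allele_freq = c(0.02, 0.40, 0.15, 0.01, 0.33),
  stringsAsFactors = FALSE
)

# Step 1: narrow the saved results to one chromosome.
chr1 <- subset(results, chromosome == "1")

# Step 2: drill down further on the already-narrowed rows.
chr1_rare <- subset(chr1, allele_freq < 0.05)

print(chr1_rare)
```

Each step works on the previous step's rows, which mirrors the small-step drill-down described above while the original query runs only once.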

For instance, if one wanted to compare the 1000 Genomes data with the HapMap Project, that could be quite a challenge for some users and not for others.

Maybe something to think about for a future version :)

Paul