Adding documentation for the CalculateCoverage dataflow pipeline.

Careyjmac commented 9 years ago

As per @deflaux's request, here is a simple tutorial on what the CalculateCoverage pipeline does and how to run it locally from the command line.

googlebot commented 9 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please let us know the company's name.

deflaux commented 9 years ago

Nice work @Careyjmac !!!

I just have a couple nit-picky things but otherwise it looks good to me. @mbookman anything to add?

The text sometimes uses "data set" and sometimes "dataset", or "read group set" and "ReadGroupSet"; please go ahead and make all the type names consistent.
The API explorer links include the Google Analytics tracking tag; please go ahead and strip those out.

Careyjmac commented 9 years ago

Done and done. :)

pgrosu commented 9 years ago

Why does the output annotation have to go into a Dataset? Why can't I test it locally as standard output for just retrieving the results as an inquiry into the coverage of a set of ReadGroupSets for other analyses? It feels a little cumbersome :( Maybe for now just add a temporary public dataset id until the Annotation system is updated to accommodate such pipelining.

Thanks, Paul

dionloy commented 9 years ago

Hi Paul,

All data generated by our API is encompassed within a Dataset for permission control. While it's entirely possible to add a file-based sink, these are intended as examples for cloud-based genomics processing (and thus writing to local files would muddle that message a bit). Users are welcome to branch and modify the examples for their own purposes, of course. Thanks!

pgrosu commented 9 years ago

Hi Dion,

Thank you for the clarification, and I totally agree. However, the calculate_coverage.rst document in this PR mentions that the pipeline can be run locally (without --runner=DataflowPipelineRunner), so I assumed the results could also be viewed locally for troubleshooting purposes. That would be similar to how I can call Readgroupsets.coveragebuckets locally, whose results are outside the permissions of a Dataset:

$ curl -X GET -H "Content-Type: application/json"  https://www.googleapis.com/genomics/v1beta2/readgroupsets/CMvnhpKTFhD04eLE-q2yxnU/coveragebuckets?key=<removed>
{
 "coverageBuckets": [
  {
   "range": {
    "referenceName": "1",
    "start": null,
    "end": "249250621"
   },
   "meanCoverage": 8.670173
  },
...
$

I'll continue to explore more on my own via TextIO.Write.
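
For anyone following along, here is a minimal local sketch of the "inspect it on standard output" idea. It is not part of the CalculateCoverage pipeline and does not use TextIO.Write; it only assumes a JSON response shaped like the coveragebuckets output above. The sample string contains just the first bucket shown (the rest of the response is elided in the thread), the helper names `bucket_rows` and `format_rows` are hypothetical, and treating a null `start` as 0 is an assumption.

```python
import json

# Only the first bucket from the response quoted above; the remainder is elided there.
SAMPLE_RESPONSE = """
{
 "coverageBuckets": [
  {
   "range": {"referenceName": "1", "start": null, "end": "249250621"},
   "meanCoverage": 8.670173
  }
 ]
}
"""


def bucket_rows(response_text):
    """Yield (referenceName, start, end, meanCoverage) for each coverage bucket."""
    payload = json.loads(response_text)
    for bucket in payload.get("coverageBuckets", []):
        rng = bucket.get("range", {})
        # Positions arrive as strings (or null) in the sample above.
        # Assumption: a null or missing position is treated as 0.
        start = int(rng.get("start") or 0)
        end = int(rng.get("end") or 0)
        yield rng.get("referenceName"), start, end, float(bucket["meanCoverage"])


def format_rows(rows):
    """Render each row as a tab-separated line suitable for standard output."""
    return ["\t".join(str(value) for value in row) for row in rows]


if __name__ == "__main__":
    for line in format_rows(bucket_rows(SAMPLE_RESPONSE)):
        print(line)
```

In practice the saved output of the curl call could be passed to `bucket_rows` in place of `SAMPLE_RESPONSE`.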

Thanks, Paul

pgrosu commented 9 years ago

Hi Dion,

I was just thinking about this some more and realized that coverage is not really an annotation but the result of a performed analysis, so it more reasonably should have its own type definition under a general processed-result category. Until then, it should probably be saved to a gs bucket that is associated with the dataset.

Let me know what you think.

Thanks, Paul

dionloy commented 9 years ago

That would work too. However, we are treating 'annotation' as a very loose definition for any data that can be mapped to a genomic range (either per sample or per reference). We'll see how it works out =).

pgrosu commented 9 years ago

Sounds good :) It's still fun to play around, and we're lucky to be at this point in time to create these new standards for the cloud for genomic data.

Have a good one, `p

Careyjmac commented 9 years ago

All of @mbookman's requested changes have been made.

pgrosu commented 9 years ago

Thank you for merging, but I'm still getting a "Sorry this page does not exist yet" when trying to access the following link as noted in line 74 of CalculateCoverage.java :(

http://googlegenomics.readthedocs.org/en/latest/use_cases/analyze_reads/calculate_coverage.html

~p

googlegenomics / readthedocs

Adding documentation for the CalculateCoverage dataflow pipeline. #47