googlegenomics / pipelines-api-examples

Examples for the Google Genomics Pipelines API.
BSD 3-Clause "New" or "Revised" License

Continuing our discussion for simplifying pipelines (and their examples) #13

Open pgrosu opened 8 years ago

pgrosu commented 8 years ago

Hi Matt (@mbookman),

So to continue our discussion from https://github.com/googlegenomics/pipelines-api-examples/pull/10#discussion_r55108121, I understand the REST interface here:

https://www.googleapis.com/discovery/v1/apis/genomics/v1alpha2/rest

But this is too cumbersome for bioinformaticians who just want a turn-key solution to run their analyses. The examples are great, but we should have secondary, simplified versions of them, which would broaden the audience. That includes the ability to handle multiple input files, which can be done now even if the backend does not support it directly. We should also include examples of pipelines connected into workflows, and of nested pipelines - and yes, there are several ways to do that :)

So with each example there should be a pipeline definition like the one below, kept in a file that a program (Python/R/Java, etc.) picks up and adapts to the REST interface. The user provides only the necessary information, and the parser transforms the generalized names and fills out the required fields on its own (a rough sketch of such a translator follows the definition below):

Pipeline:

    name: 'fastqc'
    CPU: 1
    RAM: 3.75 GB

    disks:
      name: 'datadisk'
      mountPoint: '/mnt/data'
      size: 500 GB
      persistent: true

    docker:
      image: 'gcr.io/PROJECT_ID_ARGUMENT/fastqc'

      cmd: ( 'mkdir /mnt/data/output && '
             'fastqc /mnt/data/input/* --outdir=/mnt/data/output/' )

    inputParameters:

      name: inputFile + [idx : 1...len(INPUT)]

      location:
        path: 'input/'
        disk: 'datadisk'

    outputParameters:

      name: 'outputPath'

      location:
        path: 'output/*'
        disk: 'datadisk'

pipelineArgs:

    RAM: 1 GB

    disks:
      name: 'datadisk'
      size: DISK_SIZE_ARGUMENT
      persistent: true

    inputs:
      inputFile + [idx : 1...len(INPUT)]
    outputs:
      path: OUTPUT_ARGUMENT

    logging:
      path: LOGGING_ARGUMENT
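
For illustration, here is a rough sketch of what such a translation layer might look like in Python, assuming the simplified definition above were expressed as valid YAML (the indexed-input shorthand would need its own mini-syntax, and cmd would need to end up as a single string). The helper names (build_run_request, _gb) are made up, and the v1alpha2 field names, taken from the repo's existing Python examples, should be checked against the discovery document:

    # Hypothetical sketch: expand a simplified YAML definition (like the
    # one above) into a request body for genomics v1alpha2 pipelines.run().
    import yaml  # PyYAML


    def _gb(value):
        # Turn '3.75 GB' / '500 GB' style values into a number.
        return float(str(value).split()[0])


    def build_run_request(spec_path, project_id, input_uris, output_uri,
                          log_uri, disk_size_gb):
        with open(spec_path) as f:
            spec = yaml.safe_load(f)
        pipe, args = spec['Pipeline'], spec['pipelineArgs']
        disk = pipe['disks']

        # One named input parameter per file: the
        # "inputFile + [idx : 1...len(INPUT)]" shorthand above.
        input_names = ['inputFile%d' % i for i in range(len(input_uris))]

        return {
            'ephemeralPipeline': {
                'projectId': project_id,
                'name': pipe['name'],
                'docker': {
                    # Substitute the *_ARGUMENT placeholders from the file.
                    'imageName': pipe['docker']['image'].replace(
                        'PROJECT_ID_ARGUMENT', project_id),
                    'cmd': pipe['docker']['cmd'],
                },
                'inputParameters': [
                    {'name': name,
                     'localCopy': {'path': 'input/', 'disk': disk['name']}}
                    for name in input_names
                ],
                'outputParameters': [
                    {'name': 'outputPath',
                     'localCopy': {'path': 'output/*', 'disk': disk['name']}}
                ],
                'resources': {
                    'minimumCpuCores': pipe['CPU'],
                    'minimumRamGb': _gb(pipe['RAM']),
                    'disks': [{'name': disk['name'],
                               'mountPoint': disk['mountPoint'],
                               'sizeGb': int(_gb(disk['size']))}],
                },
            },
            'pipelineArgs': {
                'projectId': project_id,
                'inputs': dict(zip(input_names, input_uris)),
                'outputs': {'outputPath': output_uri},
                'logging': {'gcsPath': log_uri},
                'resources': {
                    'minimumRamGb': _gb(args['RAM']),
                    # DISK_SIZE_ARGUMENT in the file is supplied by the caller.
                    'disks': [{'name': disk['name'],
                               'sizeGb': disk_size_gb}],
                },
            },
        }

The resulting dict could then be passed straight to pipelines.run(), exactly as the current examples do.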

Let me know what you think.

Thanks, Paul

mbookman commented 8 years ago

Hi Paul,

Yes, a text file-based approach makes a lot of sense, and there will be a progression of examples and preferred approaches here.

The pipelines API has just been made available as an open alpha and there is still much to be done to refine both the API and the usage around it. The examples published thus far were specifically requested by developers integrating with Python.

gcloud support is coming, which will provide a command-line interface to pipelines, along with the ability to specify a pipeline in JSON (and possibly YAML).

When gcloud support is available, many of the examples will then be gcloud-based, and we will continue to drive towards better user experiences, particularly for the non-developer.

-Matt

pgrosu commented 8 years ago

Hi Matt,

Do you have something I can test locally - like api-provider-local-java used to be, but for pipelines - so I can try out a few ideas for backend pipeline processing against the current implementation you guys have over there? The example above is not quite as smooth as I would like it to be, and I want to experiment with a few dynamically changing pipelines to continue driving our discussion.

Basically I feel the current concept of a pipeline is too static for some situations, and there are some nice ideas that could help users. Below is an example; the definition would be digested and thus content-addressable:

1) You have a web- or gs-accessible build config like this (e.g. gs://pipelines/dynamic-build), which would auto-configure itself to the size of the job - the p before a disk path means it is considered persistent (a sketch of how such ranges might be resolved follows the example below):

name: dynamic-build
  (cpulist: 1...16)
  (ram: 2 GB...8 GB)
  (disk: 100 GB...1000 GB)
  (disk: p'/mnt/data', p'/mnt/tmp', p'/mnt/research')

2) Then you launch it in a nested way, and stream-filter the failed entries from the FastQC summary.txt files into a web page report:

  $ ./run-dyn-workflow --config=gs://pipelines/dynamic-build
                        --input=nested-workflow={
                                run-dyn-workflow --config=gs://pipelines/dynamic-build \
                                     --image='gcr.io/`echo z${PROJECT_ID}`/fastqc', \
                                     --input= '(inputFile + [idx : 1...len(INPUT)], 
                                              "good_sequence_short.fastq.gz", "small_rna.txt")', \
                                     --output=useAllDisks,
                   cmd= 'mkdir 3-8-3016_fastqc && fastqc DYNAMIC_MOUNT/input/* 
                         --outdir=DYNAMIC_MOUNT/output/ && cat DYNAMIC_MOUNT/output/*/summary.txt' },
              --cmd=' cat - | grep FAIL >> ~/www/html-report.html'
              --output=DYNAMIC_MOUNT
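
To make the auto-configuration idea a bit more concrete, here is a purely hypothetical Python sketch; nothing like this exists in the Pipelines API, and the names are made up. It resolves the cpulist/ram/disk ranges above against the total input size and expresses the summary.txt FAIL filter:

    # Hypothetical only: resolve the (min...max) ranges in a dynamic-build
    # config against the total size of the inputs.
    def resolve_resources(total_input_gb,
                          cpu_range=(1, 16),
                          ram_gb_range=(2, 8),
                          disk_gb_range=(100, 1000)):
        def clamp(value, lo, hi):
            return max(lo, min(hi, value))
        # Simple heuristic: scale with input size, clamp into the ranges,
        # and give the disk 3x the input as scratch space.
        return {
            'minimumCpuCores': clamp(int(total_input_gb // 10) + 1, *cpu_range),
            'minimumRamGb': clamp(int(total_input_gb // 5) + 2, *ram_gb_range),
            'diskSizeGb': clamp(int(total_input_gb * 3), *disk_gb_range),
        }


    def failed_checks(summary_text):
        # FastQC summary.txt lines start with PASS/WARN/FAIL; this mirrors
        # the `grep FAIL` step in the command above.
        return [line for line in summary_text.splitlines()
                if line.startswith('FAIL')]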

Thanks, Paul

mbookman commented 8 years ago

Hi Paul,

Sorry, there is no reference server to test with locally.

The idea of the pipelines API as it stands is to provide this very simple, but powerful, building block. It will enable many different pipeline runners (like Cromwell) to add a "run in the cloud" feature to their existing workflow definition files without the need to explicitly provision a fixed cluster (such as Grid Engine or Mesos).
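
For concreteness, the existing Python examples drive that building block with a single pipelines.run() call followed by polling the returned operation. A condensed sketch (the helper name run_and_wait is made up; request_body is a dict shaped like the one discussed above):

    # A condensed sketch of the pattern the repo's Python examples use:
    # submit one pipelines.run() request, then poll the returned operation.
    import time

    from googleapiclient.discovery import build
    from oauth2client.client import GoogleCredentials


    def run_and_wait(request_body, poll_interval=30):
        """Submit one pipeline and block until its operation completes."""
        credentials = GoogleCredentials.get_application_default()
        service = build('genomics', 'v1alpha2', credentials=credentials)
        operation = service.pipelines().run(body=request_body).execute()
        while not operation.get('done', False):
            time.sleep(poll_interval)
            operation = service.operations().get(
                name=operation['name']).execute()
        return operation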

-Matt

pgrosu commented 8 years ago

Hi Matt,

I agree it is powerful, but there is potential being lost by not tackling some infrastructure and flexibility issues at the start. There is nothing wrong with being a service, but think 5-10 years down the line: all this processed data will not be organized or properly searchable - and I'm not even talking about provenance.

Computing power and cloud storage will only get cheaper, which will allow users to start implementing their own mini-backends on the cloud. Why not help users by providing more structure for them to build upon, so they can work even more effectively?

The key question is: why should users use the GoogleGenomics API - which we've worked so hard to streamline over the past two years - if they can just bypass it and run their own custom pipelines? GoogleGenomics has huge untapped potential, and through deeper integration it will change how genomic analysis is performed.

Paul