googlegenomics / dataflow-java

Google Cloud Dataflow pipelines, such as Identity-By-State, as well as useful utility classes.
Apache License 2.0

Fix parallelism for ReadGroupStreamer. #166

Closed · deflaux closed this 8 years ago

deflaux commented 8 years ago

This was tested manually to ensure that the number of active Dataflow tasks was larger than the number of read group sets.
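
For reference, one common way to keep a streaming step like this from being fused with the step that produces its input is to route the elements through a GroupByKey, so that the intermediate collection is materialized. Below is a minimal sketch of that pattern, assuming the Dataflow Java SDK 1.x API (com.google.cloud.dataflow.sdk.*); it illustrates the general technique, not necessarily the exact change in this PR.

// A minimal sketch of a GroupByKey-based fusion break, assuming the
// Dataflow Java SDK 1.x API; not the actual ReadGroupStreamer change.
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.PTransform;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

/** Breaks fusion by forcing the input collection through a GroupByKey. */
public class BreakFusion<T> extends PTransform<PCollection<T>, PCollection<T>> {
  @Override
  public PCollection<T> apply(PCollection<T> input) {
    return input
        // Key each element individually so the grouping does not reduce parallelism.
        .apply(ParDo.of(new DoFn<T, KV<String, T>>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(KV.of(String.valueOf(c.element().hashCode()), c.element()));
          }
        }))
        // GroupByKey materializes the intermediate collection, so the producer
        // and consumer of this PCollection are no longer fused into one stage.
        .apply(GroupByKey.<String, T>create())
        // Emit the grouped values back out as individual elements.
        .apply(ParDo.of(new DoFn<KV<String, Iterable<T>>, T>() {
          @Override
          public void processElement(ProcessContext c) {
            for (T value : c.element().getValue()) {
              c.output(value);
            }
          }
        }));
  }
}

A PCollection of read group sets could apply such a transform ahead of the per-read processing step, so that the work fans out across more workers than there are read group sets.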

pgrosu commented 8 years ago

Hi Nicole,

Looks nice! A couple of small things:

https://github.com/deflaux/dataflow-java/blob/master/pom.xml#L117 -> s/v1beta2-rev87-1.20.0/v1-rev56-1.21.0

https://github.com/deflaux/dataflow-java/blob/master/pom.xml#L194 -> s/1.128/2.1.0

Thanks, ~p

dionloy commented 8 years ago

LGTM

Thanks! For my own education, how did you discover they were being fused?

pgrosu commented 8 years ago

Hi Dion and Nicole,

You might want to be aware that the OpenJDK 7 builds are not working properly. If you look at the end of the raw Travis log, you will see the following core dump and error:

7f10a58bd000-7f10a58be000 r--p 00022000 08:01 3267                       /lib/x86_64-linux-gnu/ld-2.15.so
7f10a58be000-7f10a58c0000 rw-p 00023000 08:01 3267                       /lib/x86_64-linux-gnu/ld-2.15.so
7ffe40f0b000-7ffe40f2f000 rw-p 00000000 00:00 0                          [stack]
7ffe40fd9000-7ffe40fdb000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
[ERROR] Aborted (core dumped)

[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Skipping Google Genomics and Dataflow
[INFO] This project has been banned from the build due to previous failures.
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15.389 s
[INFO] Finished at: 2016-02-06T00:02:10+00:00
[INFO] Final Memory: 35M/217M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:cobertura-maven-plugin:2.7:instrument (default-cli) on project google-genomics-dataflow: Unable to instrument project. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

travis_time:end:0347a42a:start=1454716914055495207,finish=1454716931072952112,duration=17017456905
travis_fold:end:after_success

Done. Your build exited with 0.

~p

pgrosu commented 8 years ago

Hi Dion,

Regarding fusion optimization of a Dataflow graph, you can monitor the worker logs and check the fused steps to see how the load is spread across workers. Below is a screenshot from the Dataflow documentation, under the "Detecting an Exception in Worker Code" heading; notice the fused steps:

[Screenshot: dataflow-dofn-exception]

If you want to know more about fusion optimization, it is documented at the following link, under the "Preventing Fusion" heading:

https://cloud.google.com/dataflow/service/dataflow-service-desc#Optimization

The relevant example from that section reads:

For example, one case in which fusion can limit Dataflow's ability to optimize worker usage is a "high fan-out" ParDo. In such an operation, you might have an input collection with relatively few elements, but the ParDo produces an output with hundreds or thousands of times as many elements, followed by another ParDo. If the Dataflow service fuses these ParDo operations together, parallelism in this step is limited to at most the number of items in the input collection, even though the intermediate PCollection contains many more elements.
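
To make the quoted example concrete, here is a rough fragment showing such a high fan-out ParDo, again assuming the Dataflow Java SDK 1.x API; the Read element type and the expandToReads() and AnalyzeReadFn names are hypothetical placeholders.

// Hypothetical high fan-out step: a few read group set IDs expand into many reads.
PCollection<Read> reads =
    readGroupSetIds.apply(ParDo.of(new DoFn<String, Read>() {
      @Override
      public void processElement(ProcessContext c) {
        for (Read read : expandToReads(c.element())) {  // hypothetical helper
          c.output(read);
        }
      }
    }));

// If the next ParDo over 'reads' is fused with the step above, its parallelism
// is capped at the number of read group set IDs. Inserting a GroupByKey-based
// fusion break (as in the sketch earlier in this thread) between the two steps
// lets the service redistribute the expanded reads across workers first.
PCollection<Read> unfused = reads.apply(new BreakFusion<Read>());
unfused.apply(ParDo.of(new AnalyzeReadFn()));  // hypothetical downstream step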

Hope it helps, Paul