deflaux closed this 8 years ago
Hi Nicole,
Looks nice! A couple of small things:
https://github.com/deflaux/dataflow-java/blob/master/pom.xml#L117 -> s/v1beta2-rev87-1.20.0/v1-rev56-1.21.0
https://github.com/deflaux/dataflow-java/blob/master/pom.xml#L194 -> s/1.128/2.1.0
Thanks, ~p
LGTM
Thanks! For my own education, how did you discover they were being fused?
Hi Dion and Nicole,
You might want to be aware that the OpenJDK 7 builds are not working properly. If you look at the end of the raw log file, you will see the following core dump and error:
7f10a58bd000-7f10a58be000 r--p 00022000 08:01 3267 /lib/x86_64-linux-gnu/ld-2.15.so
7f10a58be000-7f10a58c0000 rw-p 00023000 08:01 3267 /lib/x86_64-linux-gnu/ld-2.15.so
7ffe40f0b000-7ffe40f2f000 rw-p 00000000 00:00 0 [stack]
7ffe40fd9000-7ffe40fdb000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
[ERROR] Aborted (core dumped)
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Skipping Google Genomics and Dataflow
[INFO] This project has been banned from the build due to previous failures.
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15.389 s
[INFO] Finished at: 2016-02-06T00:02:10+00:00
[INFO] Final Memory: 35M/217M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:cobertura-maven-plugin:2.7:instrument (default-cli) on project google-genomics-dataflow: Unable to instrument project. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
travis_time:end:0347a42a:start=1454716914055495207,finish=1454716931072952112,duration=17017456905
travis_fold:end:after_success
Done. Your build exited with 0.
~p
Hi Dion,
Regarding fusion optimizations of a Dataflow graph, you can monitor your worker logs and look for the fused steps to see the worker load. Below is a screenshot from the following page, under the "Detecting an Exception in Worker Code" heading; notice the fusion steps:
If you want to know more about fusion optimization, it is documented at the following link, under the "Preventing Fusion" heading:
https://cloud.google.com/dataflow/service/dataflow-service-desc#Optimization
The specific example given there is:
For example, one case in which fusion can limit Dataflow's ability to optimize worker usage is a "high fan-out" ParDo. In such an operation, you might have an input collection with relatively few elements, but the ParDo produces an output with hundreds or thousands of times as many elements, followed by another ParDo. If the Dataflow service fuses these ParDo operations together, parallelism in this step is limited to at most the number of items in the input collection, even though the intermediate PCollection contains many more elements.
Hope it helps, Paul
This was tested manually to ensure that the number of active Dataflow tasks was larger than the number of read group sets.
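As a rough illustration of why that manual check is meaningful, here is a minimal Java sketch of the parallelism bound described in the quoted documentation. It is a simplified model, not Dataflow SDK code: the class name, method names, and all numbers (4 read group sets, a fan-out of 1000, 50 workers) are hypothetical and chosen only to show the arithmetic. It assumes that a stage can keep at most one task busy per element feeding it, capped by the number of workers.

```java
// Simplified model of the "high fan-out ParDo" limit; not Dataflow SDK code.
// All names and numbers below are hypothetical.
final class FusionParallelismSketch {

    // A stage cannot keep more tasks busy than it has elements feeding it,
    // nor more than the number of available workers.
    static long maxParallelism(long elementsFeedingStage, long workers) {
        return Math.min(elementsFeedingStage, workers);
    }

    // Fused: both ParDos run in one stage, so only the original input
    // elements can be distributed across workers (fanOut is unused here,
    // which is exactly the point).
    static long fusedParallelism(long inputElements, long fanOut, long workers) {
        return maxParallelism(inputElements, workers);
    }

    // Not fused: the intermediate collection is materialized, so the second
    // ParDo can be distributed over inputElements * fanOut elements.
    static long unfusedParallelism(long inputElements, long fanOut, long workers) {
        return maxParallelism(Math.multiplyExact(inputElements, fanOut), workers);
    }

    public static void main(String[] args) {
        long readGroupSets = 4;   // hypothetical input size
        long fanOut = 1000;       // hypothetical outputs per input element
        long workers = 50;        // hypothetical worker count

        System.out.println("fused:   " + fusedParallelism(readGroupSets, fanOut, workers));   // prints 4
        System.out.println("unfused: " + unfusedParallelism(readGroupSets, fanOut, workers)); // prints 50
    }
}
```

Under this model, seeing more active tasks than read group sets is only possible when the high fan-out step and the step after it are not fused, which is what the manual test looked for.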