tony-framework / TonY

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
https://tony-project.ai

Make dependency-group-timeout check ignored until all tasks scheduled #623

Closed zuston closed 2 years ago

zuston commented 2 years ago

Bug Fix

When some requested resources have not been allocated yet and the dependency-timeout check is configured, the check throws an exception, for example:

2021-12-09 06:18:04 INFO  ApplicationMaster:1199 - Successfully started container container_e03_1582553233674_1290236_01_000033
2021-12-09 06:18:04 ERROR TFRuntime:149 - Failed to check dependency timeout.
java.lang.NullPointerException
    at com.linkedin.tony.runtime.MLGenericRuntime.lambda$groupDependencyTimeout$1(MLGenericRuntime.java:211)
    at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.LongPipeline.reduce(LongPipeline.java:443)
    at java.util.stream.LongPipeline.max(LongPipeline.java:406)
    at com.linkedin.tony.runtime.MLGenericRuntime.groupDependencyTimeout(MLGenericRuntime.java:212)
    at com.linkedin.tony.runtime.MLGenericRuntime.isHealthy(MLGenericRuntime.java:147)
    at com.linkedin.tony.ApplicationMaster.monitor(ApplicationMaster.java:749)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:422)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:356)

Solution

To get accurate information about the running tasks, the dependency-group-timeout check should be skipped until all tasks have been scheduled.
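The idea can be sketched in a simplified form as below. This is only an illustration, not the actual TonY code: `DependencyTimeoutSketch`, `TaskInfo`, `groupDependencyTimedOut`, and the rule of measuring the timeout from the latest task start time are all hypothetical. It assumes unscheduled tasks appear as `null` slots in the task array, which is one way a stream over the tasks could hit a NullPointerException.

```java
import java.util.Arrays;

// Hypothetical, simplified sketch; none of these names are TonY classes.
class DependencyTimeoutSketch {

    /** Stand-in for a scheduled task. */
    static final class TaskInfo {
        final String jobName;
        final long startTimeMs;

        TaskInfo(String jobName, long startTimeMs) {
            this.jobName = jobName;
            this.startTimeMs = startTimeMs;
        }
    }

    /**
     * Returns true if the (assumed) group timeout has been exceeded.
     * Slots for tasks that are not scheduled yet are expected to be null.
     */
    static boolean groupDependencyTimedOut(TaskInfo[] tasks, long nowMs, long timeoutMs) {
        // Guard: skip the check until every task has been scheduled.
        // Without it, the lambda below dereferences a null slot.
        for (TaskInfo task : tasks) {
            if (task == null) {
                return false;
            }
        }

        // Assumption for this sketch: the timeout is measured from the
        // latest start time among the tasks.
        long latestStart = Arrays.stream(tasks)
                .mapToLong(t -> t.startTimeMs)
                .max()
                .orElse(nowMs);
        return nowMs - latestStart > timeoutMs;
    }

    public static void main(String[] args) {
        TaskInfo[] partial = {new TaskInfo("worker", 1_000L), null};
        TaskInfo[] complete = {new TaskInfo("worker", 1_000L), new TaskInfo("ps", 2_000L)};

        System.out.println(groupDependencyTimedOut(partial, 10_000L, 500L));
        System.out.println(groupDependencyTimedOut(complete, 10_000L, 500L));
    }
}
```

With the guard in place, a partially scheduled job is treated as "not timed out yet" instead of failing the health check, and the timeout is evaluated only once every slot is filled.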

Tip: I added a test case for the problem described above. If you revert the change in MLGenericRuntime.groupDependencyTimeout and rerun the testPartialTaskScheduledShouldPass test case, you can reproduce the problem.