twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

java.io.EOFException #1166


0x10FF commented 9 years ago

Hadoop 2.2.0, Scalding 0.9.0, Cascading 2.5.2, Scala 2.10.4

I'm running into issues when processing larger volumes of data:

2015-01-20 21:49:59,752 WARN cascading.flow.FlowStep (pool-2-thread-1): [com.co.scalding...] failure info: Application application_1421789685841_0027 failed 2 times due to Error launching appattempt_1421789685841_0027_000002. Got exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:189)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.setupTokens(AMLauncher.java:225)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.createAMContainerLaunchContext(AMLauncher.java:197)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:107)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
. Failing the application.

Has this been fixed in a prior release? The current workaround seems to be to have Scalding/Cascading produce smaller jobs, i.e. one sink/one tap, as sketched below.
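
For reference, a minimal sketch of the kind of single-tap/single-sink job I mean (the job name, paths, and ops are made up; the point is just that the flow has exactly one source and one sink):

    import com.twitter.scalding._

    // Hypothetical single-tap/single-sink job: with only one source and one sink,
    // Cascading plans a smaller flow than a multi-sink job over the same data.
    class SingleSinkJob(args: Args) extends Job(args) {
      TypedPipe.from(TextLine(args("input")))    // one tap
        .map { line => line.toLowerCase }        // a few ops
        .filter { _.nonEmpty }
        .write(TypedTsv[String](args("output"))) // one sink
    }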

Thank you for taking a look,

johnynek commented 9 years ago

Looks related to: https://issues.apache.org/jira/browse/YARN-2893

We don't see this regularly at Twitter, but it does happen intermittently. @gerashegalov any comments here?

reconditesea commented 9 years ago

Actually, this is not a rare case internally either. We've seen it on several jobs and it's still happening, but not consistently on the same job, so there is no clear pattern to the cause yet.

0x10FF commented 9 years ago

Thank you, guys. Could it be related to parallel job execution? A similar Scalding job with a simple flow of one tap -> (a few ops) -> one sink runs with no problem on the same large data set. I'm guessing this might be hiding some basic capacity problem...

gerashegalov commented 9 years ago

@johnynek Yes, it's definitely related to YARN-2893. We see a failure rate of about 2 to 12 failed jobs per day per cluster. This is a small fraction given our daily cluster throughput. The failures always come in temporally (around the same time) and physically (the same client machine) co-located groups of 2 or more jobs. The current theory is that they all belong to the same flow. The bug has not been reproducible for us on resubmission.

I get the sense that @karolw can reproduce it more deterministically; can you confirm?

0x10FF commented 9 years ago

Hi @gerashegalov, it seems I can reproduce it with large data sets. Let me know if there are any additional options I should add to my configuration to improve logging. How can I provide a more meaningful error report?

reconditesea commented 9 years ago

@karolw Are you able to consistently reproduce this bug, and do you have any more client logs for it?

0x10FF commented 9 years ago

@reconditesea, the last time I deployed our Scalding setup, the issue occurred pretty consistently when dealing with larger-than-normal data sets (normal = a chunk/slice of time, large = the entire data set). Large is ~10 GB.

I can provide more logs from different services; let me know which ones would be most beneficial to you. I'm approaching another release date, so I will have a chance to reproduce it again.

Just a side note/question: given the above stack trace, could this be hiding a heap problem?

reconditesea commented 9 years ago

@karolw We couldn't reproduce it internally, so we're still not sure what the root cause might be. Personally, I doubt it's a heap issue. Could you paste all the logs from your job submitter node?

0x10FF commented 9 years ago

@reconditesea OK, I will grab the logs from the submitter node. We use the AWS Data Pipeline service, so all service logs should be available on S3 after the run.

0x10FF commented 9 years ago

@reconditesea No luck this time. The masterInstance node was upgraded to a bigger box; otherwise, no sign of this problem. :(

gerashegalov commented 9 years ago

Looks like the issue was fixed in Cascading 2.5.5: https://github.com/Cascading/cascading/commit/45b33bb864172486ac43782a4d13329312d01c0e. See the YARN-2893 discussion.
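
If you want to pick up the fix without waiting for a Scalding release that bundles it, one option might be to force the Cascading version in your own build. A minimal sketch for an sbt build; the coordinates and resolver here are assumptions, so adjust them to match your setup:

    // build.sbt (sketch): override the Cascading version that Scalding pulls in.
    // Cascading artifacts for this era are typically published to Conjars.
    resolvers += "conjars" at "http://conjars.org/repo"

    dependencyOverrides ++= Set(
      "cascading" % "cascading-core"   % "2.5.5",
      "cascading" % "cascading-hadoop" % "2.5.5"
    )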

0x10FF commented 9 years ago

@gerashegalov Thank you for sharing. Going to try it out in the next release.