Open · 0x10FF opened this issue 9 years ago
Looks related to: https://issues.apache.org/jira/browse/YARN-2893
We see this at Twitter intermittently rather than regularly. @gerashegalov any comments here?
Actually this is not a rare case internally either. We saw it on several jobs and it's still happening, but not consistently on the same job. So there is no clear pattern of the cause yet.
Thank you guys. Could it be related to parallel job execution? A similar Scalding job with a simple flow of one tap → (a few ops) → sink runs with no problem on the same large data set. I'm guessing this might be hiding some basic capacity problem...
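For reference, the job that runs fine has roughly this shape. This is a minimal sketch using the typed Scalding API; the class name, paths, and operations are illustrative rather than our actual code:

```scala
import com.twitter.scalding._

// One source tap -> a few operations -> one sink: the shape that runs cleanly
// for us even on the large data set. Names and paths are hypothetical.
class SimpleFlowJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))            // single tap
    .flatMap(_.split("\\s+"))                        // a few simple ops
    .map(word => (word, 1L))
    .sumByKey                                        // one reduce phase
    .write(TypedTsv[(String, Long)](args("output"))) // single sink
}
```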
@johnynek Yes, it's definitely related to YARN-2893. We see a failure rate of about 2 to 12 failed jobs per day per cluster, which is a small fraction given our daily cluster throughput. The failures always come in groups of 2 or more jobs that are co-located temporally (around the same time) and physically (the same client machine). The current theory is that they all belong to the same flow. The bug has not been reproducible for us on resubmission.
I get the sense that @karolw can reproduce it more deterministically; can you confirm?
Hi @gerashegalov, it seems I can reproduce it with large data sets. Let me know if there are any additional options I should add to my configuration to improve logging. How can I provide a more meaningful error report?
@karolw Are you able to consistently reproduce this bug and have any more client logs for it?
@reconditesea, the last time I deployed our Scalding setup, the issue occurred pretty consistently when dealing with larger-than-normal data sets (normal = a chunk/slice of time, large = the entire data set). Large is ~10 GB.
I can provide more logs from different services; let me know which ones would be most beneficial to you. I'm approaching another release date, so I will have a chance to reproduce it again.
Just a side note/question: given the above stack trace, could this be hiding a heap problem?
@karolw We couldn't reproduce it internally, so we're still not sure what the root cause might be. Personally, I doubt it's a heap issue. Could you paste all the logs from your job submitter node?
@reconditesea OK, I will grab the logs from the submitter node. We use the AWS Data Pipeline service, so all service logs should be available on S3 after the run.
@reconditesea No luck this time. The masterInstance node was bumped up to a bigger box; otherwise, there was no sign of this problem. :(
Looks like the issue was fixed in Cascading 2.5.5: https://github.com/Cascading/cascading/commit/45b33bb864172486ac43782a4d13329312d01c0e. See the YARN-2893 discussion.
@gerashegalov Thank you for sharing. Going to try it out in the next release.
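If it helps anyone else following along, this is roughly how I plan to pull the fixed Cascading into our Scalding 0.9.0 build. A sketch for an sbt 0.13 build; the override approach and the ConJars resolver are assumptions about our setup, so verify what your build actually resolves:

```scala
// build.sbt (sketch): force Cascading 2.5.5, which contains the fix linked above.
// Scalding 0.9.0 pulls Cascading in transitively, so we override the version.
resolvers += "ConJars" at "http://conjars.org/repo"

dependencyOverrides ++= Set(
  "cascading" % "cascading-core"   % "2.5.5",
  "cascading" % "cascading-hadoop" % "2.5.5"
)
```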
Hadoop 2.2.0, Scalding 0.9.0, Cascading 2.5.2, Scala 2.10.4
I'm running into issues while processing larger volumes of data:
Has this been fixed in a prior release? The current workaround seems to be having Scalding/Cascading produce smaller jobs, i.e. one tap/one sink; see the sketch below.
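To make that workaround concrete, here is a rough sketch of what I mean by one-tap/one-sink jobs. The job names, paths, and operations are hypothetical, not my actual pipeline; the point is that each job writes a single sink and the jobs are run one after another instead of letting one flow submit several steps at once:

```scala
import com.twitter.scalding._

// Instead of a single job that writes two sinks (which Cascading plans as
// multiple steps that can be submitted in parallel), split the work into
// two single-sink jobs and run them sequentially.

class WriteCountsJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .map(line => (line.split("\t").head, 1L))
    .sumByKey
    .write(TypedTsv[(String, Long)](args("counts")))
}

class WriteErrorLinesJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .filter(_.contains("ERROR"))
    .write(TypedTsv[String](args("errors")))
}
```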
Thank you for taking a look,