Use EMR release 4.4.0 #29

omarkhan commented 8 years ago

This pull request updates the analytics configuration to work on EMR release 4.4.0. It is based on work done in https://github.com/open-craft/edx-analytics-configuration/pull/1.

Tested with https://github.com/edx/edx-analytics-pipeline/pull/214, which also needs to be used for this to work.

So far I have only tested that clusters provision successfully and that nodes include all the necessary software. I have not been able to actually run any tasks.

Changes:

Use hadoop, ganglia, hive, pig, and sqoop from the EMR distribution
Use m1.medium instances as a minimum, otherwise nodes fail to start

JIRA ticket: AN-6777

openedx-webhooks commented 8 years ago

Thanks for the pull request, @omarkhan! It looks like you're a member of a company that does contract work for edX. If you're doing this work as part of a paid contract with edX, you should talk to edX about who will review this pull request. If this work is not part of a paid contract with edX, then you should ensure that there is an OSPR issue to track this work in JIRA, so that we don't lose track of your pull request.

To automatically create an OSPR issue for this pull request, just visit this link: https://openedx-webhooks.herokuapp.com/github/process_pr?number=29&repo=edx%2Fedx-analytics-configuration

brianhw commented 8 years ago

@omarkhan My apologies for commenting on this prematurely, but I'm concerned about the approach of removing support for other EMR versions. At edX at least, I expect that we will need to move from old to new versions in stages, as the various task workflows are validated on both old and new versions. This PR would not be mergeable until all tasks have been validated. (FYI: @mulby)

omarkhan commented 8 years ago

@brianhw noted, thanks for letting me know. I will bring back support for the Debian-based EMR release.

omarkhan commented 8 years ago

@mtyaka @brianhw this now supports the old Debian squeeze-based EMR release again. Note this change to the pipeline: https://github.com/edx/edx-analytics-pipeline/pull/226

omarkhan commented 8 years ago

@brianhw I have tested this manually on EMR 4.4.0 with the AnswerDistributionWorkflow task and it seems to work, as long as I use this version of ManifestTextInputFormat instead of the oddjob one. I am not sure how to run the other tasks though. Can we run the acceptance tests on EMR 4.4.0? How do we do that?

mtyaka commented 8 years ago

The changes look good to me. I was able to provision a 4.4.0 EMR cluster and successfully run a test task :+1:

omarkhan commented 8 years ago

@mulby @brianhw this works for us now. What further steps do we need to take to get this merged?

omarkhan commented 8 years ago

@mulby @brianhw to be more specific, we have been able to run these jobs:

AnswerDistributionWorkflow
ImportEnrollmentsIntoMysql
InsertToMysqlCourseEnrollByCountryWorkflow
CourseActivityWeeklyTask
InsertToMysqlAllVideoTask

ImportEnrollmentsIntoMysql and InsertToMysqlAllVideoTask are complaining about empty result sets though:

Traceback (most recent call last):
  File "/var/lib/analytics-tasks/automation/venv/local/lib/python2.7/site-packages/luigi/worker.py", line 292, in _run_task
    task.run()
  File "/var/lib/analytics-tasks/automation/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/mysql_load.py", line 330, in run
    self.insert_rows(cursor)
  File "/var/lib/analytics-tasks/automation/venv/local/lib/python2.7/site-packages/edx/analytics/tasks/mysql_load.py", line 300, in insert_rows
    raise Exception('Cannot overwrite a table with an empty result set.')
Exception: Cannot overwrite a table with an empty result set.

Is this normal? It does the same thing on both EMR 2.4.11 and on 4.4.0.

brianhw commented 8 years ago

AWS has actually released EMR 4.5.0, so we hope to be trying that as well with this code.

The error message you are getting occurs when a Load-Mysql task tries to read from a Hive table that contains no data. The Hive table, in turn, may be empty because of an error in running a job previously, which may have left around an empty directory as its output. Often that can be remedied by either deleting the empty directory or running with a different interval that would write to a different output directory (i.e. a different partition). If the same parameters are passed to the job on 2.4.11 and 4.4.0, and the empty directory exists, then I would indeed expect the same failure to occur for either. (We're hoping to make changes in the near future so that such empty directories are less likely to occur. They happen when a Hive "create table" succeeds but a subsequent populate task fails.)
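As an illustration of the failure mode described above, here is a minimal sketch of how an empty output directory could be detected before a load is attempted. It is not the pipeline's code: `is_empty_output` is a hypothetical helper, it walks a local directory as a stand-in for the real S3 or HDFS location, and `insert_rows` only mirrors the guard visible in the traceback.

```python
import os
import tempfile


def is_empty_output(path):
    """Return True if `path` holds no non-empty data files.

    Hypothetical helper: a local directory stands in for the Hive
    table's storage location. Files whose names start with '_' or '.'
    (for example Hadoop's _SUCCESS marker) are ignored. A path that
    does not exist is also reported as empty.
    """
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.startswith(("_", ".")):
                continue
            if os.path.getsize(os.path.join(root, name)) > 0:
                return False
    return True


def insert_rows(rows):
    """Mirror the guard from the traceback: refuse an empty result set."""
    rows = list(rows)
    if not rows:
        raise Exception('Cannot overwrite a table with an empty result set.')
    return len(rows)


if __name__ == "__main__":
    partition = tempfile.mkdtemp()
    # A freshly created directory models the case where "create table"
    # succeeded but the populate step wrote nothing.
    print(is_empty_output(partition))
```

A check like this could run before the load step, so that an empty partition is reported by name instead of surfacing later as the generic exception.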

I think the next steps are for us to verify that one of our 2.4.11 jobs can run with this code (and the corresponding code in pipeline), and for us to confirm that acceptance tests work on a 4.5.0 (or failing that, 4.4.0) cluster. Then we could merge, and start moving tasks over gradually.

omarkhan commented 8 years ago

Thanks for the guidance @brianhw, much appreciated. I will investigate why the Hive table is empty.

I will also try running these with EMR 4.5.0. The changes look pretty minor, I don't expect there to be any problems.

brianhw commented 8 years ago

Yes, but it looks like 4.5.0 moves from Hadoop 2.7.1 to 2.7.2, so there's always a risk with that. : )

omarkhan commented 8 years ago

I have tried running the task again, with a different interval. The CourseEnrollmentTableTask runs for a while, but when it completes the Hive table appears empty.

I will do some more digging to understand why this is happening.

omarkhan commented 8 years ago

I have tried running the mapper and reducer manually on the same data I have been using on EMR, and it all seems to work fine. Something may be going wrong when writing the data out to a Hive table.

mulby commented 8 years ago

@omarkhan it sounds like you may be using an interval in which there are no enrollment events. Double check the date ranges of enrollment events in your input tracking logs and pass that date interval into the enrollment task.

If there is a mismatch between the input data and the interval, the job will run successfully but produce no output.
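A small sketch of why a mismatched interval yields a successful run with no output. The event records and the half-open `[start, end)` interval convention are assumptions made for illustration, and `events_in_interval` is a hypothetical helper, not a pipeline function.

```python
from datetime import date


def events_in_interval(events, start, end):
    """Keep events whose date falls in the half-open interval [start, end)."""
    return [e for e in events if start <= e["date"] < end]


# Made-up sample events, standing in for parsed tracking log lines.
events = [
    {"date": date(2015, 3, 1), "event_type": "edx.course.enrollment.activated"},
    {"date": date(2015, 3, 2), "event_type": "edx.course.enrollment.deactivated"},
]

# Interval that covers the events: both are kept.
matching = events_in_interval(events, date(2015, 1, 1), date(2016, 1, 1))

# Interval that misses the events: nothing fails, the result is just empty.
mismatched = events_in_interval(events, date(2016, 1, 1), date(2016, 2, 1))
```

The second call raises no error, which is why the job can finish cleanly and still leave an empty table behind.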

omarkhan commented 8 years ago

Thanks @mulby. I checked that already, I'm using the last 3+ years as the interval, and there are definitely enrolment events in that timeframe in the tracking logs I am using.

I have set up a simple word count task that bypasses the event tracking logic to try and isolate the problem.
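For reference, a word count of this kind can be sketched in a few lines. This is not the task that was actually run on the cluster: it is a plain-Python illustration of the map, group, and reduce steps, with `run_local` as a hypothetical stand-in for the Hadoop shuffle.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def run_local(lines):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())
```

Because it never parses events or applies an interval, a job like this exercises only the read, shuffle, and write path, which is what makes it useful for isolating the problem.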

mulby commented 8 years ago

Also check the manifest file that is generated to ensure that the task is finding all of your tracking logs correctly.

omarkhan commented 8 years ago

It's not using a manifest as I only have around 400 files. It passes them in directly as input to mrrunner.py.

mulby commented 8 years ago

And you are confident they are being read correctly?

omarkhan commented 8 years ago

I'm not. I just ran a simple word count task and it worked fine, the output ended up in S3 as expected. So it may not be reading the input correctly.

mulby commented 8 years ago

Once this passes regression tests, it looks good to me :+1:

omarkhan commented 8 years ago

Thanks @mulby. Do you have a test setup for this?

mulby commented 8 years ago

@omarkhan we have a manually triggered acceptance test suite that we run.

mulby commented 8 years ago

@brianhw did this pass the tests? If so, we should merge into master.

brianhw commented 8 years ago

I'm going to use this branch for running release candidate branches for releasing pipeline with the luigi patch. That should test the 2.4.11 behavior. I did run acceptance tests against 4.5.0, and they didn't all pass, but I didn't expect them all to. That will require more inspection, but I expect there to be some system installs that need to be specified in the pipeline Makefile at the very least (e.g. for gpg).

brianhw commented 8 years ago

Okay. Release candidate runs all started fine, and one finished with the same results. So this looks good for running against 2.4.11 configurations.

I also looked at the AcceptanceTest output. I was pleased to see that most of the tests actually completed successfully. One failure was expected (a bug that has now been fixed), but there were four other failures. Three were due to problems with the manifest file format: the two answer_distribution tests and the event export test. The error was subtly reported at info level:

2016-04-11 05:37:01,132 INFO 29699 [luigi-interface] hadoop.py:265 - -inputformat : class not found : oddjob.ManifestTextInputFormat
2016-04-11 05:37:01,133 INFO 29699 [luigi-interface] hadoop.py:265 - Try -help for more information
2016-04-11 05:37:01,133 INFO 29699 [luigi-interface] hadoop.py:265 - Streaming Command Failed!
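Since the failure is logged at INFO rather than ERROR, filtering by level will miss it. A minimal sketch of scanning for the message text instead, using the lines above as sample input; the marker strings are taken from that output, and `find_streaming_failures` is a hypothetical helper.

```python
# Substrings taken from the log output above; matched regardless of level.
FAILURE_MARKERS = ("class not found", "Streaming Command Failed!")


def find_streaming_failures(log_lines):
    """Return the log lines that contain any known failure marker."""
    return [
        line for line in log_lines
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


sample = [
    "2016-04-11 05:37:01,132 INFO 29699 [luigi-interface] hadoop.py:265 - "
    "-inputformat : class not found : oddjob.ManifestTextInputFormat",
    "2016-04-11 05:37:01,133 INFO 29699 [luigi-interface] hadoop.py:265 - "
    "Try -help for more information",
    "2016-04-11 05:37:01,133 INFO 29699 [luigi-interface] hadoop.py:265 - "
    "Streaming Command Failed!",
]

hits = find_streaming_failures(sample)
```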

For answer distribution, the tests hardcode INPUT_FORMAT = 'oddjob.ManifestTextInputFormat'. For event export, it is in EventExportAcceptanceTest.test_event_log_exports_using_manifest, and it is presumably using the input format specified in the config file, which is going to also be wrong.

For the record, the last failure is due to a "diff" mismatch in output from EventExportByCourseAcceptanceTest.test_events_export_by_course, because event keys were not output in the same order as input. The comparison needs to be made more robust.
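One way the comparison could be made order-insensitive is to parse each line and compare the resulting objects rather than the raw text. This is only a sketch: it assumes the exported events are one JSON object per line, and `same_events` is a hypothetical helper, not part of the acceptance test suite.

```python
import json


def parse_events(lines):
    """Parse non-blank lines as JSON objects, preserving line order."""
    return [json.loads(line) for line in lines if line.strip()]


def same_events(expected_lines, actual_lines):
    """Compare events line by line, ignoring key order inside each object.

    Python dicts compare equal regardless of key order, so only the
    content of each event and the order of the lines matter.
    """
    return parse_events(expected_lines) == parse_events(actual_lines)
```

A textual diff treats `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` as different, while this comparison treats them as the same event.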

Anyway, I'm wondering if your issues with some jobs not producing output might be related to the input format needing to change. I'm assuming you know about it, since you said you did get answer distribution to work, but maybe it's not getting set properly for the other jobs.

brianhw commented 8 years ago

Oh, and :+1: for merging this PR!

brianhw commented 8 years ago

Actually, this should get squashed first, before merging.

mtyaka commented 8 years ago

@brianhw Thanks for the review and the details! I squashed the commits as @omarkhan is away this week. We don't have merge permissions in this repo - can you merge, please?

omarkhan commented 8 years ago

@brianhw thanks for merging. Re the manifest issue you mentioned: the oddjob class you are using fails on recent releases of EMR with this cryptic error:

Exception in thread "main" java.lang.ExceptionInInitializerError
  at clojure.core__init.__init0(Unknown Source)
  at clojure.core__init.<clinit>(Unknown Source)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:278)
  at clojure.lang.RT.loadClassForName(RT.java:2098)
  at clojure.lang.RT.load(RT.java:430)
  at clojure.lang.RT.load(RT.java:411)
  at clojure.lang.RT.doInit(RT.java:447)
  at clojure.lang.RT.<clinit>(RT.java:329)
  at clojure.lang.Namespace.<init>(Namespace.java:34)
  at clojure.lang.Namespace.findOrCreate(Namespace.java:176)
  at clojure.lang.Var.internPrivate(Var.java:163)
  at oddjob.ManifestTextInputFormat.<clinit>(Unknown Source)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:278)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.streaming.StreamUtil.goodClassOrNull(StreamUtil.java:51)
  at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:897)
  at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:124)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
  at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalStateException: Attempting to call unbound fn: #'clojure.core/refer
  at clojure.lang.Var$Unbound.throwArity(Var.java:43)
  at clojure.lang.AFn.invoke(AFn.java:39)
  at clojure.lang.Var.invoke(Var.java:415)
  at clojure.lang.RT.doInit(RT.java:460)
  at clojure.lang.RT.<clinit>(RT.java:329)
  ... 29 more

It seems to work if you use this class instead.

I don't think the issue I am having with CourseEnrollmentTask not producing any output has anything to do with the manifest format, as there are few enough input files to be passed directly on the command line. Also I have the same problem with the old EMR version and the new release, so I don't think it's a regression. Since this PR is now merged and you don't have the problem, I am giving up for now.

brianhw commented 8 years ago

@omarkhan Thanks for the pointer to the class. That is what we have been using for all Hadoop 2 runs (including EMR 4.4+). My point was that the manifest input format is specified by class name, so we have made it a configuration variable to pass into the answer-distribution acceptance tests, so that it can be changed when acceptance tests are run against EMR 2.4.11 versus 4.5.0. But if you were having the same problem with 2.4.11, then this is indeed probably not your issue.
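A minimal sketch of treating the input format as configuration rather than a hardcoded constant. The environment variable name `MANIFEST_INPUT_FORMAT` and both helper functions are hypothetical, chosen only for illustration; the actual configuration mechanism used by the acceptance tests is not shown in this thread. The `-inputformat` option is the one that appears in the streaming log output earlier.

```python
import os

# Class name that the answer distribution tests hardcoded.
DEFAULT_INPUT_FORMAT = "oddjob.ManifestTextInputFormat"


def manifest_input_format(environ=os.environ):
    """Return the configured input format class name, or the default.

    MANIFEST_INPUT_FORMAT is a hypothetical variable name used here
    for illustration only.
    """
    return environ.get("MANIFEST_INPUT_FORMAT", DEFAULT_INPUT_FORMAT)


def streaming_args(input_format):
    """Build the Hadoop streaming arguments that select the input format."""
    return ["-inputformat", input_format]
```

With this shape, a run against one EMR release can keep the default while a run against another sets the variable to whichever replacement class is installed on that cluster.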
