Open anjackson opened 2 years ago
Can you try this out @GilHoggarth and see if it works?
The installation line fails to pull in the patch, which I guess is due to the latest policy changes around GitHub access.
FYI, I'm running:
(venv) [hdfsadmin@nlsh3httpfs generate_checksums]$ python3 -m pip --proxy http://explorer.bl.uk:3127/ install -U git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional
Collecting git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional
Cloning https://github.com/ukwa/mrjob.git (to revision make-unpacking-archives-optional) to /tmp/pip-req-build-jjqmh0b4
fatal: unable to access 'https://github.com/ukwa/mrjob.git/': Failed connect to github.com:443; Connection timed out
Command "git clone -q https://github.com/ukwa/mrjob.git /tmp/pip-req-build-jjqmh0b4" failed with error code 128 in None
Installing plain mrjob through the same proxy works:
(venv) [hdfsadmin@nlsh3httpfs generate_checksums]$ python3 -m pip --proxy http://explorer.bl.uk:3127/ install mrjob
Requirement already satisfied: mrjob in ./venv/lib/python3.7/site-packages (0.7.4)
Requirement already satisfied: PyYAML>=3.10 in ./venv/lib/python3.7/site-packages (from mrjob) (6.0)
Hacking the changes directly into bin.py, I now get:
Traceback (most recent call last):
File "generate_checksums.py", line 20, in <module>
MRGenerateChecksum.run()
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 616, in run
cls().execute()
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 687, in execute
self.run_job()
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 636, in run_job
runner.run()
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/runner.py", line 503, in run
self._run()
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/hadoop.py", line 326, in _run
self._create_setup_wrapper_scripts()
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 446, in _create_setup_wrapper_scripts
manifest=True)
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 495, in _write_setup_script
setup, manifest=manifest, wrap_python=wrap_python)
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 595, in _setup_wrapper_script_content
lines.extend(self._manifest_download_content())
File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 693, in _manifest_download_content
if self._opts['unpack_archives']:
KeyError: 'unpack_archives'
Quite understandably, you'll expect I made a mess of adding your code!
If I change line 693 to:
if 'unpack_archives' in self._opts and self._opts['unpack_archives'] != False:
then mrjob now runs, but the job eventually fails as a MapReduce job:
map 0% reduce 0%
Task Id : attempt_1645461135252_47322_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
Task Id : attempt_1645461135252_47322_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
Task Id : attempt_1645461135252_47322_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
map 100% reduce 0%
Job job_1645461135252_47322 failed with state FAILED due to: Task failed task_1645461135252_47322_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
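For reference, the guarded check on line 693 described above can be sketched in isolation. This is only an illustration, not the code in the fork: should_unpack_archives is a hypothetical helper, opts stands in for self._opts, and treating a missing key as "do not unpack" is an assumption that mirrors the workaround.

```python
def should_unpack_archives(opts):
    """Return True only if 'unpack_archives' is present and truthy.

    A missing key is treated as "do not unpack" (assumption), so a
    runner that never registered the option no longer raises KeyError.
    """
    return bool(opts.get('unpack_archives', False))


# Option absent: the lookup that raised KeyError now simply yields False.
print(should_unpack_archives({}))
# Option explicitly enabled or disabled.
print(should_unpack_archives({'unpack_archives': True}))
print(should_unpack_archives({'unpack_archives': False}))
```

One difference from the hand-edited line: this version uses truthiness, so a value of None counts as "do not unpack", whereas the != False comparison would treat None as "unpack".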
Installation seems to work if the https_proxy environment variable is set appropriately.
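A likely explanation is that pip's --proxy option only covers pip's own HTTP requests, while the git clone it spawns for a git+https requirement picks up proxy settings from the environment. A minimal sketch of running the install with https_proxy set, assuming the proxy URL and branch name from the commands earlier in this thread; proxied_env and install_fork are hypothetical helper names:

```python
import os
import subprocess
import sys

# Values taken from the commands quoted earlier in this thread.
PROXY = "http://explorer.bl.uk:3127/"
FORK = "git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional"


def proxied_env(proxy_url, base_env=None):
    """Return a copy of the environment with https_proxy set."""
    env = dict(os.environ if base_env is None else base_env)
    env["https_proxy"] = proxy_url
    return env


def install_fork(proxy_url=PROXY, requirement=FORK):
    """Run pip in the current interpreter's environment (e.g. the venv).

    The git subprocess started by pip inherits https_proxy from env.
    """
    cmd = [sys.executable, "-m", "pip", "install", "-U", requirement]
    return subprocess.run(cmd, env=proxied_env(proxy_url), check=True)
```

Calling install_fork() from inside the activated venv would then attempt the same install as the failing command, but with the proxy visible to git as well as pip.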
There were a few issues with the implementation, but it seems to work okay now, as per https://github.com/ukwa/mrjob/commit/e3901a21b397374ca001a974c95bb544eba6bb61
I've opened a PR (https://github.com/Yelp/mrjob/pull/2215) but we can just install our branch for now.
Installed via pip and seen to be working.
For the purposes of our Hadoop data migration, this patch works successfully. However, you might wish to keep this ticket open whilst the patch is waiting to be included upstream. Consequently, I'm unassigning myself from this ticket.
Hm, attempted to request review in https://groups.google.com/g/mrjob but my post isn't turning up. Unless I messed up posting there somehow?
I'm developing a fork of MrJob that makes unpacking archives optional, here: https://github.com/ukwa/mrjob/tree/make-unpacking-archives-optional
It should be possible to install this into a venv using:
If that works, then update the MrJob config as per the updated docs:
Running the job with this configuration should skip the unpacking-archives step and leave the files as they were.
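As a rough illustration of what such a configuration might contain, the sketch below builds the settings as a Python dict and renders them as JSON (mrjob.conf is normally YAML, and JSON is valid YAML). The option name unpack_archives is taken from the traceback in this thread; placing it under the hadoop runner, and the helper names, are assumptions, so the fork's own docs remain the authority.

```python
import json


def hadoop_runner_conf(unpack_archives=False):
    """Build an mrjob.conf-style dict for the hadoop runner.

    'unpack_archives' is the option name seen in the traceback; its
    placement under runners/hadoop is an assumption.
    """
    return {"runners": {"hadoop": {"unpack_archives": unpack_archives}}}


def render_conf(conf):
    """Render the config as indented JSON text."""
    return json.dumps(conf, indent=2)


# Print a config fragment that disables archive unpacking.
print(render_conf(hadoop_runner_conf()))
```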
EDIT: If this works, I'll try to contribute the change back upstream.