ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0
For MrJob, make unpacking archives optional #93

Open anjackson opened 2 years ago

anjackson commented 2 years ago

I'm developing a fork of MrJob that makes unpacking archives optional, here: https://github.com/ukwa/mrjob/tree/make-unpacking-archives-optional

It should be possible to install this into a venv using:

pip install -U git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional

If that works, then update the MrJob config as per the updated docs:

runners:
    hadoop:
        unpack_archives: false

Running the job with this configuration should skip the unpacking-archives step and leave the files as they were.
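As a rough illustration of the configuration step, the fragment above can be written to a standalone config file and passed to the job explicitly. This is only a sketch: the `unpack_archives` option exists only in the fork branch discussed here, not in upstream mrjob, and the helper name `write_mrjob_conf` and the temporary location are assumptions made for the example.

```python
import os
import tempfile

# NOTE: `unpack_archives` is provided by the ukwa/mrjob fork branch only;
# stock mrjob does not know about this option.
CONF_TEXT = """\
runners:
    hadoop:
        unpack_archives: false
"""


def write_mrjob_conf(directory):
    """Write the config fragment to <directory>/mrjob.conf and return its path.

    Hypothetical helper, used here only to keep the example self-contained.
    """
    path = os.path.join(directory, "mrjob.conf")
    with open(path, "w") as handle:
        handle.write(CONF_TEXT)
    return path


conf_path = write_mrjob_conf(tempfile.mkdtemp())
print(conf_path)
```

The resulting file can then be handed to a job with mrjob's `--conf-path` switch (or the `MRJOB_CONF` environment variable), which keeps the experimental setting out of any shared default config.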

EDIT: If this works, I'll try to contribute the change back upstream.

anjackson commented 2 years ago

Can you try this out @GilHoggarth and see if it works?

GilHoggarth commented 2 years ago

The installation line fails to pull in the patch, which I guess is due to the latest policy changes around GitHub access.

FYI, I'm running:

(venv) [hdfsadmin@nlsh3httpfs generate_checksums]$ python3 -m pip --proxy http://explorer.bl.uk:3127/ install -U git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional
Collecting git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional
  Cloning https://github.com/ukwa/mrjob.git (to revision make-unpacking-archives-optional) to /tmp/pip-req-build-jjqmh0b4
fatal: unable to access 'https://github.com/ukwa/mrjob.git/': Failed connect to github.com:443; Connection timed out
Command "git clone -q https://github.com/ukwa/mrjob.git /tmp/pip-req-build-jjqmh0b4" failed with error code 128 in None

This installation works with mrjob:

(venv) [hdfsadmin@nlsh3httpfs generate_checksums]$ python3 -m pip --proxy http://explorer.bl.uk:3127/ install mrjob
Requirement already satisfied: mrjob in ./venv/lib/python3.7/site-packages (0.7.4)
Requirement already satisfied: PyYAML>=3.10 in ./venv/lib/python3.7/site-packages (from mrjob) (6.0)

GilHoggarth commented 2 years ago

Hacking the changes directly into bin.py, I now get:

Traceback (most recent call last):
  File "generate_checksums.py", line 20, in <module>
    MRGenerateChecksum.run()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 616, in run
    cls().execute()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 687, in execute
    self.run_job()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/job.py", line 636, in run_job
    runner.run()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/runner.py", line 503, in run
    self._run()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/hadoop.py", line 326, in _run
    self._create_setup_wrapper_scripts()
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 446, in _create_setup_wrapper_scripts
    manifest=True)
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 495, in _write_setup_script
    setup, manifest=manifest, wrap_python=wrap_python)
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 595, in _setup_wrapper_script_content
    lines.extend(self._manifest_download_content())
  File "/home/hdfsadmin/generate_checksums/venv/lib/python3.7/site-packages/mrjob/bin.py", line 693, in _manifest_download_content
    if self._opts['unpack_archives']:
KeyError: 'unpack_archives'

Quite understandably, you might suspect I made a mess of adding your code!
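The KeyError means the `unpack_archives` key is absent from `self._opts` at that point. There are two obvious ways to guard the lookup, and they differ in what happens when the option is missing. The sketch below uses a plain dict in place of mrjob's `self._opts`; both function names are hypothetical and neither is taken from the fork.

```python
def should_unpack_membership_check(opts):
    """Membership test plus comparison: a missing key means 'do not unpack'."""
    return 'unpack_archives' in opts and opts['unpack_archives'] != False


def should_unpack_default_true(opts):
    """dict.get with a default: a missing key falls back to unpacking."""
    return bool(opts.get('unpack_archives', True))


# Compare the two guards for a missing, true and false setting.
for opts in ({}, {'unpack_archives': True}, {'unpack_archives': False}):
    print(opts,
          should_unpack_membership_check(opts),
          should_unpack_default_true(opts))
```

The two guards agree whenever the option is set explicitly. They disagree only for a missing key: the membership check silently skips unpacking, whereas the `.get(..., True)` form preserves the existing unpack-by-default behaviour for jobs that never mention the option.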

GilHoggarth commented 2 years ago

If I change line 693 of bin.py to

if 'unpack_archives' in self._opts and self._opts['unpack_archives'] != False:

the job now starts, but eventually fails as a MapReduce job:

map 0% reduce 0%
  Task Id : attempt_1645461135252_47322_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

  Task Id : attempt_1645461135252_47322_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

  Task Id : attempt_1645461135252_47322_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:350)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

   map 100% reduce 0%
  Job job_1645461135252_47322 failed with state FAILED due to: Task failed task_1645461135252_47322_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

anjackson commented 2 years ago

Installation seems to work if the https_proxy environment variable is set appropriately.
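A likely explanation, offered as an assumption rather than something confirmed in this thread, is that pip's `--proxy` switch covers pip's own HTTP requests, while the `git clone` it spawns for a `git+https` requirement takes its proxy settings from the environment. A minimal sketch of building the install command with `https_proxy` set, reusing the proxy URL and branch from the commands above (the function name `pip_install_with_proxy` is hypothetical):

```python
import os
import sys

PROXY = "http://explorer.bl.uk:3127/"  # proxy URL quoted earlier in this thread
FORK = "git+https://github.com/ukwa/mrjob.git@make-unpacking-archives-optional"


def pip_install_with_proxy(requirement, proxy, base_env=None):
    """Return (command, environment) for a pip install with https_proxy set.

    Nothing is executed here; the caller decides whether to run the command.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["https_proxy"] = proxy  # read by git (and by pip) for HTTPS traffic
    cmd = [sys.executable, "-m", "pip", "install", "-U", requirement]
    return cmd, env


cmd, env = pip_install_with_proxy(FORK, PROXY)
print(" ".join(cmd))
# To actually run the install:
#   import subprocess; subprocess.run(cmd, env=env, check=True)
```

Building the environment separately from running the command makes it easy to check what will be executed before touching the venv.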

There were a few issues with the implementation, but it seems to work okay now, as of https://github.com/ukwa/mrjob/commit/e3901a21b397374ca001a974c95bb544eba6bb61

anjackson commented 2 years ago

I've opened a PR (https://github.com/Yelp/mrjob/pull/2215) but we can just install our branch for now.

GilHoggarth commented 2 years ago

Installed via pip and seen to be working.

GilHoggarth commented 2 years ago

For the purposes of our Hadoop data migration, this patch works successfully. However, you might wish to keep this ticket open whilst the patch is waiting to be included upstream. Consequently, I'm unassigning myself from this ticket.

anjackson commented 1 year ago

Hm, I attempted to request a review in https://groups.google.com/g/mrjob but my post isn't turning up. Perhaps I messed up posting there somehow?