nathanmarz / cascalog

Data processing on Hadoop without the hassle.

job-level settings not being passed on to jobs #163

Open echeran opened 11 years ago

echeran commented 11 years ago

(as from the mailing list: https://groups.google.com/forum/#!topic/cascalog-user/Rq_O33VsDyc )

I've run into a similar issue: the child JVM options specified in with-job-conf do not "stick". Last week I hit GC issues in a reducer of one of my Cascalog jobs for the first time. I found the with-job-conf macro and wrapped the query execution form with it, to no avail:

(let [snk-qry-by-chan (for [chan channels]
                        (channel-query chan))
      all-snk-qry-seq (apply concat snk-qry-by-chan)]
  ;; configure the MapReduce child JVM options to avoid GC Overhead Limit err
  (with-job-conf {"mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"}
    ;; execute all of the queries in parallel
    (apply ?- all-snk-qry-seq)))
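
A minimal sketch of the scoping one would expect from this kind of macro, assuming with-job-conf works through a Clojure dynamic binding. The names `*job-conf*`, `with-conf` and `child-opts` are hypothetical stand-ins for illustration, not Cascalog's actual implementation:

```clojure
;; Hypothetical stand-in for a job-level configuration map.
(def ^:dynamic *job-conf* {})

(defmacro with-conf
  "Hypothetical analogue of with-job-conf: merges `conf` into *job-conf*
  for the dynamic extent of `body`."
  [conf & body]
  `(binding [*job-conf* (merge *job-conf* ~conf)]
     ~@body))

(defn child-opts
  "Reads the child JVM options visible at call time."
  []
  (get *job-conf* "mapred.child.java.opts" "<cluster default>"))

;; Read while the binding is in effect: the override is visible.
(def inside
  (with-conf {"mapred.child.java.opts" "-Xmx4g"}
    (child-opts)))

;; A lazy seq created inside the binding but realized after it has
;; been popped no longer sees the override.
(def escaped
  (with-conf {"mapred.child.java.opts" "-Xmx4g"}
    (map (fn [_] (child-opts)) [:a])))

(println inside)          ; prints -Xmx4g
(println (first escaped)) ; prints <cluster default>
```

In the snippet above, `?-` is called inside the with-job-conf form, so under this assumption the override should be in scope when the jobs are submitted.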

The relevant parts of my project.clj:

  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.1"]
                 [incanter "1.4.1"]]
  :repositories {"cloudera" "https://repository.cloudera.com/artifactory/cloudera-repos"}
  :profiles {:provided {:dependencies [[org.apache.hadoop/hadoop-core  "0.20.2-cdh3u5"]]}}

But in the logging output of the reducer in question, regardless of what I specified in with-job-conf, I always saw this:

2013-07-12 17:25:55,216 INFO cascading.flow.hadoop.FlowMapper: child jvm opts: -Xmx1073741824

Further details:

I saw Robin's workaround, which seems to just modify the site-hadoop.xml. It would be great if the with-job-conf settings "stuck", so that I don't have to tweak site settings for per-job needs (especially since I don't manage the Hadoop cluster).

mjwillson commented 9 years ago

I've noticed (perhaps?) related issues in pure Cascading. Configuration properties supplied to the FlowConnector don't always get passed into the JobConf; the behaviour seems inconsistent and unpredictable. It would be good to have visibility into the JobConf and explicit, guaranteed control over it.
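
As a rough illustration of the kind of visibility meant here, the sketch below compares the settings a job asked for against the values it actually ran with. The helper `conf-mismatches` is hypothetical, and how the effective values are obtained is left open; in this example they are simply copied from the reducer log quoted earlier in this thread:

```clojure
(defn conf-mismatches
  "Returns a map of key -> {:requested v, :effective v'} for every requested
  key whose effective value differs or is missing. Hypothetical helper."
  [requested effective]
  (into {}
        (for [[k v] requested
              :let [actual (get effective k)]
              :when (not= v actual)]
          [k {:requested v :effective actual}])))

;; Requested via with-job-conf in the original report.
(def requested {"mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"})

;; Value reported by the reducer log ("child jvm opts: -Xmx1073741824").
(def effective {"mapred.child.java.opts" "-Xmx1073741824"})

(println (conf-mismatches requested effective))
```

A check like this only reports that a setting was dropped or overridden; it does not say where in the chain from FlowConnector to JobConf that happened.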