Here's the log from a batch job - it doesn't look like it ever actually completes. stderr.txt
A number of different exceptions, summarised as follows:
We need to ensure these are handled appropriately. Not sure which of them causes the job to fail.
This should be fixed with #994
@j-coll Afraid there are more errors:
java.io.IOException: Cannot run program "hadoop": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at java.lang.Runtime.exec(Runtime.java:620)
at java.lang.Runtime.exec(Runtime.java:485)
at org.opencb.commons.exec.Command.run(Command.java:93)
at org.opencb.opencga.storage.hadoop.variant.executors.ExternalMRExecutor.run(ExternalMRExecutor.java:49)
at org.opencb.opencga.storage.hadoop.variant.executors.ExternalMRExecutor.run(ExternalMRExecutor.java:39)
at org.opencb.opencga.storage.hadoop.variant.executors.MRExecutor.run(MRExecutor.java:71)
at org.opencb.opencga.storage.hadoop.variant.executors.MRExecutor.run(MRExecutor.java:52)
at org.opencb.opencga.storage.hadoop.variant.stats.HadoopMRVariantStatisticsManager.calculateStatistics(HadoopMRVariantStatisticsManager.java:70)
at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStats(VariantStorageEngine.java:491)
at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStatsForLoadedFiles(VariantStorageEngine.java:545)
at org.opencb.opencga.storage.core.variant.VariantStorageEngine.index(VariantStorageEngine.java:341)
at org.opencb.opencga.storage.hadoop.variant.HadoopVariantStorageEngine.index(HadoopVariantStorageEngine.java:230)
at org.opencb.opencga.storage.core.manager.variant.operations.VariantFileIndexerStorageOperation.index(VariantFileIndexerStorageOperation.java:280)
at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:181)
at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:172)
at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.index(VariantCommandExecutor.java:303)
at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.execute(VariantCommandExecutor.java:115)
at org.opencb.opencga.app.cli.analysis.AnalysisMain.privateMain(AnalysisMain.java:98)
at org.opencb.opencga.app.cli.analysis.AnalysisMain.main(AnalysisMain.java:34)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 19 more
2019-02-06 14:00:55 [main] INFO MRExecutor:53 - ------------------------------------------------------
2019-02-06 14:00:55 [main] INFO MRExecutor:54 - Exit value: -1
2019-02-06 14:00:55 [main] INFO MRExecutor:55 - Total time: 0.006s
2019-02-06 14:00:55 [main] ERROR VariantFileIndexerStorageOperation:282 - Error executing INDEX
org.opencb.opencga.storage.core.exceptions.StoragePipelineException: Can't calculate stats.
at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStatsForLoadedFiles(VariantStorageEngine.java:547)
at org.opencb.opencga.storage.core.variant.VariantStorageEngine.index(VariantStorageEngine.java:341)
at org.opencb.opencga.storage.hadoop.variant.HadoopVariantStorageEngine.index(HadoopVariantStorageEngine.java:230)
at org.opencb.opencga.storage.core.manager.variant.operations.VariantFileIndexerStorageOperation.index(VariantFileIndexerStorageOperation.java:280)
at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:181)
at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:172)
at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.index(VariantCommandExecutor.java:303)
at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.execute(VariantCommandExecutor.java:115)
at org.opencb.opencga.app.cli.analysis.AnalysisMain.privateMain(AnalysisMain.java:98)
at org.opencb.opencga.app.cli.analysis.AnalysisMain.main(AnalysisMain.java:34)
Caused by: org.opencb.opencga.storage.core.exceptions.StorageEngineException: Error executing MapReduce for : "Calculate stats of cohorts [ALL]"
at org.opencb.opencga.storage.hadoop.variant.executors.MRExecutor.run(MRExecutor.java:58)
at org.opencb.opencga.storage.hadoop.variant.stats.HadoopMRVariantStatisticsManager.calculateStatistics(HadoopMRVariantStatisticsManager.java:70)
at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStats(VariantStorageEngine.java:491)
at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStatsForLoadedFiles(VariantStorageEngine.java:545)
... 9 more
Need to grab the hadoop binaries from the HBase cluster when we grab the config, and add them to the PATH.
The binaries should be in /usr/hdp/current.
@j-coll to confirm exact files to copy across.
The files are located in /usr/hdp/current/hadoop-client.
Warning: there are some symlinks in the folder, so create the tar with the h flag (follow symlinks):
ssh ${HADOOP_NODE}
cd /usr/hdp/current/
tar zcvfh ~/hadoop-client.tar.gz hadoop-client/
scp ${HADOOP_NODE}:hadoop-client.tar.gz .
tar zxfv hadoop-client.tar.gz
# Get envvar HADOOP_HOME
. hadoop-client/conf/hadoop-env.sh
PATH=$PATH:$HADOOP_HOME/bin
sudo mkdir -p `dirname $HADOOP_HOME`
sudo ln -s $PWD/hadoop-client $HADOOP_HOME
@lawrencegripper so I think this bit:
ssh ${HADOOP_NODE}
cd /usr/hdp/current/
tar zcvfh ~/hadoop-client.tar.gz hadoop-client/
scp ${HADOOP_NODE}:hadoop-client.tar.gz .
tar zxfv hadoop-client.tar.gz
Wants to be in the init script. And...
# Get envvar HADOOP_HOME
. hadoop-client/conf/hadoop-env.sh
PATH=$PATH:$HADOOP_HOME/bin
sudo mkdir -p `dirname $HADOOP_HOME`
sudo ln -s $PWD/hadoop-client $HADOOP_HOME
probably in the batch script. We could extract in the batch script, but that means it would run on each job - not sure if that's an issue. It may be better though, or at least move the files to the container/bin folder?
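If the extraction does end up in the batch script, a small guard along these lines would keep repeat runs cheap (the /opt/hadoop-client path is just an assumption):
# Only extract the client bundle if a previous job hasn't already done so
if [ ! -d /opt/hadoop-client/hadoop-client ]; then
    mkdir -p /opt/hadoop-client
    tar zxf hadoop-client.tar.gz -C /opt/hadoop-client
fi
# Get envvar HADOOP_HOME and put the hadoop binary on the PATH
. /opt/hadoop-client/hadoop-client/conf/hadoop-env.sh
export PATH=$PATH:$HADOOP_HOME/bin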
This makes sense to me; let me have a play and get this working manually, then try to get this into a PR. What is the best way to repro this one? Follow @bart-jansen's script, load an item from 1000 genomes, then look at the task status in Batch?
Yes, or deploy the Azure branch (being aware of the t-shirt sizing and mongo issue), and create a task. The issue is this happens an hour or two into the job. I haven't looked into what the bottleneck is - it might be faster with a larger HBase cluster.
I should have a deployment up shortly if you want to use that. I'm not sure how Batch caches the images though; I guess if revisions have a new tag it will be forced to pull. You can edit the task and rerun it directly in Batch with a new image.
@j-coll Inside the hadoop-env.sh script copied from the cluster, JAVA_HOME is set:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME_WARN_SUPPRESS=1
What is the best way to deal with this? Can we have two JVMs installed on the image - one for OpenCGA and one for Hadoop? Or should I just strip this out from the script before it is run, to avoid it replacing the existing JAVA_HOME?
The other bit that looks interesting is how it loads additional libs that we won't have SCP'd onto the node. We'll end up with $HADOOP_HOME pointing to /usr/lib/hadoop, and I was planning to symlink that folder as you suggested, but that won't give /usr/hdp/current or the specific versions referenced any files, as nothing will be at that path... Does this matter?
#Add libraries required by nodemanager
MAPREDUCE_LIBS=/usr/hdp/current/hadoop-mapreduce-client/*
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}
if [ -d "/usr/lib/tez" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/tez/*:/usr/lib/tez/lib/*:/etc/tez/conf
fi
# Setting path to hdfs command line
export HADOOP_LIBEXEC_DIR=/usr/hdp/2.6.5.3007-3/hadoop/libexec
Full file:
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME_WARN_SUPPRESS=1
# Hadoop home directory
export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
# Hadoop Configuration Directory
#TODO: if env var set that can cause problems
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/2.6.5.3007-3/hadoop/conf}
# Path to jsvc required by secure HDP 2.0 datanode
export JSVC_HOME=/usr/lib/bigtop-utils
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE="1024"
export HADOOP_NAMENODE_INIT_HEAPSIZE="-Xms1024m"
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=namenode ${HADOOP_NAMENODE_OPTS}"
HADOOP_JOBTRACKER_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dmapred.audit.logger=INFO,MRAUDIT -Dhadoop.mapreduce.jobsummary.logger=INFO,JSA ${HADOOP_JOBTRACKER_OPTS}"
HADOOP_TASKTRACKER_OPTS="-server -Xmx1024m -Dhadoop.security.logger=ERROR,console -Dmapred.audit.logger=ERROR,console ${HADOOP_TASKTRACKER_OPTS}"
HADOOP_DATANODE_OPTS="-Xmx1024m -Dhadoop.security.logger=ERROR,DRFAS -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=datanode ${HADOOP_DATANODE_OPTS}"
HADOOP_BALANCER_OPTS="-server -Xmx1024m ${HADOOP_BALANCER_OPTS}"
export HADOOP_SECONDARYNAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps ${HADOOP_NAMENODE_INIT_HEAPSIZE} -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=secondarynamenode ${HADOOP_SECONDARYNAMENODE_OPTS}"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m $HADOOP_CLIENT_OPTS"
# On secure datanodes, user to run the datanode as after dropping privileges
export HADOOP_SECURE_DN_USER=hdfs
# Extra ssh options. Empty by default.
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o SendEnv=HADOOP_CONF_DIR"
# Where log files are stored. $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/var/log/hadoop/$USER
# History server logs
export HADOOP_MAPRED_LOG_DIR=/var/log/hadoop-mapreduce/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop/$HADOOP_SECURE_DN_USER
# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1
# The directory where pid files are stored. /tmp by default.
export HADOOP_PID_DIR=/var/run/hadoop/$USER
export HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop/$HADOOP_SECURE_DN_USER
# History server pid
export HADOOP_MAPRED_PID_DIR=/var/run/hadoop-mapreduce/$USER
YARN_RESOURCEMANAGER_OPTS="-Dyarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=resourcemanager"
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
# Use libraries from standard classpath
JAVA_JDBC_LIBS=""
#Add libraries required by mysql connector
for jarFile in `ls /usr/share/java/*mysql* 2>/dev/null`
do
JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
done
#Add libraries required by oracle connector
for jarFile in `ls /usr/share/java/*ojdbc* 2>/dev/null`
do
JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
done
#Add libraries required by nodemanager
MAPREDUCE_LIBS=/usr/hdp/current/hadoop-mapreduce-client/*
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}
if [ -d "/usr/lib/tez" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/tez/*:/usr/lib/tez/lib/*:/etc/tez/conf
fi
# Setting path to hdfs command line
export HADOOP_LIBEXEC_DIR=/usr/hdp/2.6.5.3007-3/hadoop/libexec
#Mostly required for hadoop 2.0
export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/usr/lib/hadoop/lib/native/Linux-amd64-64
# Enable ACLs on zookeper znodes if required
Regarding the JAVA_HOME issue, it could be a problem if they add some special configuration to the JVM, but that would only be required for the servers, not for the client.
There is no issue (that I know of) with having 2 JVMs on the same node. On my computer I have 4 or 5 different versions.
I was trying to avoid modifying the configuration files, but I'd prefer that to having 2 JVMs. It will simplify things.
Let's replace that JAVA_HOME with the correct location on the node.
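A one-liner along these lines should do it, assuming JAVA_HOME is already set correctly on the batch node:
# Rewrite the hard-coded JAVA_HOME in the copied hadoop-env.sh to whatever the node uses
sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=${JAVA_HOME}|" hadoop-client/conf/hadoop-env.sh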
We may also need to replace the file hadoop-client/bin/hadoop. The last line should point to hadoop.distro, probably with a path containing the full version.
Example from our HDP-2.6.5:
#!/bin/bash
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/2.6.5.0-292/hadoop}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-/usr/hdp/2.6.5.0-292/hadoop-mapreduce}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-/usr/hdp/2.6.5.0-292/hadoop-yarn}
export HADOOP_LIBEXEC_DIR=${HADOOP_HOME}/libexec
export HDP_VERSION=${HDP_VERSION:-2.6.5.0-292}
export HADOOP_OPTS="${HADOOP_OPTS} -Dhdp.version=${HDP_VERSION}"
exec /usr/hdp/2.6.5.0-292//hadoop/bin/hadoop.distro "$@"
This mixture of /usr/lib/hadoop and /usr/hdp/current is weird. I was expecting it to be /usr/hdp/current everywhere.
We definitely need the MAPREDUCE_LIBS.
We'll need to copy these two folders too from the hadoop node:
tar zcvfh ~/hadoop-mapreduce-client.tar.gz hadoop-mapreduce-client/
tar zcvfh ~/hadoop-yarn-client.tar.gz hadoop-yarn-client/
We can link them to both dirname $HADOOP_HOME and /usr/hdp/current (if different):
sudo ln -s $PWD/hadoop-mapreduce-client `dirname $HADOOP_HOME`
sudo ln -s $PWD/hadoop-yarn-client `dirname $HADOOP_HOME`
if [ "/usr/hdp/current" != `dirname $HADOOP_HOME` ] ; then
sudo mkdir -p /usr/hdp/current
sudo ln -s $PWD/hadoop-mapreduce-client /usr/hdp/current
sudo ln -s $PWD/hadoop-yarn-client /usr/hdp/current
fi
@j-coll I've been working on this but it's starting to feel very messy, as it pulls in the /usr/libs dirs referenced in the script too.
Would an option be to find the CLI install bundle which matches our Hortonworks environment and download and install that instead? I found a guide doing something similar here: https://docs.splunk.com/Documentation/HadoopConnect/1.2.5/DeployHadoopConnect/HadoopCLI
So here is a quick write-up of where I've got to with this (PR #1135).
I SSH'd onto the HDInsight cluster and reviewed the sources for apt. Under /etc/apt/sources.list.d/ there is a hdp.list file. This contained an Ubuntu repo which has the hadoop 3007-3 builds.
#VERSION_NUMBER=2.6.5.3007-3
deb https://private-repo-1.hortonworks.com/HDP/ubuntu16/2.x/updates/2.6.5.3007-3 HDP main
deb https://private-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/ubuntu16 HDP-UTILS main
I then used apt-key list and apt-key export "Jenkins (HDP Builds) <jenkin@hortonworks.com>" to pull out the public key for the repository, and saved that into an hdp-azure.pub file within the docker container.
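For reference, roughly the commands that were used (as described above, run against the HDInsight node's key store):
# List the trusted keys, then export the Hortonworks build key to a file for the docker build
apt-key list
apt-key export "Jenkins (HDP Builds) <jenkin@hortonworks.com>" > hdp-azure.pub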
I then followed the hdp install docs on how to install the required components and ended up with the following:
# Install all the Hortonworks Data Platform stuff we need
# hadoop etc
COPY ./opencga-app/app/scripts/docker/opencga-batch/hdp-azure.pub /tmp/
RUN apt-key add /tmp/hdp-azure.pub
RUN apt-get update && apt-get install apt-transport-https ca-certificates -y
RUN wget https://private-repo-1.hortonworks.com/HDP/ubuntu16/2.x/updates/2.6.5.3007-3/hdp.list -O /etc/apt/sources.list.d/hdp.list && apt-get update
RUN apt-get install zookeeper hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-mapreduce hadoop-client openssl hbase \
libsnappy1 liblzo2-2 liblzo2-dev phoenix -y
# Note: removed libsnappy-dev as it causes an issue
# Check hadoop is now on path
RUN hadoop
The outstanding questions are: how to get the /opt/opencga/conf/hadoop config files picked up by the binaries, and how to handle the hdp version to allow for easy updates in the future.
Quick update: we got this working by adding in a java_home env-var. This then resulted in receiving this error:
/usr/hdp/2.6.5.3007-3//hadoop/bin/hadoop.distro: line 180: /usr/lib/jvm/java-8-openjdk-amd64/bin/java: No such file or directory
It looks from the script like hadoop has a hard dependency on the OpenJDK, so we added this like so:
RUN apt-get install openjdk-8-jdk -y
After this we received the following error
log4j:ERROR Could not instantiate class [com.microsoft.log4jappender.EtwAppender].
java.lang.ClassNotFoundException: com.microsoft.log4jappender.EtwAppender
This was resolved by editing the log4j.properties in the hadoop config as follows (removing the filter and ETW providers):
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# \"License\"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Define some default values that can be overridden by system properties
hbase.root.logger=INFO,console
hbase.security.logger=INFO,console
hbase.log.dir=.
hbase.log.file=hbase.log
# Define the root logger to the system property "hbase.root.logger".
log4j.rootLogger=${hbase.root.logger}
# Logging Threshold
log4j.threshold=INFO
log4j.appender.RRFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=${hbase.log.dir}/${hbase.log.file}
log4j.appender.RFA.MaxFileSize=1024MB
log4j.appender.RFA.MaxBackupIndex=10
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n
# Rolling File Appender properties
hbase.log.maxfilesize=1024MB
hbase.log.maxbackupindex=10
# Rolling File Appender
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=${hbase.log.dir}/${hbase.log.file}
log4j.appender.RFA.MaxFileSize=${hbase.log.maxfilesize}
log4j.appender.RFA.MaxBackupIndex=${hbase.log.maxbackupindex}
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n
#
# Security audit appender
#
hbase.security.log.file=SecurityAuth.audit
hbase.security.log.maxfilesize=1024MB
hbase.security.log.maxbackupindex=10
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${hbase.log.dir}/${hbase.security.log.file}
log4j.appender.RFAS.MaxFileSize=${hbase.security.log.maxfilesize}
log4j.appender.RFAS.MaxBackupIndex=${hbase.security.log.maxbackupindex}
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.category.SecurityLogger=${hbase.security.logger}
log4j.additivity.SecurityLogger=false
#log4j.logger.SecurityLogger.org.apache.hadoop.hbase.security.access.AccessController=TRACE
#
# Null Appender
#
log4j.appender.NullAppender=org.apache.log4j.varia.NullAppender
#
# console
# Add \"console\" to rootlogger above if you want to use this
#
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n
# Custom Logging levels
log4j.logger.org.apache.zookeeper=INFO
#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG
log4j.logger.org.apache.hadoop.hbase=DEBUG
# Make these two classes INFO-level. Make them DEBUG to see more zk debug.
log4j.logger.org.apache.hadoop.hbase.zookeeper.ZKUtil=INFO
log4j.logger.org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher=INFO
#log4j.logger.org.apache.hadoop.dfs=DEBUG
# Set this class to log INFO only otherwise its OTT
# Enable this to get detailed connection error/retry logging.
# log4j.logger.org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation=TRACE
# Uncomment this line to enable tracing on _every_ RPC call (this can be a lot of output)
#log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=DEBUG
# Uncomment the below if you want to remove logging of client region caching'
# and scan of .META. messages
# log4j.logger.org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation=INFO
# log4j.logger.org.apache.hadoop.hbase.client.MetaScanner=INFO
After this we received the following error
ls: Cannot run program "/usr/lib/hdinsight-common/scripts/decrypt.sh": error=2, No such file or directory
We attempted to follow this solution to the issue: https://kb.informatica.com/solution/23/pages/64/522404.aspx
With this in place, everything worked up until we received:
opencga@7e1836968389:/opt/opencga$ hadoop --config /conf fs -ls
2019-02-12 15:14:29,997 WARN [main] utils.SSLSocketFactoryEx: Failed to load OpenSSL. Falling back to the JSSE default.
ls: Operation failed: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD, https://opencgajtmfcxofaem6k.dfs.core.windows.net/opencga//?upn=false&action=getAccessControl&timeout=90
We tried installing openssl to see if it resolved the issue. It didn't.
This looks to be related to the missing scripts/decrypt.sh, as in core-site.xml the keys used by Azure Blob appear to be encrypted, and we think they are decrypted by that script. We attempted to use the primary key directly by editing the config, and this failed with a slightly different error.
hadoop --config /conf fs -ls
2019-02-12 15:25:04,746 WARN [main] utils.SSLSocketFactoryEx: Failed to load OpenSSL. Falling back to the JSSE default.
ls: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://opencgajtmfcxofaem6k.dfs.core.windows.net/opencga//?upn=false&action=getAccessControl&timeout=90
So it looks like the "This request is not authorized to perform this operation." response is expected behaviour.
We then tried to use hadoop --config /conf jar /conf/hadoop-mapreduce-examples-2.7.2.jar wordcount bob bob to attempt a MapReduce job.
This is what we got when attempting to run an example wordcount using the client:
opencga@18800bfc78c6:/opt/opencga$ hadoop --config /conf jar /conf/hadoop-mapreduce-examples-2.7.2.jar wordcount bob bob
2019-02-12 16:06:22,162 WARN [main] utils.SSLSocketFactoryEx: Failed to load OpenSSL. Falling back to the JSSE default.
2019-02-12 16:06:22,506 INFO [main] service.AbstractService: Service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl failed in state INITED; cause: java.lang.IllegalArgumentException: java.net.UnknownHostException: headnodehost
java.lang.IllegalArgumentException: java.net.UnknownHostException: headnodehost
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:439)
I think this may be because this was run on my devbox, which isn't on the vnet... it may work correctly on the cluster's vnet.
So I'm seeing the same behaviour on the DaemonVM.
java.lang.IllegalArgumentException: java.net.UnknownHostException: headnodehost
Looks like the YARN config in yarn-site.xml has the following properties set:
...
<property>
<name>yarn.timeline-service.webapp.address</name>
<value>headnodehost:8190</value>
</property>
...
<property>
<name>yarn.timeline-service.webapp.https.address</name>
<value>headnodehost:8190</value>
</property>
...
<property>
<name>yarn.timeline-service.address</name>
<value>headnodehost:10200</value>
</property>
...
<property>
<name>yarn.log.server.web-service.url</name>
<value>http://headnodehost:8188/ws/v1/applicationhistory</value>
</property>
...
The DNS names used in hdfs-site.xml, such as hn1-cgahba.im4v1ebjwwzetffjudnxei3yug.ax.internal.cloudapp.net:8020, do resolve from the daemon, but the headnodehost in yarn-site.xml doesn't. I don't have an HDI box I can get onto to see how that resolves, but I suspect we might need to do a little magic to make sure that's accessible.
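One bit of "magic" that might be enough (purely an assumption at this stage) is adding a hosts entry on the daemon/batch node pointing headnodehost at the head node's private IP:
# IP shown is an example - it should be the active head node's private address
echo "10.0.0.22 headnodehost" | sudo tee -a /etc/hosts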
Trying to run this directly on the HDI node with its own config. Current output is...
hadoop --config /etc/hadoop/conf jar hadoop-mapreduce-examples-2.7.2.jar wordcount /home/sshuser/text out
19/02/12 18:05:12 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.22:10200
19/02/12 18:05:12 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
19/02/12 18:05:12 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
19/02/12 18:05:13 INFO mapreduce.JobSubmitter: Cleaning up the staging area /mapreducestaging/sshuser/.staging/job_1549630099904_0003
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: abfs://opencga@opencgajtmfcxofaem6k.dfs.core.windows.net/home/sshuser/text
Trying to create the file via Storage Explorer, but it seems to be hanging at the moment.
nslookup headnodehost on the HDI host doesn't resolve.
I think this might be due to the file URI scheme in HDInsight. I think the paths need to be of the scheme wasb:///example/jars/hadoop-mapreduce-examples.jar (but for the input path, not the example jar).
sshuser@hn0-cgahba:~$ hdfs --config /etc/hadoop/conf/ dfs -mkdir /test
sshuser@hn0-cgahba:~$ hdfs --config /etc/hadoop/conf/ dfs -put example.txt /test/example.txt
sshuser@hn0-cgahba:~$ hdfs --config /etc/hadoop/conf/ dfs -ls /test
Found 1 items
-rw-r--r-- 1 sshuser sshuser 0 2019-02-13 11:58 /test/example.txt
sshuser@hn0-cgahba:~$ hadoop --config /etc/hadoop/conf jar hadoop-mapreduce-examples-2.7.2.jar wordcount /test/example.txt out
19/02/13 11:58:36 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.22:10200
19/02/13 11:58:36 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
19/02/13 11:58:36 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
19/02/13 11:58:36 INFO input.FileInputFormat: Total input paths to process : 1
19/02/13 11:58:36 INFO mapreduce.JobSubmitter: number of splits:1
19/02/13 11:58:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1549630099904_0014
19/02/13 11:58:37 INFO impl.YarnClientImpl: Submitted application application_1549630099904_0014
19/02/13 11:58:37 INFO mapreduce.Job: The url to track the job: http://hn1-cgahba.lagmfxvytx3unc1qt1zkekg32a.ax.internal.cloudapp.net:8088/proxy/application_1549630099904_0014/
19/02/13 11:58:37 INFO mapreduce.Job: Running job: job_1549630099904_0014
19/02/13 11:58:48 INFO mapreduce.Job: Job job_1549630099904_0014 running in uber mode : false
19/02/13 11:58:48 INFO mapreduce.Job: map 0% reduce 0%
19/02/13 11:58:52 INFO mapreduce.Job: map 100% reduce 0%
19/02/13 11:58:57 INFO mapreduce.Job: map 100% reduce 100%
19/02/13 11:58:58 INFO mapreduce.Job: Job job_1549630099904_0014 completed successfully
19/02/13 11:58:58 INFO mapreduce.Job: Counters: 49
File System Counters
ABFS: Number of bytes read=138
ABFS: Number of bytes written=0
ABFS: Number of read operations=0
ABFS: Number of large read operations=0
ABFS: Number of write operations=0
FILE: Number of bytes read=6
FILE: Number of bytes written=327517
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=2502
Total time spent by all reduces in occupied slots (ms)=4796
Total time spent by all map tasks (ms)=2502
Total time spent by all reduce tasks (ms)=2398
Total vcore-milliseconds taken by all map tasks=2502
Total vcore-milliseconds taken by all reduce tasks=2398
Total megabyte-milliseconds taken by all map tasks=1921536
Total megabyte-milliseconds taken by all reduce tasks=3683328
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=6
Input split bytes=138
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=153
CPU time spent (ms)=2320
Physical memory (bytes) snapshot=696377344
Virtual memory (bytes) snapshot=5163016192
Total committed heap usage (bytes)=602931200
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
Need to investigate whether ZooKeeper serves the headnodehost value to hadoop.
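One way to poke at that (assuming the HDP ZooKeeper client is available, e.g. on one of the HDI nodes) is something like:
# Open an interactive ZooKeeper shell and browse the registered znodes
# (the zk host name is an assumption)
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server zk0-cgahba:2181
# then, inside the shell:
ls /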
Connected to ZK and checked: it doesn't provide headnodehost. However... headnodehost does actually resolve on the HDI node, as it is in the /etc/hosts file, i.e.
10.0.0.22 hn0-cgahba.lagmfxvytx3unc1qt1zkekg32a.ax.internal.cloudapp.net hn0-cgahba.lagmfxvytx3unc1qt1zkekg32a.ax.internal.cloudapp.net. headnodehost hn0-cgahba headnodehost.
This doesn't work via nslookup but does via ping, as the DNS resolution wasn't respecting the /etc/hosts file.
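A quick way to confirm that distinction is to resolve through the normal NSS lookup path rather than querying DNS directly:
# getent consults /etc/hosts (via nsswitch), whereas nslookup talks straight to the DNS server
getent hosts headnodehost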
headnodehost is intended for use inside the HDI cluster only, as their clients will retry if it fails - in the event that the master headnode has failed, they need to attempt the secondary headnode.
A potential option might be... to create a script, alias it to hadoop, have it take the same parameters as the hadoop CLI, scp the jar up to the HDI node, and then ssh remote-execute the hadoop CLI on the HDI node directly. This is a hack, but it would save us getting the CLI working and modifying the config. It'd also allow us to retry if the node is unhealthy.
This would be an issue if the client needs to connect directly to storage to upload any local files or something like that.
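A rough sketch of that wrapper, purely as an illustration (host, user and the /tmp jar location are all assumptions, not a final design):
#!/bin/bash
# Hypothetical "hadoop" shim: copy any local jar arguments to the HDI head node,
# then run the real hadoop CLI there over ssh with the same arguments.
HDI_HOST=${HDI_HOST:-hn0-cgahba}
HDI_USER=${HDI_USER:-sshuser}

args=()
for arg in "$@"; do
    if [[ -f "$arg" && "$arg" == *.jar ]]; then
        scp "$arg" "${HDI_USER}@${HDI_HOST}:/tmp/$(basename "$arg")"
        args+=("/tmp/$(basename "$arg")")
    else
        args+=("$arg")
    fi
done

# Note: naive quoting - fine for a sketch, not for arguments containing spaces
ssh "${HDI_USER}@${HDI_HOST}" hadoop "${args[@]}"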
Based on current work, the following changes are needed in the init container (see comment below):
Add JAVA_HOME to the envs in the docker file.
Replace the log4j.properties file after downloading the config via SCP (just tested quickly on the cluster - appears to fix the error and is simpler than updating it with sed).
Update the core-site.xml file. Comment out:
<!-- property>
<name>fs.azure.account.keyprovider.opencgajtmfcxofaem6k.blob.core.windows.net</name>
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property -->
<!-- property>
<name>fs.azure.account.keyprovider.opencgajtmfcxofaem6k.dfs.core.windows.net</name>
<value>org.apache.hadoop.fs.azurebfs.services.ShellDecryptionKeyProvider</value>
</property -->
and
<!-- property>
<name>fs.azure.shellkeyprovider.script</name>
<value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property -->
In the following blocks, replace DATA_HERE with the Azure storage key:
<property>
<name>fs.azure.account.key.opencgajtmfcxofaem6k.blob.core.windows.net</name>
<value>DATA_HERE</value>
</property>
<property>
<name>fs.azure.account.key.opencgajtmfcxofaem6k.dfs.core.windows.net</name>
<value>DATA_HERE</value>
</property>
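If this gets scripted in the init container, the key could be pulled with the Azure CLI instead of pasted by hand (the resource group variable and the exact wiring are assumptions):
# Fetch the primary key for the storage account referenced in core-site.xml
az storage account keys list \
    --resource-group "${RESOURCE_GROUP}" \
    --account-name opencgajtmfcxofaem6k \
    --query "[0].value" -o tsv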
We're going to move to ssh'ing into the cluster and using this approach to run commands.
@j-coll is working on edits to make this work in the Java. It will pick up host, user and password from the variant storage config and use these to run the commands on the cluster.
@jjcollinge is updating the init container to inject these config values correctly and (may) also copy up the files needed, or @lawrencegripper (me) will pick this up.
@lawrencegripper is going to reach out to the HDInsight team to request confirmation on how/when the /etc/hosts files are updated, and have a little bit of a look around myself.
Regarding the list of files to copy to the cluster:
To simplify things, the files on the hadoop node should be in the same path as on the batch node. Otherwise, I can "fix" the command to execute with sed.
Select the files to copy:
Option 1 : Simplest solution. Copy the whole opencga installation dir into the edge node. This can not fail!
Option 2 : Copy a hard-coded list of files:
find /opt/opencga/ -name opencga-storage-hadoop-core-*-jar-with-dependencies.jar
find /opt/opencga/libs/ -name "jackson-*-2.[0-9].[0-9].jar"
find /opt/opencga/libs/ -name "protobuf-java-*.jar"
find /opt/opencga/libs/ -name "avro-*.jar"
...jar-with-dependencies.jar
export BASEDIR=/opt/opencga
source $BASEDIR/conf/opencga-env.sh
echo $BASEDIR/opencga-storage-hadoop-core-*-jar-with-dependencies.jar
echo $HADOOP_CLASSPATH | tr ":" "\n" | grep jar
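As a sketch of how the list generated above could then be shipped to the hadoop edge node at the same paths (HADOOP_NODE and the rsync approach are assumptions):
# rsync -R (--relative) recreates each jar at the same absolute path on the edge node
for jar in $(echo "$HADOOP_CLASSPATH" | tr ":" "\n" | grep "\.jar$"); do
    rsync -R "$jar" "${HADOOP_NODE}:/"
done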
Great, thanks. I'll aim for option 1 because I'm lazy - it'd be best to get the e2e working before optimizing.
Resolved in #1156 and #1157, which are merged in #1164.
Azure Batch returns an exit code of 1, hence the job shows as a failure.