opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company Zetta Genomics. Please contact support@zettagenomics.com for bug reports and feature requests.
Apache License 2.0

Azure Batch tasks return a failure #1113

Closed marrobi closed 5 years ago

marrobi commented 5 years ago

Azure Batch returns an exit code of 1, hence the job shows as a failure.

marrobi commented 5 years ago

Here's the log from a batch job - it doesn't look like it ever actually completes. stderr.txt

There are a number of different exceptions, summarised as follows:

We need to ensure these are handled appropriately. Not sure which of them causes the job to fail.

j-coll commented 5 years ago

This should be fixed with #994

marrobi commented 5 years ago

@j-coll Afraid there are more errors:

stderr.txt


java.io.IOException: Cannot run program "hadoop": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at java.lang.Runtime.exec(Runtime.java:620)
    at java.lang.Runtime.exec(Runtime.java:485)
    at org.opencb.commons.exec.Command.run(Command.java:93)
    at org.opencb.opencga.storage.hadoop.variant.executors.ExternalMRExecutor.run(ExternalMRExecutor.java:49)
    at org.opencb.opencga.storage.hadoop.variant.executors.ExternalMRExecutor.run(ExternalMRExecutor.java:39)
    at org.opencb.opencga.storage.hadoop.variant.executors.MRExecutor.run(MRExecutor.java:71)
    at org.opencb.opencga.storage.hadoop.variant.executors.MRExecutor.run(MRExecutor.java:52)
    at org.opencb.opencga.storage.hadoop.variant.stats.HadoopMRVariantStatisticsManager.calculateStatistics(HadoopMRVariantStatisticsManager.java:70)
    at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStats(VariantStorageEngine.java:491)
    at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStatsForLoadedFiles(VariantStorageEngine.java:545)
    at org.opencb.opencga.storage.core.variant.VariantStorageEngine.index(VariantStorageEngine.java:341)
    at org.opencb.opencga.storage.hadoop.variant.HadoopVariantStorageEngine.index(HadoopVariantStorageEngine.java:230)
    at org.opencb.opencga.storage.core.manager.variant.operations.VariantFileIndexerStorageOperation.index(VariantFileIndexerStorageOperation.java:280)
    at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:181)
    at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:172)
    at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.index(VariantCommandExecutor.java:303)
    at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.execute(VariantCommandExecutor.java:115)
    at org.opencb.opencga.app.cli.analysis.AnalysisMain.privateMain(AnalysisMain.java:98)
    at org.opencb.opencga.app.cli.analysis.AnalysisMain.main(AnalysisMain.java:34)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 19 more
2019-02-06 14:00:55 [main] INFO  MRExecutor:53 - ------------------------------------------------------
2019-02-06 14:00:55 [main] INFO  MRExecutor:54 - Exit value: -1
2019-02-06 14:00:55 [main] INFO  MRExecutor:55 - Total time: 0.006s
2019-02-06 14:00:55 [main] ERROR VariantFileIndexerStorageOperation:282 - Error executing INDEX
org.opencb.opencga.storage.core.exceptions.StoragePipelineException: Can't calculate stats.
    at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStatsForLoadedFiles(VariantStorageEngine.java:547)
    at org.opencb.opencga.storage.core.variant.VariantStorageEngine.index(VariantStorageEngine.java:341)
    at org.opencb.opencga.storage.hadoop.variant.HadoopVariantStorageEngine.index(HadoopVariantStorageEngine.java:230)
    at org.opencb.opencga.storage.core.manager.variant.operations.VariantFileIndexerStorageOperation.index(VariantFileIndexerStorageOperation.java:280)
    at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:181)
    at org.opencb.opencga.storage.core.manager.variant.VariantStorageManager.index(VariantStorageManager.java:172)
    at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.index(VariantCommandExecutor.java:303)
    at org.opencb.opencga.app.cli.analysis.executors.VariantCommandExecutor.execute(VariantCommandExecutor.java:115)
    at org.opencb.opencga.app.cli.analysis.AnalysisMain.privateMain(AnalysisMain.java:98)
    at org.opencb.opencga.app.cli.analysis.AnalysisMain.main(AnalysisMain.java:34)
Caused by: org.opencb.opencga.storage.core.exceptions.StorageEngineException: Error executing MapReduce for : "Calculate stats of cohorts [ALL]"
    at org.opencb.opencga.storage.hadoop.variant.executors.MRExecutor.run(MRExecutor.java:58)
    at org.opencb.opencga.storage.hadoop.variant.stats.HadoopMRVariantStatisticsManager.calculateStatistics(HadoopMRVariantStatisticsManager.java:70)
    at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStats(VariantStorageEngine.java:491)
    at org.opencb.opencga.storage.core.variant.VariantStorageEngine.calculateStatsForLoadedFiles(VariantStorageEngine.java:545)
    ... 9 more
martinpeck commented 5 years ago

We need to grab the hadoop binaries from the HBase cluster when we grab the config, and add them to the PATH.

martinpeck commented 5 years ago

The binary should be in /usr/hdp/current. @j-coll to confirm the exact files to copy across.

j-coll commented 5 years ago

The files are located in /usr/hdp/current/hadoop-client. Warning: there are some symlinks in the folder, so create the tar with -h to dereference them.

Get the hadoop-client

ssh ${HADOOP_NODE}

cd /usr/hdp/current/
tar zcvfh ~/hadoop-client.tar.gz hadoop-client/

Link and configure PATH

scp ${HADOOP_NODE}:hadoop-client.tar.gz .
tar zxfv hadoop-client.tar.gz
# Get envvar HADOOP_HOME
. hadoop-client/conf/hadoop-env.sh

PATH=$PATH:$HADOOP_HOME/bin

sudo mkdir -p `dirname $HADOOP_HOME`
sudo ln -s $PWD/hadoop-client $HADOOP_HOME
marrobi commented 5 years ago

@lawrencegripper so I think this bit:

ssh ${HADOOP_NODE}

cd /usr/hdp/current/
tar zcvfh ~/hadoop-client.tar.gz hadoop-client/
scp ${HADOOP_NODE}:hadoop-client.tar.gz .
tar zxfv hadoop-client.tar.gz

Wants to be in the init script. And...

# Get envvar HADOOP_HOME
. hadoop-client/conf/hadoop-env.sh

PATH=$PATH:$HADOOP_HOME/bin

sudo mkdir -p `dirname $HADOOP_HOME`
sudo ln -s $PWD/hadoop-client $HADOOP_HOME

probably belongs in the batch script. We could do the extraction in the batch script too, but that means it would run on each job - not sure if that's an issue. It may be better though, or at least moving the files to the container/bin folder?
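As a concrete illustration, the split could look roughly like this (untested; it simply rearranges the commands above, with ${HADOOP_NODE} and the working directory as placeholders):

# init script: fetch and extract the client bundle once
ssh ${HADOOP_NODE} "cd /usr/hdp/current/ && tar zcvfh ~/hadoop-client.tar.gz hadoop-client/"
scp ${HADOOP_NODE}:hadoop-client.tar.gz .
tar zxfv hadoop-client.tar.gz

# batch script: wire the client up for each job
. hadoop-client/conf/hadoop-env.sh        # sets HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin
sudo mkdir -p `dirname $HADOOP_HOME`
sudo ln -s $PWD/hadoop-client $HADOOP_HOME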

lawrencegripper commented 5 years ago

This makes sense to me, let me have a play and get this working manually, then try to get it into a PR. What is the best way to repro this one? Follow @bart-jansen's script, load an item from 1000 Genomes, then look at the task status in Batch?

marrobi commented 5 years ago

Yes, or deploy the Azure branch (being aware of the t-shirt sizing and mongo issue) and create a task. The issue is this happens an hour or two into the job. I haven't looked at what the bottleneck is - it might be faster with a larger HBase cluster.

I should have a deployment up shortly, if you want to use that. I'm not sure how Batch caches the images though; I guess if revisions have a new tag it will be forced to pull. You can edit the task and rerun it directly in Batch with a new image.

lawrencegripper commented 5 years ago

@j-coll inside the hadoop-env.sh script copied from the cluster the JAVA_HOME is set:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME_WARN_SUPPRESS=1

What is the best way to deal with this? Can we have two JVMs installed on the image, one for OpenCGA and one for Hadoop? Or should I just strip this out of the script before it is run, to avoid it replacing the existing JAVA_HOME?

The other bit that looks interesting is how it loads additional libs that we won't have SCP'd onto the node. We'll end up with $HADOOP_HOME pointing to /usr/lib/hadoop and I was planning to symlink that folder as you suggested, but that won't give /usr/hdp/current or the specific versions referenced any files, as nothing will be at that path. Does this matter?

#Add libraries required by nodemanager
MAPREDUCE_LIBS=/usr/hdp/current/hadoop-mapreduce-client/*
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}

if [ -d "/usr/lib/tez" ]; then
  export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/tez/*:/usr/lib/tez/lib/*:/etc/tez/conf
fi

# Setting path to hdfs command line
export HADOOP_LIBEXEC_DIR=/usr/hdp/2.6.5.3007-3/hadoop/libexec

Full file

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME_WARN_SUPPRESS=1

# Hadoop home directory
export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}

# Hadoop Configuration Directory
#TODO: if env var set that can cause problems
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/2.6.5.3007-3/hadoop/conf}

# Path to jsvc required by secure HDP 2.0 datanode
export JSVC_HOME=/usr/lib/bigtop-utils

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE="1024"

export HADOOP_NAMENODE_INIT_HEAPSIZE="-Xms1024m"

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=namenode ${HADOOP_NAMENODE_OPTS}"
HADOOP_JOBTRACKER_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dmapred.audit.logger=INFO,MRAUDIT -Dhadoop.mapreduce.jobsummary.logger=INFO,JSA ${HADOOP_JOBTRACKER_OPTS}"

HADOOP_TASKTRACKER_OPTS="-server -Xmx1024m -Dhadoop.security.logger=ERROR,console -Dmapred.audit.logger=ERROR,console ${HADOOP_TASKTRACKER_OPTS}"
HADOOP_DATANODE_OPTS="-Xmx1024m -Dhadoop.security.logger=ERROR,DRFAS -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=datanode ${HADOOP_DATANODE_OPTS}"
HADOOP_BALANCER_OPTS="-server -Xmx1024m ${HADOOP_BALANCER_OPTS}"

export HADOOP_SECONDARYNAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps ${HADOOP_NAMENODE_INIT_HEAPSIZE} -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=secondarynamenode ${HADOOP_SECONDARYNAMENODE_OPTS}"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m $HADOOP_CLIENT_OPTS"
# On secure datanodes, user to run the datanode as after dropping privileges
export HADOOP_SECURE_DN_USER=hdfs

# Extra ssh options.  Empty by default.
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o SendEnv=HADOOP_CONF_DIR"

# Where log files are stored.  $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/var/log/hadoop/$USER

# History server logs
export HADOOP_MAPRED_LOG_DIR=/var/log/hadoop-mapreduce/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop/$HADOOP_SECURE_DN_USER

# File naming remote slave hosts.  $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

# host:path where hadoop code should be rsync'd from.  Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1

# The directory where pid files are stored. /tmp by default.
export HADOOP_PID_DIR=/var/run/hadoop/$USER
export HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop/$HADOOP_SECURE_DN_USER

# History server pid
export HADOOP_MAPRED_PID_DIR=/var/run/hadoop-mapreduce/$USER

YARN_RESOURCEMANAGER_OPTS="-Dyarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY -Dwhitelist.filename=core-whitelist.res,coremanual-whitelist.res -Dcomponent=resourcemanager"

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

# The scheduling priority for daemon processes.  See 'man nice'.

# export HADOOP_NICENESS=10

# Use libraries from standard classpath
JAVA_JDBC_LIBS=""
#Add libraries required by mysql connector
for jarFile in `ls /usr/share/java/*mysql* 2>/dev/null`
do
  JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
done
#Add libraries required by oracle connector
for jarFile in `ls /usr/share/java/*ojdbc* 2>/dev/null`
do
  JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
done
#Add libraries required by nodemanager
MAPREDUCE_LIBS=/usr/hdp/current/hadoop-mapreduce-client/*
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}

if [ -d "/usr/lib/tez" ]; then
  export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/tez/*:/usr/lib/tez/lib/*:/etc/tez/conf
fi

# Setting path to hdfs command line
export HADOOP_LIBEXEC_DIR=/usr/hdp/2.6.5.3007-3/hadoop/libexec

#Mostly required for hadoop 2.0
export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/usr/lib/hadoop/lib/native/Linux-amd64-64

# Enable ACLs on zookeper znodes if required
j-coll commented 5 years ago

Regarding the JAVA_HOME issue, it could be a problem if they add some special configuration to the JVM, but that would only be required for the servers, not for the client.

There is no issue (that I know of) with having two JVMs on the same node; on my computer I have 4 or 5 different versions. I was trying to avoid modifying the configuration files, but I prefer that to having two JVMs - it will simplify things. Let's replace that JAVA_HOME with the correct location on the node.
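For illustration, a one-line way to do that in the setup script (a minimal sketch, assuming the OpenCGA image already exports a valid JAVA_HOME):

# rewrite the copied hadoop-env.sh to point at this node's JVM (assumes JAVA_HOME is set here)
sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=${JAVA_HOME}|" hadoop-client/conf/hadoop-env.sh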

We may also need to replace the file hadoop-client/bin/hadoop. The last line should point to hadoop.distro, probably with a path containing the full version.

Example from our HDP-2.6.5:

#!/bin/bash

export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/2.6.5.0-292/hadoop}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-/usr/hdp/2.6.5.0-292/hadoop-mapreduce}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-/usr/hdp/2.6.5.0-292/hadoop-yarn}
export HADOOP_LIBEXEC_DIR=${HADOOP_HOME}/libexec
export HDP_VERSION=${HDP_VERSION:-2.6.5.0-292}
export HADOOP_OPTS="${HADOOP_OPTS} -Dhdp.version=${HDP_VERSION}"

exec /usr/hdp/2.6.5.0-292//hadoop/bin/hadoop.distro "$@"

This mixture of /usr/lib/hadoop and /usr/hdp/current is weird. I was expecting it to be /usr/hdp/current everywhere. We definitely need the MAPREDUCE_LIBS.

We'll need to copy these two folders from the hadoop node too:

tar zcvfh ~/hadoop-mapreduce-client.tar.gz hadoop-mapreduce-client/
tar zcvfh ~/hadoop-yarn-client.tar.gz hadoop-yarn-client/

We can link them to both dirname $HADOOP_HOME and /usr/hdp/current (if different)

sudo ln -s $PWD/hadoop-mapreduce-client `dirname $HADOOP_HOME`
sudo ln -s $PWD/hadoop-yarn-client `dirname $HADOOP_HOME`
if [ "/usr/hdp/current" !=  `dirname $HADOOP_HOME` ] ; then 
   sudo mkdir -p /usr/hdp/current
   sudo ln -s $PWD/hadoop-mapreduce-client /usr/hdp/current
   sudo ln -s $PWD/hadoop-yarn-client /usr/hdp/current
fi
lawrencegripper commented 5 years ago

@j-coll I've been working on this, but it's starting to feel very messy as we're also pulling in the /usr/lib dirs referenced in the script.

Would an option be to find the CLI install bundle which matches our Hortonworks environment and download and install that instead? I found a guide doing something similar here: https://docs.splunk.com/Documentation/HadoopConnect/1.2.5/DeployHadoopConnect/HadoopCLI

lawrencegripper commented 5 years ago

So here is a quick write-up of where I've got to with this (PR #1135).

I SSH'd onto the HDInsight cluster and reviewed the apt sources. Under /etc/apt/sources.list.d/ there is an hdp.list file. This contained an Ubuntu repo which has the HDP 2.6.5.3007-3 builds.

#VERSION_NUMBER=2.6.5.3007-3
deb https://private-repo-1.hortonworks.com/HDP/ubuntu16/2.x/updates/2.6.5.3007-3 HDP main
deb https://private-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/ubuntu16 HDP-UTILS main

I then used apt-key list and apt-key export "Jenkins (HDP Builds) <jenkin@hortonworks.com>" to pull out the public key for the repository and saved it into the hdp-azure.pub file used by the Docker container.
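Roughly these commands (reconstructed from the description above; redirecting the export to a file is an assumption about the exact invocation):

apt-key list
apt-key export "Jenkins (HDP Builds) <jenkin@hortonworks.com>" > hdp-azure.pub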

I then followed the HDP install docs on how to install the required components and ended up with the following:

# Install all the Hortonworks Data Platform stuff we need
# hadoop etc
COPY ./opencga-app/app/scripts/docker/opencga-batch/hdp-azure.pub /tmp/
RUN apt-key add /tmp/hdp-azure.pub
RUN apt-get update && apt-get install apt-transport-https ca-certificates -y
RUN wget https://private-repo-1.hortonworks.com/HDP/ubuntu16/2.x/updates/2.6.5.3007-3/hdp.list -O /etc/apt/sources.list.d/hdp.list && apt-get update
RUN apt-get install zookeeper hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-mapreduce hadoop-client openssl hbase \
 libsnappy1 liblzo2-2 liblzo2-dev phoenix -y

 # Note: removed libsnappy-dev as it causes an issue

# Check hadoop is now on path
RUN hadoop

The outstanding questions are:

  1. Does this build work reliably?
  2. Do we need to configure the /opt/opencga/conf/hadoop config files to be picked up by the binaries?
  3. Ideally we'd refactor this to use a variable for the HDP version to allow easy updates in the future (see the sketch below).
  4. Does this key change regularly?
  5. Does this fix the problem?
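For question 3, a hedged sketch of the same steps with the HDP version factored into a single variable (untested; in the Dockerfile this would become an ARG or ENV):

# Parameterise the HDP version so updating it touches one line only
HDP_VERSION=2.6.5.3007-3
wget "https://private-repo-1.hortonworks.com/HDP/ubuntu16/2.x/updates/${HDP_VERSION}/hdp.list" -O /etc/apt/sources.list.d/hdp.list
apt-get update
apt-get install -y zookeeper hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-mapreduce hadoop-client openssl hbase \
    libsnappy1 liblzo2-2 liblzo2-dev phoenix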
lawrencegripper commented 5 years ago

Quick update. We got this working by adding a JAVA_HOME env var. This then resulted in the following error:

/usr/hdp/2.6.5.3007-3//hadoop/bin/hadoop.distro: line 180: /usr/lib/jvm/java-8-openjdk-amd64/bin/java: No such file or directory  

It looks from the script like hadoop has a hard dependency on OpenJDK, so we added it like so:

RUN apt-get install openjdk-8-jdk -y

After this we received the following error

log4j:ERROR Could not instantiate class [com.microsoft.log4jappender.EtwAppender].
java.lang.ClassNotFoundException: com.microsoft.log4jappender.EtwAppender

This was resolved by editing log4j.properties in the hadoop config as follows (removing the filter and ETW providers):

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# \"License\"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Define some default values that can be overridden by system properties
hbase.root.logger=INFO,console
hbase.security.logger=INFO,console
hbase.log.dir=.
hbase.log.file=hbase.log

# Define the root logger to the system property "hbase.root.logger".
log4j.rootLogger=${hbase.root.logger}

# Logging Threshold
log4j.threshold=INFO

log4j.appender.RRFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=${hbase.log.dir}/${hbase.log.file}
log4j.appender.RFA.MaxFileSize=1024MB
log4j.appender.RFA.MaxBackupIndex=10
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n

# Rolling File Appender properties
hbase.log.maxfilesize=1024MB
hbase.log.maxbackupindex=10

# Rolling File Appender
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=${hbase.log.dir}/${hbase.log.file}

log4j.appender.RFA.MaxFileSize=${hbase.log.maxfilesize}
log4j.appender.RFA.MaxBackupIndex=${hbase.log.maxbackupindex}

log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n

#
# Security audit appender
#
hbase.security.log.file=SecurityAuth.audit
hbase.security.log.maxfilesize=1024MB
hbase.security.log.maxbackupindex=10
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${hbase.log.dir}/${hbase.security.log.file}
log4j.appender.RFAS.MaxFileSize=${hbase.security.log.maxfilesize}
log4j.appender.RFAS.MaxBackupIndex=${hbase.security.log.maxbackupindex}
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.category.SecurityLogger=${hbase.security.logger}
log4j.additivity.SecurityLogger=false
#log4j.logger.SecurityLogger.org.apache.hadoop.hbase.security.access.AccessController=TRACE

#
# Null Appender
#
log4j.appender.NullAppender=org.apache.log4j.varia.NullAppender

#
# console
# Add \"console\" to rootlogger above if you want to use this
#
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2}: %m%n

# Custom Logging levels

log4j.logger.org.apache.zookeeper=INFO
#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG
log4j.logger.org.apache.hadoop.hbase=DEBUG
# Make these two classes INFO-level. Make them DEBUG to see more zk debug.
log4j.logger.org.apache.hadoop.hbase.zookeeper.ZKUtil=INFO
log4j.logger.org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher=INFO
#log4j.logger.org.apache.hadoop.dfs=DEBUG
# Set this class to log INFO only otherwise its OTT
# Enable this to get detailed connection error/retry logging.
# log4j.logger.org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation=TRACE

# Uncomment this line to enable tracing on _every_ RPC call (this can be a lot of output)
#log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=DEBUG

# Uncomment the below if you want to remove logging of client region caching'
# and scan of .META. messages
# log4j.logger.org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation=INFO
# log4j.logger.org.apache.hadoop.hbase.client.MetaScanner=INFO

After this we received the following error

ls: Cannot run program "/usr/lib/hdinsight-common/scripts/decrypt.sh": error=2, No such file or directory

We attempted to follow this solution to this issue: https://kb.informatica.com/solution/23/pages/64/522404.aspx

With this in place, everything worked up until we received:

opencga@7e1836968389:/opt/opencga$ hadoop --config /conf fs -ls
2019-02-12 15:14:29,997 WARN  [main] utils.SSLSocketFactoryEx: Failed to load OpenSSL. Falling back to the JSSE default.
ls: Operation failed: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD, https://opencgajtmfcxofaem6k.dfs.core.windows.net/opencga//?upn=false&action=getAccessControl&timeout=90

We tried installing openssl to see if that resolved the issue. It didn't.

This looks to be related to the missing scripts/decrypt.sh: in core-site.xml the keys used by Azure Blob storage appear to be encrypted, and we think these are decrypted by that script. We attempted to use the primary key directly by editing the config, and this failed with a slightly different error.

hadoop --config /conf fs -ls
2019-02-12 15:25:04,746 WARN  [main] utils.SSLSocketFactoryEx: Failed to load OpenSSL. Falling back to the JSSE default.
ls: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://opencgajtmfcxofaem6k.dfs.core.windows.net/opencga//?upn=false&action=getAccessControl&timeout=90
lawrencegripper commented 5 years ago

So it looks like the "This request is not authorized to perform this operation." error is expected behavior.

We then tried to run an example wordcount MapReduce job using the client, with hadoop --config /conf jar /conf/hadoop-mapreduce-examples-2.7.2.jar wordcount bob bob, and got this:

opencga@18800bfc78c6:/opt/opencga$ hadoop --config /conf jar /conf/hadoop-mapreduce-examples-2.7.2.jar wordcount bob bob
2019-02-12 16:06:22,162 WARN  [main] utils.SSLSocketFactoryEx: Failed to load OpenSSL. Falling back to the JSSE default.
2019-02-12 16:06:22,506 INFO  [main] service.AbstractService: Service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl failed in state INITED; cause: java.lang.IllegalArgumentException: java.net.UnknownHostException: headnodehost
java.lang.IllegalArgumentException: java.net.UnknownHostException: headnodehost
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:439)

I think this may be because this was run on my devbox, which isn't on the VNet... it may work correctly on the cluster's VNet.

jjcollinge commented 5 years ago

So I'm seeing the same behaviour on the DaemonVM.

java.lang.IllegalArgumentException: java.net.UnknownHostException: headnodehost

Looks like YARN config in yarn-site.xml has the following properties set:

...
<property>
      <name>yarn.timeline-service.webapp.address</name>
      <value>headnodehost:8190</value>
    </property>
...
<property>
      <name>yarn.timeline-service.webapp.https.address</name>
      <value>headnodehost:8190</value>
    </property>
...
 <property>
      <name>yarn.timeline-service.address</name>
      <value>headnodehost:10200</value>
    </property>
...
<property>
      <name>yarn.log.server.web-service.url</name>
      <value>http://headnodehost:8188/ws/v1/applicationhistory</value>
    </property>
...

The DNS names used in hdfs-site.xml, such as hn1-cgahba.im4v1ebjwwzetffjudnxei3yug.ax.internal.cloudapp.net:8020, do resolve from the daemon, but the headnodehost in yarn-site.xml doesn't. I don't have an HDI box I can get onto to see how that resolves, but I suspect we might need to do a little magic to make sure that's accessible.
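One form that magic could take is mapping headnodehost in the client's /etc/hosts to a headnode IP (purely illustrative; the IP is the one seen in the logs below, and this ignores headnode failover):

echo "10.0.0.22 headnodehost" | sudo tee -a /etc/hosts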

jjcollinge commented 5 years ago

Trying to run this directly on the HDI node with its own config. Current output is:

hadoop --config /etc/hadoop/conf jar hadoop-mapreduce-examples-2.7.2.jar wordcount /home/sshuser/text out
19/02/12 18:05:12 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.22:10200
19/02/12 18:05:12 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
19/02/12 18:05:12 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
19/02/12 18:05:13 INFO mapreduce.JobSubmitter: Cleaning up the staging area /mapreducestaging/sshuser/.staging/job_1549630099904_0003
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: abfs://opencga@opencgajtmfcxofaem6k.dfs.core.windows.net/home/sshuser/text

Trying to create the file via Storage Explorer, but it seems to be hanging at the moment.

nslookup headnodehost on the HDI host doesn't resolve.

lawrencegripper commented 5 years ago

I think this might be due to the file URI scheme in HDInsight. I think they need to be of the scheme wasb:///example/jars/hadoop-mapreduce-examples.jar (but for the input path, not the example jar).

jjcollinge commented 5 years ago
sshuser@hn0-cgahba:~$ hdfs --config /etc/hadoop/conf/ dfs -mkdir /test
sshuser@hn0-cgahba:~$ hdfs --config /etc/hadoop/conf/ dfs -put example.txt /test/example.txt
sshuser@hn0-cgahba:~$ hdfs --config /etc/hadoop/conf/ dfs -ls /test
Found 1 items
-rw-r--r--   1 sshuser sshuser          0 2019-02-13 11:58 /test/example.txt
sshuser@hn0-cgahba:~$ hadoop --config /etc/hadoop/conf jar hadoop-mapreduce-examples-2.7.2.jar wordcount /test/example.txt out
19/02/13 11:58:36 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.22:10200
19/02/13 11:58:36 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
19/02/13 11:58:36 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
19/02/13 11:58:36 INFO input.FileInputFormat: Total input paths to process : 1
19/02/13 11:58:36 INFO mapreduce.JobSubmitter: number of splits:1
19/02/13 11:58:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1549630099904_0014
19/02/13 11:58:37 INFO impl.YarnClientImpl: Submitted application application_1549630099904_0014
19/02/13 11:58:37 INFO mapreduce.Job: The url to track the job: http://hn1-cgahba.lagmfxvytx3unc1qt1zkekg32a.ax.internal.cloudapp.net:8088/proxy/application_1549630099904_0014/
19/02/13 11:58:37 INFO mapreduce.Job: Running job: job_1549630099904_0014
19/02/13 11:58:48 INFO mapreduce.Job: Job job_1549630099904_0014 running in uber mode : false
19/02/13 11:58:48 INFO mapreduce.Job:  map 0% reduce 0%
19/02/13 11:58:52 INFO mapreduce.Job:  map 100% reduce 0%
19/02/13 11:58:57 INFO mapreduce.Job:  map 100% reduce 100%
19/02/13 11:58:58 INFO mapreduce.Job: Job job_1549630099904_0014 completed successfully
19/02/13 11:58:58 INFO mapreduce.Job: Counters: 49
        File System Counters
                ABFS: Number of bytes read=138
                ABFS: Number of bytes written=0
                ABFS: Number of read operations=0
                ABFS: Number of large read operations=0
                ABFS: Number of write operations=0
                FILE: Number of bytes read=6
                FILE: Number of bytes written=327517
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2502
                Total time spent by all reduces in occupied slots (ms)=4796
                Total time spent by all map tasks (ms)=2502
                Total time spent by all reduce tasks (ms)=2398
                Total vcore-milliseconds taken by all map tasks=2502
                Total vcore-milliseconds taken by all reduce tasks=2398
                Total megabyte-milliseconds taken by all map tasks=1921536
                Total megabyte-milliseconds taken by all reduce tasks=3683328
        Map-Reduce Framework
                Map input records=0
                Map output records=0
                Map output bytes=0
                Map output materialized bytes=6
                Input split bytes=138
                Combine input records=0
                Combine output records=0
                Reduce input groups=0
                Reduce shuffle bytes=6
                Reduce input records=0
                Reduce output records=0
                Spilled Records=0
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=153
                CPU time spent (ms)=2320
                Physical memory (bytes) snapshot=696377344
                Virtual memory (bytes) snapshot=5163016192
                Total committed heap usage (bytes)=602931200
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0

Need to investigate whether zookeeper serves the headnodehost value to hadoop

jjcollinge commented 5 years ago

Connected to ZK and checked; it doesn't provide headnodehost. However...

headnodehost does actually resolve on the HDI node, as it is in the /etc/hosts file, i.e.

10.0.0.22       hn0-cgahba.lagmfxvytx3unc1qt1zkekg32a.ax.internal.cloudapp.net  hn0-cgahba.lagmfxvytx3unc1qt1zkekg32a.ax.internal.cloudapp.net. headnodehost    hn0-cgahba      headnodehost.

This doesn't work via nslookup but does via ping, as nslookup queries the DNS server directly and doesn't respect the /etc/hosts file.
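A quick way to see the difference is to compare resolution through the normal NSS path (which consults /etc/hosts) with a direct DNS query:

getent hosts headnodehost   # goes through /etc/hosts, so this resolves on the HDI node
nslookup headnodehost       # queries the DNS server directly, so this fails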

jjcollinge commented 5 years ago

headnodehost is intended for use inside the HDI cluster only, as their clients will retry if it fails, in the event that the primary headnode has failed and the secondary headnode needs to be tried. A potential option might be to create a script and alias it to hadoop: it would take the same parameters as the hadoop CLI, scp the jar up to the HDI node, and then remote-execute the hadoop CLI on the HDI node directly over ssh. This is a hack, but it would save us getting the CLI working and modifying the config. It'd also allow us to retry if the node is unhealthy.

This would be an issue if the client needs to connect directly to storage to upload any local files or something like that.
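A rough sketch of what such a wrapper could look like (untested; the host and user are placeholders, the argument quoting is naive, and it assumes any jar argument can live at the same path on the headnode):

#!/bin/bash
# Hypothetical "hadoop" wrapper: copy any local jar arguments to the HDI headnode,
# then run the real hadoop CLI there over ssh with the same arguments.
HDI_HOST=headnodehost.example.internal   # placeholder
HDI_USER=sshuser                         # placeholder

for arg in "$@"; do
  if [[ "$arg" == *.jar && -f "$arg" ]]; then
    ssh "${HDI_USER}@${HDI_HOST}" "mkdir -p $(dirname "$arg")"
    scp "$arg" "${HDI_USER}@${HDI_HOST}:$arg"
  fi
done

exec ssh "${HDI_USER}@${HDI_HOST}" hadoop "$@"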

lawrencegripper commented 5 years ago

Based on current work, the following changes are needed in the init container. See comment below.

Comment out:

    <!-- property>
      <name>fs.azure.account.keyprovider.opencgajtmfcxofaem6k.blob.core.windows.net</name>
      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
    </property -->

    <!-- property>
      <name>fs.azure.account.keyprovider.opencgajtmfcxofaem6k.dfs.core.windows.net</name>
      <value>org.apache.hadoop.fs.azurebfs.services.ShellDecryptionKeyProvider</value>
    </property -->

and

    <!-- property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
    </property -->

In the following blocks, replace DATA_HERE with the Azure storage key:

    <property>
      <name>fs.azure.account.key.opencgajtmfcxofaem6k.blob.core.windows.net</name>
      <value>DATA_HERE</value>
    </property>

    <property>
      <name>fs.azure.account.key.opencgajtmfcxofaem6k.dfs.core.windows.net</name>
      <value>DATA_HERE</value>
    </property>
lawrencegripper commented 5 years ago

We're going to move to ssh'ing into the cluster and using this approach to run commands.

@j-coll is working on edits to make this work in the Java code. It will pick up the host, user and password from the variant storage config and use these to run the commands on the cluster.

@jjcollinge is updating the init container to inject these config values correctly and (may) also copy up the files needed, or @lawrencegripper (me) will pick this up.

@lawrencegripper is going to reach out to the HDInsight team to request confirmation on how/when the /etc/hosts files are updated, and have a bit of a look around myself.

j-coll commented 5 years ago

Regarding the list of files to copy to the cluster:

To simplify things, the files on the hadoop node should be in the same path as on the batch node. Otherwise, I can "fix" the command to execute with sed.

Select the files to copy:

find /opt/opencga/ -name opencga-storage-hadoop-core-*-jar-with-dependencies.jar
find /opt/opencga/libs/ -name "jackson-*-2.[0-9].[0-9].jar"
find /opt/opencga/libs/ -name "protobuf-java-*.jar"
find /opt/opencga/libs/ -name "avro-*.jar"

Files to copy

echo $BASEDIR/opencga-storage-hadoop-core-*-jar-with-dependencies.jar
echo $HADOOP_CLASSPATH | tr ":" "\n" | grep jar
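A hedged sketch of copying those jars to the hadoop node at the same paths (untested; ${HADOOP_NODE} is the same placeholder used earlier in the thread):

for jar in $(find /opt/opencga/ -name "opencga-storage-hadoop-core-*-jar-with-dependencies.jar") \
           $(find /opt/opencga/libs/ -name "jackson-*-2.[0-9].[0-9].jar") \
           $(find /opt/opencga/libs/ -name "protobuf-java-*.jar") \
           $(find /opt/opencga/libs/ -name "avro-*.jar"); do
    ssh ${HADOOP_NODE} "mkdir -p $(dirname "$jar")"   # keep the same path on the hadoop node
    scp "$jar" ${HADOOP_NODE}:"$jar"
done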

jjcollinge commented 5 years ago

Great, thanks. I'll aim for option 1, as it'd be best to get the e2e working before optimizing.

lawrencegripper commented 5 years ago

Resolved in #1156 and #1157, which were merged in #1164.