linkedin / dynamometer

A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
BSD 2-Clause "Simplified" License

Executing Dynamometer workload #104

Open novosibman opened 4 years ago

novosibman commented 4 years ago

Hi,

I collected all the prerequisites (fsimage, audit log) and prepared a local environment (an accompanying HDFS instance and a separate YARN manager) according to the Dynamometer README, then tried to start the workload scripts. I tried Hadoop versions 2.7.4 and 2.8.4.

${DYN_HOME}/bin/upload-fsimage.sh 0894 ${HDFS_PATH}/fsimage \
    ${BASE_DIR}/fsimage-${HADOOP_VERSION} 

fsimage - passed
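(If it helps others following the same steps: one way to double-check this step, assuming the hdfs client is pointed at the accompanying HDFS instance, is to list the upload directory.)

    # The uploaded fsimage (plus any companion files the script copies, e.g. md5/VERSION)
    # should be visible under the target directory.
    hdfs dfs -ls ${HDFS_PATH}/fsimage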

${DYN_HOME}/bin/generate-block-lists.sh \
    -fsimage_input_path ${HDFS_PATH}/fsimage/fsimage_0000000000000000894.xml \
    -block_image_output_dir ${HDFS_PATH}/blocks \
    -num_reducers 10 -num_datanodes 3 

generate-block-lists - passed
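(Likewise, the generated block listings can be sanity-checked by listing the output directory; a sketch assuming the same paths as above.)

    # Expect per-DataNode block listing files here (3 simulated DataNodes were requested above).
    hdfs dfs -ls ${HDFS_PATH}/blocks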

${DYN_HOME}/bin/start-dynamometer-cluster.sh "" \
    -hadoop_binary_path file://${BASE_DIR}/hadoop-${HADOOP_VERSION}.tar.gz \
    -conf_path file://${BASE_DIR}/conf.zip \
    -fs_image_dir ${HDFS_PATH}/fsimage \
    -block_list_path ${HDFS_PATH}/blocks

start-dynamometer-cluster - looks to be working, according to the output:

...
19/10/18 03:56:56 INFO dynamometer.Client: NameNode has started!
19/10/18 03:56:56 INFO dynamometer.Client: Waiting for 2 DataNodes to register with the NameNode...
19/10/18 03:57:02 INFO dynamometer.Client: Number of live DataNodes = 2.00; above threshold of 2.00; done waiting after 6017 ms.
19/10/18 03:57:02 INFO dynamometer.Client: Waiting for MissingBlocks to fall below 0.010199999...
19/10/18 03:57:02 INFO dynamometer.Client: Number of missing blocks: 102.00
19/10/18 04:00:03 INFO dynamometer.Client: Number of missing blocks = 0.00; below threshold of 0.01; done waiting after 180082 ms.
19/10/18 04:00:03 INFO dynamometer.Client: Waiting for UnderReplicatedBlocks to fall below 1.02...
19/10/18 04:00:03 INFO dynamometer.Client: Number of under replicated blocks: 102.00
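(Side note for anyone watching these metrics: they can also be read straight from the Dyno-NameNode's JMX servlet; a rough sketch, where NN_HOST and NN_HTTP_PORT are placeholders for whatever host and HTTP port the Dynamometer client reports for the launched NameNode.)

    # MissingBlocks and UnderReplicatedBlocks are exposed under the FSNamesystem bean.
    curl -s "http://${NN_HOST}:${NN_HTTP_PORT}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" \
      | grep -E '"(MissingBlocks|UnderReplicatedBlocks)"'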

${DYN_HOME}/bin/start-workload.sh \
    -Dauditreplay.log-start-time.ms=1000 \
    -Dauditreplay.input-path=file://${BASE_DIR}/audit_logs-${HADOOP_VERSION} \
    -Dauditreplay.output-path=${RESULTS_DIR} \
    -Dauditreplay.num-threads=1 \
    -nn_uri hdfs://$HOSTNAME:9000/ \
    -start_time_offset 1m \
    -mapper_class_name AuditReplayMapper

start-workload - it started and never finished over a couple of hours, just repeating 'map > map':

19/10/18 04:07:53 INFO workloadgenerator.WorkloadDriver: The workload will start at 1571396933516 ms (2019/10/18 04:08:53 PDT)
19/10/18 04:07:54 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/10/18 04:07:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/10/18 04:07:55 INFO input.FileInputFormat: Total input files to process : 1
19/10/18 04:07:55 INFO mapreduce.JobSubmitter: number of splits:1
19/10/18 04:07:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local579807884_0001
19/10/18 04:07:55 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
19/10/18 04:07:55 INFO mapreduce.Job: Running job: job_local579807884_0001
19/10/18 04:07:55 INFO mapred.LocalJobRunner: OutputCommitter set in config null
19/10/18 04:07:55 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/10/18 04:07:55 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/10/18 04:07:55 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/10/18 04:07:55 INFO mapred.LocalJobRunner: Waiting for map tasks
19/10/18 04:07:55 INFO mapred.LocalJobRunner: Starting task: attempt_local579807884_0001_m_000000_0
19/10/18 04:07:55 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/10/18 04:07:55 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/10/18 04:07:55 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
19/10/18 04:07:55 INFO mapred.MapTask: Processing split: file:/home/rscherba/ws/hadoop/dynamometer-test/audit_logs-2.8.4/hdfs-audit.log:0+251649
19/10/18 04:07:55 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
19/10/18 04:07:55 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
19/10/18 04:07:55 INFO mapred.MapTask: soft limit at 83886080
19/10/18 04:07:55 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
19/10/18 04:07:55 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
19/10/18 04:07:55 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
19/10/18 04:07:55 INFO audit.AuditReplayMapper: Starting 1 threads
19/10/18 04:07:55 INFO audit.AuditReplayThread: Start timestamp: 1571396933516
19/10/18 04:07:55 INFO audit.AuditReplayThread: Sleeping for 57526 ms
19/10/18 04:07:56 INFO mapreduce.Job: Job job_local579807884_0001 running in uber mode : false
19/10/18 04:07:56 INFO mapreduce.Job: map 0% reduce 0%
19/10/18 04:08:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:13:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:18:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:23:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:28:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:33:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:38:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:43:07 INFO mapred.LocalJobRunner: map > map
...

How long should the Dynamometer workload run? How do the run-script arguments affect the test run? How can I check the logs to see whether something is wrong in the configuration?

xkrogen commented 4 years ago

Hi @novosibman , thanks for the detailed information!

19/10/18 04:00:03 INFO dynamometer.Client: Waiting for UnderReplicatedBlocks to fall below 1.02...
19/10/18 04:00:03 INFO dynamometer.Client: Number of under replicated blocks: 102.00

Did the number of under replicated blocks ever fall? The start-dynamometer-cluster.sh script will wait for this.

Can you also confirm that if you use an HDFS client (say, hdfs dfs) against hdfs://$HOSTNAME:9000/, it works as expected? For one thing, I don't believe 9000 is the correct port number. It will be chosen dynamically; you should look at the NameNode's generated properties file to know what the correct port is. If you run the workload using the start-dynamometer-cluster.sh script, it will handle this automatically.
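(For anyone trying this check, a minimal sketch; NN_HOST and NN_RPC_PORT are placeholders for whatever host and RPC port the Dynamometer client reports for the launched NameNode, not fixed values.)

    # Should list the replayed namespace if the Dyno-NameNode is reachable at that address.
    hdfs dfs -ls hdfs://${NN_HOST}:${NN_RPC_PORT}/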

19/10/18 04:07:55 INFO mapred.LocalJobRunner: Waiting for map tasks
19/10/18 04:07:55 INFO mapred.LocalJobRunner: Starting task: attempt_local579807884_0001_m_000000_0

It looks like your MR job is running locally rather than on a YARN cluster, is that what you intended?

Can you please try visiting the logs for your MR job and seeing if there is any useful information contained within them?

19/10/18 04:07:55 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
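(A rough way to confirm the local-vs-YARN question from the client output, assuming it was captured to a file; "workload-client.log" below is just a placeholder name. A LocalJobRunner job has an id of the form job_local*_0001 and a tracking URL of http://localhost:8080/, whereas a job submitted to YARN has an id like job_<cluster-timestamp>_<seq> and a tracking URL pointing at the ResourceManager.)

    # Hits here mean the MR job ran in-process via the LocalJobRunner rather than on YARN.
    grep -E "LocalJobRunner|job_local" workload-client.log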

novosibman commented 4 years ago

Hi Erik, thanks for quick answer!

You are right, 9000 is not the correct port for the Dyno-HDFS NameNode. I used 9000 for fs.defaultFS/fs.default.name in the conf for starting the accompanying HDFS instance and automatically included it in the nn_uri param. I have corrected it.

As for the accompanying YARN cluster, the general configuration looks as follows; all components are started out-of-the-box using the same hadoop-x.y.z version as in the Dynamometer scripts:

xkrogen commented 4 years ago

Great, so it looks like the Dyno-HDFS cluster is working as expected.

Your workload MapReduce job is not actually leveraging YARN at all. Is your mapreduce.framework.name configured to be yarn?
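(One quick sanity check, sketched under the assumption that the workload client's HADOOP_CONF_DIR points at the accompanying YARN cluster's conf: confirm the client can actually see the ResourceManager, and that the workload shows up as a YARN application while it runs.)

    yarn node -list          # should list the NodeManagers of the accompanying YARN cluster
    yarn application -list   # a correctly submitted workload job should appear here while it runs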

nshetty0202 commented 2 years ago

Running into the same issue here: the logs indicate the job is running locally instead of on YARN. mapreduce.framework.name is set to yarn in /opt/yarn/conf/mapred-site.xml:

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
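(A follow-up sketch for this situation, under the assumption that the Dynamometer workload script launches a standard Hadoop MapReduce client and therefore reads mapred-site.xml from HADOOP_CONF_DIR; /opt/yarn/conf is the conf dir mentioned above.)

    # Make sure the conf dir containing that mapred-site.xml is the one the workload client loads,
    # then re-run; the job id should lose its job_local prefix once the job is submitted to YARN.
    export HADOOP_CONF_DIR=/opt/yarn/conf
    hadoop classpath | tr ':' '\n' | grep -F "$HADOOP_CONF_DIR"   # conf dir should be on the classpath
    ${DYN_HOME}/bin/start-workload.sh ...   # same arguments as before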