linkedin / dynamometer

A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
BSD 2-Clause "Simplified" License

start-dynamometer-cluster.sh start NameNode failed #98

Closed Peterlcc closed 5 years ago

Peterlcc commented 5 years ago

The start-dynamometer-cluster.sh command:

./bin/start-dynamometer-cluster.sh -conf_path /root/dynamometer/dynamometer0.1.7/myconf -fs_image_dir hdfs:///dyno/fsimage -block_list_path hdfs:///dyno/blocks -hadoop_binary_path /root/dynamometer/dynamometer0.1.7/hadoop-2.8.3.tar.gz

The console error info:

19/07/18 11:23:37 INFO impl.YarnClientImpl: Submitted application application_1563419715675_0002
19/07/18 11:23:38 INFO dynamometer.Client: Track the application at: http://centos-node1:8088/proxy/application_1563419715675_0002/
19/07/18 11:23:38 INFO dynamometer.Client: Kill the application using: yarn application -kill application_1563419715675_0002
19/07/18 11:23:58 INFO dynamometer.Client: NameNode can be reached via HDFS at: hdfs://centos-node2:9002/
19/07/18 11:23:58 INFO dynamometer.Client: NameNode web UI available at: http://centos-node2:50077/
19/07/18 11:23:58 INFO dynamometer.Client: NameNode can be tracked at: http://centos-node2:8042/node/containerlogs/container_1563419715675_0002_01_000002/root/
19/07/18 11:23:58 INFO dynamometer.Client: Waiting for NameNode to finish starting up...
19/07/18 11:24:09 INFO dynamometer.Client: Infra app exited unexpectedly. YarnState=FINISHED. Exiting from client.
19/07/18 11:24:09 INFO dynamometer.Client: Attempting to clean up remaining running applications.
19/07/18 11:24:09 ERROR dynamometer.Client: Application failed to complete successfully

Then I used "$?" to check the exit status of the last command:

[root@centos-node1 dynamometer0.1.7]# echo $?
2
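Since `$?` is overwritten by every subsequent command, a safe pattern is to capture it immediately after the command of interest. A minimal sketch of that pattern; the `sh -c 'exit 2'` is only a stand-in for the real start-dynamometer-cluster.sh invocation:

```shell
# $? reflects only the most recent command, so save it right away.
# `sh -c 'exit 2'` stands in for the real launcher command here.
rc=0
sh -c 'exit 2' || rc=$?   # capture the failure status without aborting under `set -e`
echo "client exited with status $rc"
if [ "$rc" -ne 0 ]; then
  echo "launch failed"
fi
```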

Inspired by the previous question, I checked the application's stderr:

cat: metricsTailPIDFile: No such file or directory
./start-component.sh: line 299: 2207 Terminated sleep 1

and checked the application's stdout:

starting namenode, logging to /export/server/hadoop-2.8.3/logs/userlogs/application_1563419715675_0002/container_1563419715675_0002_01_000002/hadoop-root-namenode-centos-node2.out
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Going to attempt to upload metrics to: hdfs://centos-node1:9000/user/root/.dynamometer/application_1563419715675_0002/namenode_metrics
Unable to upload metrics to HDFS
Started namenode at pid 2124
Waiting for parent process (PID: 2005) OR namenode process to exit
Cleaning up namenode at pid 2124
Deleting any remaining files

Is there something wrong in my configuration?

Peterlcc commented 5 years ago

Sorry for my negligence: I found I had left core-site.xml out of the myconf directory. I fixed it immediately, but things are still not going well; it still failed. When I look at the stderr file, it only shows:

./start-component.sh: line 299: 2284 Terminated sleep 1

I was lost again, so I tried another direction: I added the option "-namenode_servicerpc_addr centos-node1:9000" to use the physical NameNode, and it worked! The ApplicationMaster started! I think maybe some configuration describing the relationship between the simulated NameNode and the simulated DataNodes was missing. Did I miss it? Meanwhile, I'm confused about where that relationship is set.

xkrogen commented 5 years ago

Hi @Peterlcc , it looks like the NameNode is failing to start up. Typically the NameNode will produce logs into a file distinct from the stderr or stdout, do you have any such log files from the container running the NameNode?

By the way, you've shared stderr/stdout logs from the NameNode container, but in general it can also be helpful to look at the logs for the ApplicationMaster. You can use this link http://centos-node1:8088/proxy/application_1563419715675_0002/ and click through to the AM logs. Though I think in this case it probably won't contain much useful information, since the issue seems to be with the NN.

Peterlcc commented 5 years ago

Hi @xkrogen, thanks for your help. I ran it again and found that directory, but it only has these files:

-rw-r--r--. 1 root root  37K Jul 19 09:20 hadoop-root-namenode-centos-node1.log
-rw-r--r--. 1 root root  714 Jul 19 09:20 hadoop-root-namenode-centos-node1.out
-rw-r--r--. 1 root root    0 Jul 19 09:19 hdfs-audit.log
-rw-r--r--. 1 root root    0 Jul 19 09:20 namenode_metrics
-rw-r--r--. 1 root root    0 Jul 19 09:20 SecurityAuth-root.audit
-rw-r--r--. 1 root root   70 Jul 19 09:20 stderr
-rw-r--r--. 1 root root 8.6K Jul 19 09:20 stdout

In hadoop-root-namenode-centos-node1.log, I only find INFO-level information:

2019-07-19 09:20:03,043 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2019-07-19 09:20:03,077 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode [-D, fs.defaultFS=hdfs://centos-node1:9002, -D, dfs.namenode.rpc-address=centos-node1:9002, -D, dfs.namenode.servicerpc-address=centos-node1:9022, -D, dfs.namenode.http-address=centos-node1:50077, -D, dfs.namenode.https-address=centos-node1:0, -D, dfs.namenode.name.dir=file:///export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node/name-data, -D, dfs.namenode.edits.dir=file:///export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node/name-data, -D, dfs.namenode.checkpoint.dir=file:///export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node/checkpoint, -D, dfs.namenode.kerberos.internal.spnego.principal=, -D, dfs.hosts=, -D, dfs.hosts.exclude=, -D, dfs.namenode.legacy-oiv-image.dir=, -D, dfs.namenode.kerberos.principal=, -D, dfs.namenode.keytab.file=, -D, dfs.namenode.safemode.threshold-pct=0.0f, -D, dfs.permissions.enabled=true, -D, dfs.cluster.administrators="", -D, dfs.block.replicator.classname=com.linkedin.dynamometer.BlockPlacementPolicyAlwaysSatisfied, -D, hadoop.security.impersonation.provider.class=com.linkedin.dynamometer.AllowAllImpersonationProvider, -D, hadoop.tmp.dir=/export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node, -D, hadoop.security.authentication=simple, -D, hadoop.security.authorization=false, -D, dfs.http.policy=HTTP_ONLY, -D, dfs.nameservices=, -D, dfs.web.authentication.kerberos.principal=, -D, dfs.web.authentication.kerberos.keytab=, -D, hadoop.http.filter.initializers=, -D, dfs.datanode.kerberos.principal=, -D, dfs.datanode.keytab.file=, -D, dfs.domain.socket.path=, -D, dfs.client.read.shortcircuit=false]
2019-07-19 09:20:04,041 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2-namenode.properties
2019-07-19 09:20:04,093 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink dyno-file started
2019-07-19 09:20:04,260 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 60 second(s).
2019-07-19 09:20:04,260 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2019-07-19 09:20:04,301 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is hdfs://centos-node1:9002
2019-07-19 09:20:04,307 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use centos-node1:9002 to access this namenode/service.
2019-07-19 09:20:05,011 INFO org.apache.hadoop.util.JvmPauseMonitor: Starting JVM pause monitor
2019-07-19 09:20:05,033 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://centos-node1:50077
2019-07-19 09:20:05,140 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2019-07-19 09:20:05,146 INFO org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
2019-07-19 09:20:05,158 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined
2019-07-19 09:20:05,168 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2019-07-19 09:20:05,386 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)
2019-07-19 09:20:05,392 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/
2019-07-19 09:20:05,412 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 50077
2019-07-19 09:20:05,412 INFO org.mortbay.log: jetty-6.1.26
2019-07-19 09:20:05,732 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@centos-node1:50077
2019-07-19 09:20:05,814 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
2019-07-19 09:20:05,815 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!
2019-07-19 09:20:05,906 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Edit logging is async:true
2019-07-19 09:20:05,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: KeyProvider: null
2019-07-19 09:20:05,928 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsLock is fair: true
2019-07-19 09:20:05,930 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2019-07-19 09:20:06,020 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
2019-07-19 09:20:06,037 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2019-07-19 09:20:06,041 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2019-07-19 09:20:06,042 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: The block deletion will start around 2019 Jul 19 09:20:06
2019-07-19 09:20:06,052 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlocksMap
2019-07-19 09:20:06,052 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2019-07-19 09:20:06,066 INFO org.apache.hadoop.util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
2019-07-19 09:20:06,066 INFO org.apache.hadoop.util.GSet: capacity = 2^21 = 2097152 entries
2019-07-19 09:20:06,670 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=false
2019-07-19 09:20:06,673 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication = 2
2019-07-19 09:20:06,673 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication = 512
2019-07-19 09:20:06,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication = 1
2019-07-19 09:20:06,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams = 2
2019-07-19 09:20:06,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000
2019-07-19 09:20:06,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: encryptDataTransfer = false
2019-07-19 09:20:06,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxNumBlocksToLog = 1000
2019-07-19 09:20:06,689 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner = root (auth:SIMPLE)
2019-07-19 09:20:06,689 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup = supergroup
2019-07-19 09:20:06,689 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled = true
2019-07-19 09:20:06,689 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: HA Enabled: false
2019-07-19 09:20:06,691 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true
2019-07-19 09:20:08,177 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap
2019-07-19 09:20:08,178 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2019-07-19 09:20:08,178 INFO org.apache.hadoop.util.GSet: 1.0% max memory 966.7 MB = 9.7 MB
2019-07-19 09:20:08,178 INFO org.apache.hadoop.util.GSet: capacity = 2^20 = 1048576 entries
2019-07-19 09:20:08,191 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: ACLs enabled? false
2019-07-19 09:20:08,191 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: XAttrs enabled? true
2019-07-19 09:20:08,192 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occurring more than 10 times
2019-07-19 09:20:08,237 INFO org.apache.hadoop.util.GSet: Computing capacity for map cachedBlocks
2019-07-19 09:20:08,237 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2019-07-19 09:20:08,237 INFO org.apache.hadoop.util.GSet: 0.25% max memory 966.7 MB = 2.4 MB
2019-07-19 09:20:08,238 INFO org.apache.hadoop.util.GSet: capacity = 2^18 = 262144 entries
2019-07-19 09:20:08,241 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.0
2019-07-19 09:20:08,241 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
2019-07-19 09:20:08,241 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
2019-07-19 09:20:08,282 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2019-07-19 09:20:08,282 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2019-07-19 09:20:08,282 INFO org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2019-07-19 09:20:08,284 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache on namenode is enabled
2019-07-19 09:20:08,284 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2019-07-19 09:20:08,286 INFO org.apache.hadoop.util.GSet: Computing capacity for map NameNodeRetryCache
2019-07-19 09:20:08,286 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2019-07-19 09:20:08,286 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
2019-07-19 09:20:08,286 INFO org.apache.hadoop.util.GSet: capacity = 2^15 = 32768 entries
2019-07-19 09:20:08,556 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node/name-data/in_use.lock acquired by nodename 2741@centos-node1
2019-07-19 09:20:08,625 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node/name-data/current
2019-07-19 09:20:08,629 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: No edit log streams selected.
2019-07-19 09:20:08,629 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Planning to load image: FSImageFile(file=/export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node/name-data/current/fsimage_0000000000000028332, cpktTxId=0000000000000028332)
2019-07-19 09:20:08,860 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 1134 INodes.
2019-07-19 09:20:09,390 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 0 seconds.
2019-07-19 09:20:09,391 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 28332 from /export/data/hddata/tmp/nm-local-dir/usercache/root/appcache/application_1563499091138_0001/container_1563499091138_0001_01_000002/dyno-node/name-data/current/fsimage_0000000000000028332
2019-07-19 09:20:09,394 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=false, haEnabled=false, isRollingUpgrade=false)
2019-07-19 09:20:09,407 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 28333
2019-07-19 09:20:09,895 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups
2019-07-19 09:20:09,895 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 1605 msecs
2019-07-19 09:20:11,024 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Service RPC server is binding to centos-node1:9022
2019-07-19 09:20:11,069 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 1000 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler

And the stderr still shows:

./start-component.sh: line 299: 2834 Terminated sleep 1

I looked at start-component.sh, and I think the job has been submitted; am I right? Afterwards, I found that the namenode_metrics file is empty. Is that a problem I overlooked?
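For context on that recurring stderr line: a "Terminated sleep 1" notice like this is what the shell reports when a backgrounded `sleep` in a polling loop is killed along with the script, and the child's wait status becomes 128 + 15 (SIGTERM). A minimal, self-contained sketch of that behavior (not Dynamometer's actual script):

```shell
# Background a sleep, as a metrics-polling loop might, then terminate it.
sleep 60 &
spid=$!
kill "$spid"              # deliver SIGTERM, as happens when the container is torn down
status=0
wait "$spid" || status=$?   # reap the child; capture its wait status
echo "sleep exit status: $status"   # 143 = 128 + SIGTERM(15)
```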

xkrogen commented 5 years ago

@Peterlcc, sorry for taking a while to respond. Thanks for sharing the NameNode logs. You're right that they look fine. I took a closer look at the stdout:

Started namenode at pid 2124
Waiting for parent process (PID: 2005) OR namenode process to exit
Cleaning up namenode at pid 2124
Deleting any remaining files

and it seems something fishy is going on; if the end of the script was reached, one of the two branches should have been triggered:

if kill -0 $PPID 2>/dev/null; then
  echo "$component process exited; continuing to finish"
  exit 1
else
  echo "Parent process exited; continuing to finish"
  exit 0
fi

Line 299 in the stderr doesn't line up with the line numbers at the head of master, which has this sleep on line 316 (https://github.com/linkedin/dynamometer/blob/master/dynamometer-infra/src/main/resources/start-component.sh#L316). But it would seem that the process probably received a kill/term signal.
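As background on the check quoted above: `kill -0` sends no signal at all; its exit status only reports whether the target PID currently exists and can be signaled. A standalone sketch of the same liveness test, using a `sleep` as a hypothetical stand-in for the NameNode process:

```shell
# Start a hypothetical child process standing in for the NameNode.
sleep 60 &
child=$!

# `kill -0` delivers no signal; a zero exit status means the process exists.
if kill -0 "$child" 2>/dev/null; then
  echo "process $child is alive"
fi

kill "$child"                     # now actually terminate it
wait "$child" 2>/dev/null || true # reap the child so the PID disappears

if ! kill -0 "$child" 2>/dev/null; then
  echo "process $child is gone"
fi
```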

Can you follow my instructions for taking a look at the Application Master logs? Perhaps the AM decided that it needed to exit and kill the NN container for some reason.

Peterlcc commented 5 years ago

@xkrogen, thanks for your patient help. I tried another Hadoop version, 2.7.4, in pseudo-distributed mode in VMware. To get it running smoothly, I increased the memory available to YARN (yarn.nodemanager.resource.memory-mb, yarn.scheduler.maximum-allocation-mb). It worked: the simulated NameNode started! Then I made the same change in the previous fully-distributed setup in VMware, and the simulated NameNode started there too. But when I start the workload, there is a problem: the workload gets stuck at map 0%. I found this in the container log:

19/07/25 10:25:05 INFO dynamometer.ApplicationMaster: Initializing ApplicationMaster
19/07/25 10:25:06 INFO dynamometer.ApplicationMaster: Application master for app, appId=1, clustertimestamp=1564021161624, attemptId=2
19/07/25 10:25:06 INFO dynamometer.ApplicationMaster: Starting ApplicationMaster
19/07/25 10:25:07 INFO client.RMProxy: Connecting to ResourceManager at centos-node1/192.168.52.132:8030
19/07/25 10:25:07 INFO impl.NMClientAsyncImpl: Upper bound of the thread pool size is 500
19/07/25 10:25:07 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/07/25 10:25:08 INFO dynamometer.ApplicationMaster: Requested NameNode ask: Capability[<memory:2048, vCores:1>]Priority[0]
19/07/25 10:25:08 INFO dynamometer.ApplicationMaster: Waiting on availability of NameNode information at hdfs://centos-node1:9000/user/root/.dynamometer/application_1564021161624_0001/nn_info.prop
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: NameNode information: {NM_HTTP_PORT=8042, NN_HOSTNAME=centos-node2, NN_HTTP_PORT=50077, NN_SERVICERPC_PORT=9022, NN_RPC_PORT=9002, CONTAINER_ID=container_1564021161624_0001_01_000002}
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: NameNode can be reached at: hdfs://centos-node2:9002/
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: Waiting for NameNode to finish starting up...
19/07/25 10:25:10 INFO impl.AMRMClientImpl: Received new token for : centos-node2:40931
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: Launching NAMENODE on a new container., containerId=container_1564021161624_0001_02_000002, containerNode=centos-node2:40931, containerNodeURI=centos-node2:8042, containerResourceMemory=2048, containerResourceVirtualCores=1
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: Setting up container launch context for containerid=container_1564021161624_0001_02_000002, isNameNode=true
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: Completed setting up command for namenode: [./start-component.sh, namenode, hdfs://centos-node1:9000/user/root/.dynamometer/application_1564021161624_0001, 1><LOG_DIR>/stdout, 2><LOG_DIR>/stderr]
19/07/25 10:25:10 INFO dynamometer.ApplicationMaster: Starting NAMENODE; track at: http://centos-node2:8042/node/containerlogs/container_1564021161624_0001_02_000002/root/
19/07/25 10:25:11 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1564021161624_0001_02_000002
19/07/25 10:25:11 INFO dynamometer.ApplicationMaster: NameNode container started at ID container_1564021161624_0001_02_000002
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Startup progress = 1.00; above threshold of 1.00; done waiting after 75668 ms.
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: NameNode has started!
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Looking for block listing files in hdfs:/dyno/blocks
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Requesting 1 DataNode containers with 2048MB memory, 1 vcores,
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Finished requesting datanode containers
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Waiting for 0 DataNodes to register with the NameNode...
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Number of live DataNodes = 0.00; above threshold of 0.00; done waiting after 4 ms.
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Launching thread to trigger block reports for Datanodes with <2028 blocks reported
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Waiting for MissingBlocks to fall below 0.101399995...
19/07/25 10:26:26 INFO dynamometer.ApplicationMaster: Number of missing blocks: 708.00
19/07/25 10:26:28 INFO dynamometer.ApplicationMaster: Got response from RM for container ask, allocatedCnt=1
19/07/25 10:26:28 INFO dynamometer.ApplicationMaster: Launching DATANODE on a new container., containerId=container_1564021161624_0001_02_000003, containerNode=centos-node2:40931, containerNodeURI=centos-node2:8042, containerResourceMemory=2048, containerResourceVirtualCores=1
19/07/25 10:26:28 INFO dynamometer.ApplicationMaster: Setting up container launch context for containerid=container_1564021161624_0001_02_000003, isNameNode=false
19/07/25 10:26:28 INFO dynamometer.ApplicationMaster: Completed setting up command for datanode: [./start-component.sh, datanode, hdfs://centos-node2:9022/, 0, 1><LOG_DIR>/stdout, 2><LOG_DIR>/stderr]
19/07/25 10:26:28 INFO dynamometer.ApplicationMaster: Starting DATANODE; track at: http://centos-node2:8042/node/containerlogs/container_1564021161624_0001_02_000003/root/
19/07/25 10:26:28 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1564021161624_0001_02_000003
19/07/25 10:26:28 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1564021161624_0001_02_000003
19/07/25 10:26:32 INFO dynamometer.ApplicationMaster: Number of missing blocks: 1014.00
19/07/25 10:27:32 INFO dynamometer.ApplicationMaster: Number of missing blocks = 0.00; below threshold of 0.10; done waiting after 66257 ms.
19/07/25 10:27:32 INFO dynamometer.ApplicationMaster: Waiting for UnderReplicatedBlocks to fall below 10.139999...
19/07/25 10:27:32 INFO dynamometer.ApplicationMaster: Number of under replicated blocks: 1014.00
19/07/25 10:29:26 INFO dynamometer.ApplicationMaster: Queueing 1 Datanodes for block report: centos-node2:38065
19/07/25 10:29:27 FATAL yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-12,5,main] threw an Error. Shutting down now...
java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.DFSUtil.createClientDatanodeProtocolProxy(Ljava/net/InetSocketAddress;Lorg/apache/hadoop/security/UserGroupInformation;Lorg/apache/hadoop/conf/Configuration;Ljavax/net/SocketFactory;)Lorg/apache/hadoop/hdfs/protocol/ClientDatanodeProtocol;
	at com.linkedin.dynamometer.DynoInfraUtils.triggerDataNodeBlockReport(DynoInfraUtils.java:320)
	at com.linkedin.dynamometer.DynoInfraUtils.access$000(DynoInfraUtils.java:51)
	at com.linkedin.dynamometer.DynoInfraUtils$1.run(DynoInfraUtils.java:273)
19/07/25 10:29:27 INFO util.ExitUtil: Exiting with status -1

I saw the exception "java.lang.NoSuchMethodError". Could the version difference have caused it?

Meanwhile, that exception didn't appear on version 2.7.4, but the workload stayed stuck at map 2% there. Maybe I should open another GitHub issue to keep the issues organized.

xkrogen commented 5 years ago

Hm... It appears that API moved in HDFS-8925, which first appeared in 2.8.0. If you're using 2.7.4, you shouldn't encounter that bug. Are you using stock Hadoop 2.7.4 or a vendor release?

Either way, I'm putting together a commit now which will make it compatible with either version of the method (<=2.7 as well as 2.8+).

Peterlcc commented 5 years ago

@xkrogen At the beginning I used Hadoop 2.8.3; then I found the bug with the different API. Now I use Hadoop 2.7.4, and I ran the program successfully yesterday. Thank you for all your help.

xkrogen commented 5 years ago

Great! I'm tracking in #99 updating the README to have some discussion of what versions are properly supported. Sorry for the trouble.