michalklempa opened this issue 8 years ago
I am in trouble! @mfernest @mikeridley
I've exhausted HDFS in a strange way. Each machine originally had /data/1
on a 100 GB disk.
I have already added another 100% of storage underneath HDFS (/data/2/) to temporarily get the cluster back to a healthy state. The dfsadmin output below shows ~50% utilization because of that, but imagine how it was before.
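For reference, this is roughly how the second disk was added — a sketch assuming the standard `dfs.datanode.data.dir` property and the directory layout shown in the `du` walkthrough further down:

```xml
<!-- hdfs-site.xml on each DataNode: list both mount points so the
     new /data/2 disk serves block storage alongside /data/1 -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```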
[hdfs@klempa2 ~]$ hdfs dfsadmin -report
Configured Capacity: 949987160070 (884.74 GB)
Present Capacity: 949638275487 (884.42 GB)
DFS Remaining: 482583400863 (449.44 GB)
DFS Used: 467054874624 (434.98 GB)
DFS Used%: 49.18%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
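The summary numbers above are internally consistent: Present Capacity should equal DFS Used plus DFS Remaining (Non-DFS use is already excluded), and it does:

```shell
# DFS Used + DFS Remaining from the report above
echo $((467054874624 + 482583400863))
# 949638275487 = Present Capacity, exactly as reported
```

So the report itself isn't miscounting; the question is what the "DFS Used" bytes belong to.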
-------------------------------------------------
Live datanodes (5):
Name: 172.31.9.86:1004 (klempa1.cdh.seb)
Hostname: klempa1.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 91269267456 (85.00 GB)
Non DFS Used: 152301568 (145.25 MB)
DFS Remaining: 98575862990 (91.81 GB)
DFS Used%: 48.04%
DFS Remaining%: 51.88%
Configured Cache Capacity: 10737418240 (10 GB)
Cache Used: 996069376 (949.93 MB)
Cache Remaining: 9741348864 (9.07 GB)
Cache Used%: 9.28%
Cache Remaining%: 90.72%
Xceivers: 2
Last contact: Thu Sep 22 10:01:23 EDT 2016
Name: 172.31.9.83:1004 (klempa5.cdh.seb)
Hostname: klempa5.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 93907169280 (87.46 GB)
Non DFS Used: 144773120 (138.07 MB)
DFS Remaining: 95945489614 (89.36 GB)
DFS Used%: 49.43%
DFS Remaining%: 50.50%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 2973855744 (2.77 GB)
Cache Remaining: 1321111552 (1.23 GB)
Cache Used%: 69.24%
Cache Remaining%: 30.76%
Xceivers: 2
Last contact: Thu Sep 22 10:01:21 EDT 2016
Name: 172.31.9.85:1004 (klempa3.cdh.seb)
Hostname: klempa3.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 95296790528 (88.75 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 94936071783 (88.42 GB)
DFS Used%: 50.16%
DFS Remaining%: 49.97%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 2008481792 (1.87 GB)
Cache Remaining: 2286485504 (2.13 GB)
Cache Used%: 46.76%
Cache Remaining%: 53.24%
Xceivers: 2
Last contact: Thu Sep 22 10:01:24 EDT 2016
Name: 172.31.9.82:1004 (klempa4.cdh.seb)
Hostname: klempa4.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 94763954176 (88.26 GB)
Non DFS Used: 125235200 (119.43 MB)
DFS Remaining: 95108242638 (88.58 GB)
DFS Used%: 49.88%
DFS Remaining%: 50.06%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 3010818048 (2.80 GB)
Cache Remaining: 1284149248 (1.20 GB)
Cache Used%: 70.10%
Cache Remaining%: 29.90%
Xceivers: 2
Last contact: Thu Sep 22 10:01:24 EDT 2016
Name: 172.31.9.84:1004 (klempa2.cdh.seb)
Hostname: klempa2.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 91817693184 (85.51 GB)
Non DFS Used: 162004992 (154.50 MB)
DFS Remaining: 98017733838 (91.29 GB)
DFS Used%: 48.33%
DFS Remaining%: 51.59%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 1010798592 (963.97 MB)
Cache Remaining: 3284168704 (3.06 GB)
Cache Used%: 23.53%
Cache Remaining%: 76.47%
Xceivers: 2
Last contact: Thu Sep 22 10:01:23 EDT 2016
But du over the whole namespace shows far less:
[hdfs@klempa2 ~]$ hdfs dfs -du -s -h /
29.7 G 70.4 G /
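The two columns of `hdfs dfs -du -s` are the logical size and the size including all replicas, so the right number to compare against dfsadmin's "DFS Used" (which counts raw bytes on the DataNode disks) is 70.4 GB. That leaves a large gap no file in the namespace accounts for:

```shell
# DFS Used per dfsadmin (434.98 GB) minus namespace size with replicas (70.4 GB)
awk 'BEGIN { printf "%.1f GB\n", 434.98 - 70.4 }'
# ~364.6 GB of block storage with no corresponding files
```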
And yet the local disk really is full:
[root@klempa2 ec2-user]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 40G 6,2G 32G 17% /
tmpfs 7,3G 0 7,3G 0% /dev/shm
/dev/xvdf1 99G 86G 13G 88% /data/1
cm_processes 7,3G 53M 7,3G 1% /var/run/cloudera-scm-agent/process
/dev/xvdg1 99G 60M 99G 1% /data/2
Now let's examine /data/1:
[root@klempa2 ec2-user]# cd /data/1/
[root@klempa2 1]# du -sh *
86G dfs
16K lost+found
156K yarn
[root@klempa2 1]# cd dfs/
[root@klempa2 dfs]# du -sh *
86G dn
19M jn
17M nn
[root@klempa2 dfs]# cd dn
[root@klempa2 dn]# du -sh *
86G current
4,0K in_use.lock
[root@klempa2 dn]# cd current/
[root@klempa2 current]# du -sh *
86G BP-1704656809-172.31.9.84-1474348946517
4,0K VERSION
[root@klempa2 current]# cd BP-1704656809-172.31.9.84-1474348946517/
[root@klempa2 BP-1704656809-172.31.9.84-1474348946517]# du -sh *
12G current
0 RollingUpgradeInProgress
4,0K scanner.cursor
4,0K tmp
74G trash
[root@klempa2 BP-1704656809-172.31.9.84-1474348946517]# cd trash/
[root@klempa2 trash]# du -sh *
73G finalized
1,6G rbw
[root@klempa2 trash]# cd finalized/
[root@klempa2 finalized]# du -sh *
73G subdir0
[root@klempa2 finalized]# cd subdir0/
[root@klempa2 subdir0]# du -sh *
508K subdir16
350M subdir17
351M subdir18
230M subdir19
87M subdir20
108M subdir21
9,9M subdir22
106M subdir23
9,1M subdir24
25M subdir25
31M subdir26
13M subdir27
7,3M subdir28
674M subdir31
10G subdir32
7,6G subdir33
6,6G subdir34
6,8G subdir35
7,0G subdir36
13G subdir37
9,3G subdir38
8,6G subdir39
2,5G subdir40
1,4M subdir41
380K subdir42
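The `trash` directory under the block pool (together with the `RollingUpgradeInProgress` marker above) is created by the rolling-upgrade machinery: while an upgrade is pending, deleted block files are moved into `trash` instead of being removed, and they still count as DFS Used. If the upgrade was never finalized, `hdfs dfsadmin -rollingUpgrade query` should show it pending, and `hdfs dfsadmin -rollingUpgrade finalize` normally lets the DataNodes purge `trash`. On this node, trash holds the bulk of the space:

```shell
# Of the 86 GB under dfs/dn on this node, the rolling-upgrade trash
# holds 74 GB, per the du output above:
awk 'BEGIN { printf "%.0f%%\n", 100 * 74 / 86 }'
# ~86% of the node's block storage is trash, not live blocks
```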
I'm not sure why hdfs dfsadmin -report shows ~435 GB used while hdfs dfs -du -s -h shows only ~70 GB even counting replicas. My suspicion is that the rest of the space is held by the HDFS trash. Try running hadoop fs -expunge and see if that gets your space back.
Done that; the problem persists.
Let's sync up at the next lab time to take a look. I'm not really sure what is going on yet.
If @mfernest has a chance to look at this, it would be helpful. I'm a bit puzzled. Both hdfs fsck and the NN web UI show 435 GB used, yet I can't find anywhere in the filesystem where this space is actually consumed. There are blocks on the DataNodes, but it's not clear where their corresponding files are.
We did discover there was an incomplete rolling upgrade in progress, but even after completing that it hasn't released any blocks. We tried restarting the HDFS service, but no success. I'm at a bit of a loss for what could be using this space.
Is there any chance HDFS snapshots are involved? That would explain storage being consumed without any evidence of it in a filesystem listing.
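Snapshots should be easy to rule in or out — a sketch assuming a stock Hadoop CLI:

```shell
# List every directory on which snapshots may exist; empty output
# means no snapshots can be holding on to deleted blocks.
hdfs lsSnapshottableDir

# fsck can also include snapshot paths in its block accounting:
hdfs fsck / -includeSnapshots
```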