michalklempa opened this issue 8 years ago
I am in trouble! @mfernest @mikeridley
I've exhausted HDFS in a strange way. Each machine originally had /data/1
on a 100 GB disk.
I have already added another 100% of storage underneath HDFS (/data/2/) to temporarily get the cluster back to a healthy state. The dfsadmin output below shows ~50% utilization because of that, but imagine how it was before.
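For reference, this is roughly how the second disk was added — a sketch assuming the standard `dfs.datanode.data.dir` property and the directory layout shown in the `du` walkthrough further down:

```xml
<!-- hdfs-site.xml on each DataNode: list both mount points so the
     new /data/2 disk serves block storage alongside /data/1 -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```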
[hdfs@klempa2 ~]$ hdfs dfsadmin -report
Configured Capacity: 949987160070 (884.74 GB)
Present Capacity: 949638275487 (884.42 GB)
DFS Remaining: 482583400863 (449.44 GB)
DFS Used: 467054874624 (434.98 GB)
DFS Used%: 49.18%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
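The summary numbers above are internally consistent: Present Capacity should equal DFS Used plus DFS Remaining (Non-DFS use is already excluded), and it does:

```shell
# DFS Used + DFS Remaining from the report above
echo $((467054874624 + 482583400863))
# 949638275487 = Present Capacity, exactly as reported
```

So the report itself isn't miscounting; the question is what the "DFS Used" bytes belong to.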
-------------------------------------------------
Live datanodes (5):
Name: 172.31.9.86:1004 (klempa1.cdh.seb)
Hostname: klempa1.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 91269267456 (85.00 GB)
Non DFS Used: 152301568 (145.25 MB)
DFS Remaining: 98575862990 (91.81 GB)
DFS Used%: 48.04%
DFS Remaining%: 51.88%
Configured Cache Capacity: 10737418240 (10 GB)
Cache Used: 996069376 (949.93 MB)
Cache Remaining: 9741348864 (9.07 GB)
Cache Used%: 9.28%
Cache Remaining%: 90.72%
Xceivers: 2
Last contact: Thu Sep 22 10:01:23 EDT 2016
Name: 172.31.9.83:1004 (klempa5.cdh.seb)
Hostname: klempa5.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 93907169280 (87.46 GB)
Non DFS Used: 144773120 (138.07 MB)
DFS Remaining: 95945489614 (89.36 GB)
DFS Used%: 49.43%
DFS Remaining%: 50.50%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 2973855744 (2.77 GB)
Cache Remaining: 1321111552 (1.23 GB)
Cache Used%: 69.24%
Cache Remaining%: 30.76%
Xceivers: 2
Last contact: Thu Sep 22 10:01:21 EDT 2016
Name: 172.31.9.85:1004 (klempa3.cdh.seb)
Hostname: klempa3.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 95296790528 (88.75 GB)
Non DFS Used: 0 (0 B)
DFS Remaining: 94936071783 (88.42 GB)
DFS Used%: 50.16%
DFS Remaining%: 49.97%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 2008481792 (1.87 GB)
Cache Remaining: 2286485504 (2.13 GB)
Cache Used%: 46.76%
Cache Remaining%: 53.24%
Xceivers: 2
Last contact: Thu Sep 22 10:01:24 EDT 2016
Name: 172.31.9.82:1004 (klempa4.cdh.seb)
Hostname: klempa4.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 94763954176 (88.26 GB)
Non DFS Used: 125235200 (119.43 MB)
DFS Remaining: 95108242638 (88.58 GB)
DFS Used%: 49.88%
DFS Remaining%: 50.06%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 3010818048 (2.80 GB)
Cache Remaining: 1284149248 (1.20 GB)
Cache Used%: 70.10%
Cache Remaining%: 29.90%
Xceivers: 2
Last contact: Thu Sep 22 10:01:24 EDT 2016
Name: 172.31.9.84:1004 (klempa2.cdh.seb)
Hostname: klempa2.cdh.seb
Rack: /default
Decommission Status : Normal
Configured Capacity: 189997432014 (176.95 GB)
DFS Used: 91817693184 (85.51 GB)
Non DFS Used: 162004992 (154.50 MB)
DFS Remaining: 98017733838 (91.29 GB)
DFS Used%: 48.33%
DFS Remaining%: 51.59%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 1010798592 (963.97 MB)
Cache Remaining: 3284168704 (3.06 GB)
Cache Used%: 23.53%
Cache Remaining%: 76.47%
Xceivers: 2
Last contact: Thu Sep 22 10:01:23 EDT 2016
But du over the whole namespace shows far less:
[hdfs@klempa2 ~]$ hdfs dfs -du -s -h /
29.7 G 70.4 G /
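The two columns of `hdfs dfs -du -s` are the logical size and the size including all replicas, so the right number to compare against dfsadmin's "DFS Used" (which counts raw bytes on the DataNode disks) is 70.4 GB. That leaves a large gap no file in the namespace accounts for:

```shell
# DFS Used per dfsadmin (434.98 GB) minus namespace size with replicas (70.4 GB)
awk 'BEGIN { printf "%.1f GB\n", 434.98 - 70.4 }'
# ~364.6 GB of block storage with no corresponding files
```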
And yet the local disk really is full:
[root@klempa2 ec2-user]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 40G 6,2G 32G 17% /
tmpfs 7,3G 0 7,3G 0% /dev/shm
/dev/xvdf1 99G 86G 13G 88% /data/1
cm_processes 7,3G 53M 7,3G 1% /var/run/cloudera-scm-agent/process
/dev/xvdg1 99G 60M 99G 1% /data/2
Now let's examine /data/1:
[root@klempa2 ec2-user]# cd /data/1/
[root@klempa2 1]# du -sh *
86G dfs
16K lost+found
156K yarn
[root@klempa2 1]# cd dfs/
[root@klempa2 dfs]# du -sh *
86G dn
19M jn
17M nn
[root@klempa2 dfs]# cd dn
[root@klempa2 dn]# du -sh *
86G current
4,0K in_use.lock
[root@klempa2 dn]# cd current/
[root@klempa2 current]# du -sh *
86G BP-1704656809-172.31.9.84-1474348946517
4,0K VERSION
[root@klempa2 current]# cd BP-1704656809-172.31.9.84-1474348946517/
[root@klempa2 BP-1704656809-172.31.9.84-1474348946517]# du -sh *
12G current
0 RollingUpgradeInProgress
4,0K scanner.cursor
4,0K tmp
74G trash
[root@klempa2 BP-1704656809-172.31.9.84-1474348946517]# cd trash/
[root@klempa2 trash]# du -sh *
73G finalized
1,6G rbw
[root@klempa2 trash]# cd finalized/
[root@klempa2 finalized]# du -sh *
73G subdir0
[root@klempa2 finalized]# cd subdir0/
[root@klempa2 subdir0]# du -sh *
508K subdir16
350M subdir17
351M subdir18
230M subdir19
87M subdir20
108M subdir21
9,9M subdir22
106M subdir23
9,1M subdir24
25M subdir25
31M subdir26
13M subdir27
7,3M subdir28
674M subdir31
10G subdir32
7,6G subdir33
6,6G subdir34
6,8G subdir35
7,0G subdir36
13G subdir37
9,3G subdir38
8,6G subdir39
2,5G subdir40
1,4M subdir41
380K subdir42
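The `trash` directory under the block pool (together with the `RollingUpgradeInProgress` marker above) is created by the rolling-upgrade machinery: while an upgrade is pending, deleted block files are moved into `trash` instead of being removed, and they still count as DFS Used. If the upgrade was never finalized, `hdfs dfsadmin -rollingUpgrade query` should show it pending, and `hdfs dfsadmin -rollingUpgrade finalize` normally lets the DataNodes purge `trash`. On this node, trash holds the bulk of the space:

```shell
# Of the 86 GB under dfs/dn on this node, the rolling-upgrade trash
# holds 74 GB, per the du output above:
awk 'BEGIN { printf "%.0f%%\n", 100 * 74 / 86 }'
# ~86% of the node's block storage is trash, not live blocks
```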
I'm not sure why hdfs dfsadmin -report shows ~435 GB used while hdfs dfs -du -s -h shows only ~70 GB even counting replicas. My suspicion is that the rest of the space is held by the HDFS trash. Try running hadoop fs -expunge and see if that gets your space back.
Done that; the problem persists.
Let's sync up at the next lab time to take a look. I'm not really sure what is going on yet.
If @mfernest has a chance to look at this, it would be helpful. I'm a bit puzzled. Both hdfs fsck and the NN web UI show 435 GB used, yet I can't find anywhere in the filesystem where this space is actually consumed. There are blocks on the DataNodes, but it's not clear where their corresponding files are.
We did discover there was an incomplete rolling upgrade in progress, but even after completing that it hasn't released any blocks. We tried restarting the HDFS service, but no success. I'm at a bit of a loss for what could be using this space.
Is there any chance HDFS snapshots are involved? That would explain storage being consumed without any evidence of it in a filesystem listing.
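Snapshots should be easy to rule in or out — a sketch assuming a stock Hadoop CLI:

```shell
# List every directory on which snapshots may exist; empty output
# means no snapshots can be holding on to deleted blocks.
hdfs lsSnapshottableDir

# fsck can also include snapshot paths in its block accounting:
hdfs fsck / -includeSnapshots
```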