
RangeServer shutdown sequence problem #357

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hypertable went down in Zvents production.  Errors were being thrown in the
RangeServer logs (motherlode010):

1259945755 INFO Hypertable.RangeServer :
(/usr/src/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:2128)
Memory Usage: 1043909075 bytes
1259945775 INFO Hypertable.RangeServer :
(/usr/src/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:2128)
Memory Usage: 1043909075 bytes
1259945781 ERROR Hypertable.RangeServer :
(/usr/src/hypertable/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:109)
DFS read error (amount=65536, fd=7420231) : java.io.IOException: Could not
obtain block: blk_-4670261240677674535_1028923
file=/hypertable/tables/change_logs/default/89254DFC45DAC3F734E3A2E2/cs1
1259945781 ERROR Hypertable.RangeServer :
(/usr/src/hypertable/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:109)
DFS read error (amount=65536, fd=7420231) : java.io.IOException: Could not
obtain block: blk_-4670261240677674535_1028923
file=/hypertable/tables/change_logs/default/89254DFC45DAC3F734E3A2E2/cs1
1259945781 ERROR Hypertable.RangeServer : fetch_next_block_readahead
(/usr/src/hypertable/src/cc/Hypertable/RangeServer/CellStoreScannerIntervalReadahead.cc:264):
Error reading cell store
(/hypertable/tables/change_logs/default/89254DFC45DAC3F734E3A2E2/cs17) block:
Hypertable::Exception: Error reading 26 bytes from DFS fd 7420231 - DFS BROKER i/o error
   at virtual size_t Hypertable::DfsBroker::Client::read(int32_t, void*,
size_t) (/usr/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:261)
   at size_t Hypertable::ClientBufferedReaderHandler::read(void*, size_t)
(/usr/src/hypertable/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:158):
 java.io.IOException: Could not obtain block: blk_-4670261240677674535_1028923
file=/hypertable/tables/change_logs/default/89254DFC45DAC3F734E3A2E2/cs1
1259945781 ERROR Hypertable.RangeServer : create_scanner
(/usr/src/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:700):
Hypertable::Exception:  - DFS BROKER i/o error
  at virtual Hypertable::CellListScanner*
Hypertable::Range::create_scanner(Hypertable::ScanContextPtr&)
(/usr/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:315)
        at virtual Hypertable::CellListScanner*
Hypertable::AccessGroup::create_scanner(Hypertable::ScanContextPtr&)
(/usr/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:212):
Problem creating scanner on access group
change_logs[..artist_1156818_1248339341.98958](default)
                at bool
Hypertable::CellStoreScannerIntervalReadahead<IndexT>::fetch_next_block_readahead()
[with IndexT = Hypertable::CellStoreBlockIndexMap<unsigned int>]
(/usr/src/hypertable/src/cc/Hypertable/RangeServer/CellStoreScannerIntervalReadahead.cc:267):
Error reading 26 bytes from DFS fd 7420231
  at virtual size_t Hypertable::DfsBroker::Client::read(int32_t, void*,
size_t) (/usr/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:261):
Error reading 26 bytes from DFS fd 7420231
 at size_t Hypertable::ClientBufferedReaderHandler::read(void*, size_t)
(/usr/src/hypertable/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:158):
 java.io.IOException: Could not obtain block: blk_-4670261240677674535_1028923
file=/hypertable/tables/change_logs/default/89254DFC45DAC3F734E3A2E2/cs1

Then before that, there were some errors in the DfsBroker log:

Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove
INFO: Removing file
'/hypertable/tables/hourly_metrics/ymal_view/159801F095F0849995E49E77/cs0
Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove
SEVERE: I/O exception while removing file
'/hypertable/tables/hourly_metrics/ymal_view/159801F095F0849995E49E77/cs0'
- java.io.IOException: Problem deleting file
'/hypertable/tables/hourly_metrics/ymal_view/159801F095F0849995E49E77/cs0'
Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove
INFO: Removing file
'/hypertable/tables/hourly_metrics/featured_events_click/B489CD14EBDC8F227FFB6DDE/cs12
Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove
SEVERE: I/O exception while removing file
'/hypertable/tables/hourly_metrics/featured_events_click/B489CD14EBDC8F227FFB6DDE/cs12'
- java.io.IOException: Problem deleting file
'/hypertable/tables/hourly_metrics/featured_events_click/B489CD14EBDC8F227FFB6DDE/cs12'
Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove
INFO: Removing file
'/hypertable/tables/hourly_metric_by_partners/nearby_click/68666E3DADB270C0B2D653E0/cs1
Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove
SEVERE: I/O exception while removing file
'/hypertable/tables/hourly_metric_by_partners/nearby_click/68666E3DADB270C0B2D653E0/cs1'
- java.io.IOException: Problem deleting file
'/hypertable/tables/hourly_metric_by_partners/nearby_click/68666E3DADB270C0B2D653E0/cs1'
Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove
INFO: Removing file
'/hypertable/tables/hourly_metrics/recently_added_events_click/39025DD5C6E2371EA6572E40/cs8
Dec 4, 2009 12:01:14 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Remove

I shut down Hypertable and then ran Hadoop fsck, which showed the system to
be healthy:

................................................................................
....................
.........................................................Status: HEALTHY
 Total size:                     3322189130623 B (Total open files size: 67368481 B)
 Total dirs:                     261978
 Total files:                    496457 (Files currently being written: 13)
 Total blocks (validated):       493547 (avg. block size 6731251 B) (Total open file blocks (not validated): 13)
 Minimally replicated blocks:    493547 (100.0 %)
 Over-replicated blocks:         0 (0.0 %)
 Under-replicated blocks:        0 (0.0 %)
 Mis-replicated blocks:          0 (0.0 %)
 Default replication factor:     3
 Average block replication:      3.0000122
 Corrupt blocks:                 0
 Missing replicas:               0 (0.0 %)
 Number of data-nodes:           9
 Number of racks:                1

When I tried to bring the system back up, the RangeServer on motherlode010
was having trouble replaying the root commit log:

Dec 4, 2009 10:00:33 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Mkdirs
INFO: Making directory '/hypertable/servers/10.0.30.213_38060/log/root/'
Dec 4, 2009 10:00:33 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Create
INFO: Creating file '/hypertable/servers/10.0.30.213_38060/log/root/8'
handle = 10
Dec 4, 2009 10:01:14 AM org.hypertable.AsyncComm.IOHandler DeliverEvent
INFO: [/127.0.0.1:33488 ; Fri Dec 04 10:01:14 PST 2009] Connection Established
Dec 4, 2009 10:02:24 AM org.hypertable.AsyncComm.IOHandler DeliverEvent
INFO: [/127.0.0.1:33495 ; Fri Dec 04 10:02:24 PST 2009] Connection Established
Dec 4, 2009 10:03:38 AM org.hypertable.AsyncComm.IOHandler DeliverEvent
INFO: [/127.0.0.1:33513 ; Fri Dec 04 10:03:38 PST 2009] Connection Established
Dec 4, 2009 10:04:49 AM org.hypertable.AsyncComm.IOHandler DeliverEvent
INFO: [/127.0.0.1:33535 ; Fri Dec 04 10:04:49 PST 2009] Connection Established
Dec 4, 2009 10:05:33 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Create
SEVERE: I/O exception while creating file
'/hypertable/servers/10.0.30.213_38060/log/root/8' -
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
create file /hypertable/servers/10.0.30.213_38060/log/root/8 for
DFSClient_-367911170 on client 10.0.30.213, because this file is already
being created by DFSClient_11706822 on 10.0.30.213
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1064)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:977)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:377)
        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

A listing of the /hypertable/servers/10.0.30.213_38060/log/root/ directory
shows:

[zvents@motherlode010 hypertable]$ hadoop fs -ls /hypertable/servers/10.0.30.213_38060/log/root/
Found 9 items
-rw-r--r--   3 zvents supergroup        452 2009-09-22 23:19 /hypertable/servers/10.0.30.213_38060/log/root/0
-rw-r--r--   3 zvents supergroup        289 2009-09-28 02:15 /hypertable/servers/10.0.30.213_38060/log/root/1
-rw-r--r--   3 zvents supergroup        410 2009-09-28 02:20 /hypertable/servers/10.0.30.213_38060/log/root/2
-rw-r--r--   3 zvents supergroup        440 2009-09-28 02:47 /hypertable/servers/10.0.30.213_38060/log/root/3
-rw-r--r--   3 zvents supergroup        438 2009-09-28 02:54 /hypertable/servers/10.0.30.213_38060/log/root/4
-rw-r--r--   3 zvents supergroup        440 2009-09-28 03:14 /hypertable/servers/10.0.30.213_38060/log/root/5
-rw-r--r--   3 zvents supergroup        439 2009-09-28 03:58 /hypertable/servers/10.0.30.213_38060/log/root/6
-rw-r--r--   3 zvents supergroup    6271630 2009-09-28 04:10 /hypertable/servers/10.0.30.213_38060/log/root/7
-rw-r--r--   3 zvents supergroup          0 2009-10-17 14:48 /hypertable/servers/10.0.30.213_38060/log/root/8

It appears that we may have been bitten by the fsync problem, but our
shutdown sequence may also be at fault.  When I did a 'cap stop', some of the
RangeServers didn't shut down properly.  The attached RangeServer log file
from motherlode010 shows that the connection to the DFS broker dies before
the ROOT commit log is properly closed.
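
For context, here is a minimal sketch of the ordering the shutdown sequence
should guarantee: sync and close the ROOT commit log while the DFS broker
connection is still alive, and only then drop that connection.  The class and
function names below are hypothetical stand-ins, not the actual RangeServer code:

    // Hypothetical sketch of the intended shutdown ordering; not the real
    // Hypertable RangeServer implementation.
    #include <memory>

    struct CommitLog {               // stand-in for the ROOT commit log writer
      void sync()  { /* flush any buffered log entries to the DFS */ }
      void close() { /* write the trailer and release the log file handle */ }
    };

    struct DfsClient {               // stand-in for the DfsBroker client connection
      void disconnect() { /* tear down the broker connection */ }
    };

    void shutdown_range_server(const std::shared_ptr<CommitLog> &root_log,
                               const std::shared_ptr<DfsClient> &dfs) {
      // 1. Quiesce updates so nothing new is appended to the ROOT commit log.
      // 2. Sync and close the log while the DFS broker is still reachable.
      root_log->sync();
      root_log->close();
      // 3. Only after the log is durable, drop the broker connection.
      dfs->disconnect();
    }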

Original issue reported on code.google.com by nuggetwh...@gmail.com on 5 Dec 2009 at 12:57

Attachments:

GoogleCodeExporter commented 9 years ago
Also, this was in the DfsBroker log on motherlode010:

Dec 4, 2009 8:59:47 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Open
INFO: Opening file
'/hypertable/tables/change_logs/default/89254DFC45DAC3F734E3A2E2/cs16' bs=0
handle = 7420671
Dec 4, 2009 8:59:47 AM org.hypertable.DfsBroker.hadoop.HdfsBroker Open
INFO: Opening file
'/hypertable/tables/change_logs/default/89254DFC45DAC3F734E3A2E2/cs17' bs=0
handle = 7420672
09/12/04 08:59:47 INFO hdfs.DFSClient: Could not obtain block
blk_-4670261240677674535_1028923 from any node:  java.io.IOException: No live
nodes contain current block
09/12/04 08:59:50 INFO hdfs.DFSClient: Could not obtain block
blk_-4670261240677674535_1028923 from any node:  java.io.IOException: No live
nodes contain current block
09/12/04 08:59:53 INFO hdfs.DFSClient: Could not obtain block
blk_-4670261240677674535_1028923 from any node:  java.io.IOException: No live
nodes contain current block
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
    at org.hypertable.AsyncComm.IOHandlerAccept.run(IOHandlerAccept.java:44)
    at org.hypertable.AsyncComm.Reactor.run(Reactor.java:83)
    at java.lang.Thread.run(Thread.java:619)
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
    at org.hypertable.AsyncComm.IOHandlerAccept.run(IOHandlerAccept.java:44)
    at org.hypertable.AsyncComm.Reactor.run(Reactor.java:83)
    at java.lang.Thread.run(Thread.java:619)

Original comment by nuggetwh...@gmail.com on 6 Dec 2009 at 1:17

GoogleCodeExporter commented 9 years ago
Oddly enough, there are a lot of open & close requests on the ChangeLog cell
stores.  This is the tail of the DfsBroker log file on motherlode010:

INFO: Closing input file
/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs1 handle 45696
Dec 6, 2009 12:34:01 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Open
INFO: Opening file
'/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs0' bs=0 
handle = 45697
Dec 6, 2009 12:34:01 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Close
INFO: Closing input file
/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs0 handle 45697
Dec 6, 2009 12:34:01 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Open
INFO: Opening file
'/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs1' bs=0 
handle = 45698
Dec 6, 2009 12:34:01 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Close
INFO: Closing input file
/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs1 handle 45698
Dec 6, 2009 12:34:06 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Open
INFO: Opening file
'/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs0' bs=0 
handle = 45699
Dec 6, 2009 12:34:06 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Open
INFO: Opening file
'/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs1' bs=0 
handle = 45700
Dec 6, 2009 12:34:06 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Close
INFO: Closing input file
/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs0 handle 45699
Dec 6, 2009 12:34:06 PM org.hypertable.DfsBroker.hadoop.HdfsBroker Close
INFO: Closing input file
/hypertable/tables/change_logs/default/AB2A0D28DE6B77FFDD6C72AF/cs1 handle 45700

Original comment by nuggetwh...@gmail.com on 6 Dec 2009 at 8:35

GoogleCodeExporter commented 9 years ago
The problem alluded to in the previous comment was fixed with this commit:

commit bdca9a1ef9650f5dacf4a3f86a0c460affb2c89f
Author: Sanjit Jhala <sjhalaz@gmail.com>
Date:   Mon Sep 28 16:19:44 2009 -0700

    Fixed CellStoreScanner to turn off readahead mode in case of single row scans.
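
The idea behind that commit, sketched below with hypothetical type names
rather than the actual CellStoreScanner code, is to pick a plain buffered
scanner instead of the readahead scanner whenever the scan spec covers exactly
one row, since a single-row scan touches at most one block and readahead only
ties up an extra DFS file descriptor and broker round trips:

    // Hypothetical sketch of the fix's logic; names do not match the real code.
    #include <memory>
    #include <string>

    struct ScanSpec {
      std::string start_row, end_row;
      bool single_row() const { return start_row == end_row; }
    };

    struct IntervalScanner { virtual ~IntervalScanner() = default; };
    struct BufferedScanner  : IntervalScanner {};  // reads blocks on demand
    struct ReadaheadScanner : IntervalScanner {};  // prefetches blocks, holds an fd open

    std::unique_ptr<IntervalScanner> make_interval_scanner(const ScanSpec &spec) {
      if (spec.single_row())
        return std::make_unique<BufferedScanner>();   // readahead disabled
      return std::make_unique<ReadaheadScanner>();
    }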

Original comment by nuggetwh...@gmail.com on 7 Dec 2009 at 2:16

GoogleCodeExporter commented 9 years ago
This was just because the system ran out of file handles.  We upped the
descriptor limit on the production cluster.
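
For reference, the limit can also be inspected and raised from inside a
process with the standard getrlimit/setrlimit calls.  The sketch below is
illustrative only; the actual fix here was an operational ulimit change on the
cluster, not a code change:

    // Illustrative only: bump the soft RLIMIT_NOFILE up to the hard limit.
    #include <sys/resource.h>
    #include <cstdio>

    bool raise_open_file_limit() {
      struct rlimit rl;
      if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return false;
      rl.rlim_cur = rl.rlim_max;               // raise soft limit to the hard cap
      if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return false;
      std::printf("open file limit: %llu\n", (unsigned long long) rl.rlim_cur);
      return true;
    }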

Original comment by nuggetwh...@gmail.com on 11 Apr 2010 at 11:34