namhnguyen / asterixdb

Automatically exported from code.google.com/p/asterixdb

Incorrect crash recovery behavior when the system crashed in the middle of the first bootstrapping #793

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
#. Incorrect behavior observed by Till

This is the complete NC log that I get:

Jul 15, 2014 11:29:21 AM edu.uci.ics.hyracks.control.nc.NCDriver main
SEVERE: Setting uncaught exception handler edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager@60ee8f91
Jul 15, 2014 11:29:21 AM edu.uci.ics.hyracks.control.nc.NodeControllerService start
INFO: Starting NodeControllerService
Jul 15, 2014 11:29:21 AM edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: Starting Asterix node controller  TAKE NOTE: asterix_node1
Jul 15, 2014 11:29:21 AM edu.uci.ics.asterix.transaction.management.service.logging.LogManager initializeLogAnchor
INFO: log file Id: 1, offset: 0
Jul 15, 2014 11:29:21 AM edu.uci.ics.asterix.transaction.management.service.logging.LogManager initializeLogManager
INFO: LogManager starts logging in LSN: 2147483648
Jul 15, 2014 11:29:21 AM edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: System is in a state: HEALTHY
Jul 15, 2014 11:29:21 AM edu.uci.ics.asterix.transaction.management.resource.PersistentLocalResourceRepository initialize
INFO: Initializing local resource repository ...
edu.uci.ics.hyracks.api.exceptions.HyracksDataException: java.io.FileNotFoundException: /Users/tillw/code/asterix/asterixdb2/asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/clusters/local/working_dir/asterix_root_metadata/asterix_node1_iodevice0/.asterix_root_metadata (No such file or directory)
    at edu.uci.ics.asterix.transaction.management.resource.PersistentLocalResourceRepository.readLocalResource(PersistentLocalResourceRepository.java:305)
    at edu.uci.ics.asterix.transaction.management.resource.PersistentLocalResourceRepository.initialize(PersistentLocalResourceRepository.java:135)
    at edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint.start(NCApplicationEntryPoint.java:87)
    at edu.uci.ics.hyracks.control.nc.NodeControllerService.startApplication(NodeControllerService.java:314)
    at edu.uci.ics.hyracks.control.nc.NodeControllerService.start(NodeControllerService.java:257)
    at edu.uci.ics.hyracks.control.nc.NCDriver.main(NCDriver.java:44)
Caused by: java.io.FileNotFoundException: /Users/tillw/code/asterix/asterixdb2/asterix-installer/target/asterix-installer-0.8.7-SNAPSHOT-binary-assembly/clusters/local/working_dir/asterix_root_metadata/asterix_node1_iodevice0/.asterix_root_metadata (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at edu.uci.ics.asterix.transaction.management.resource.PersistentLocalResourceRepository.readLocalResource(PersistentLocalResourceRepository.java:300)
    ... 5 more

I think the problem arises in NCApplicationEntryPoint.start(...). There 
the recovery manager reports that the system state is not NEW_UNIVERSE, so we 
initialize the localResourceRepository as if it were not a new universe. 
However, the files that the initialization expects to find are not there. 
So it seems that the NEW_UNIVERSE system state actually guarantees less than 
we expect.

#. How does this situation occur?

The following explains how the situation can happen.
------------------------------
When an Asterix instance starts for the first time (meaning the system state 
is NEW_UNIVERSE), the following steps (pertaining to recovery, checkpointing, 
and the persistent local resource repository) are executed in the 
NodeControllerService.start() method. 

1. RecoveryManager checks whether the system state is NEW_UNIVERSE or not. 
Since it is the first bootstrapping, no checkpoint file has been created yet, 
so the state is considered NEW_UNIVERSE. (The NEW_UNIVERSE state is determined 
by whether a checkpoint file exists or not.)

2. Since the system state is NEW_UNIVERSE, the recovery manager creates the 
first checkpoint.

(Steps 1 and 2 are executed in the NCApplicationEntryPoint.start() method.)

3. The node where the recovery manager created the first checkpoint file is 
registered with the CC.

4. The persistent local resource repository is initialized (this is where the 
“.asterix_root_metadata” file is created).

5. The metadata bootstrapping (i.e., creating the metadata dataverse, since 
the system state is NEW_UNIVERSE) is executed if the node is the metadata node.

6. MetadataBootStrap.startDDLRecovery() is called. This method takes care of 
any incomplete DDL operations. 

(Steps 4, 5, and 6 are executed in 
NCApplicationEntryPoint.notifyStartupComplete().)

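The system may crash after step 2 but before step 4 is completed. In that 
case the checkpoint file already exists, while the “.asterix_root_metadata” 
file may not have been created yet. On the next start the system state is 
therefore no longer NEW_UNIVERSE, the local resource repository tries to read 
the missing file, and the log that you showed is produced. 
(Once the system has completed the first bootstrapping successfully, this 
situation will not occur.)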
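
The crash window described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the actual AsterixDB code: the class, the method names, and the "checkpoint" file name are hypothetical stand-ins, and only the “.asterix_root_metadata” name and the ordering of steps 1, 2, and 4 are taken from this issue. Steps 3, 5, and 6 are left out because they do not affect the failure.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical model of the first-bootstrap sequence; not the real AsterixDB classes.
final class BootstrapCrashSketch {
    enum SystemState { NEW_UNIVERSE, NOT_NEW_UNIVERSE }

    static final String CHECKPOINT = "checkpoint";                // assumed file name
    static final String ROOT_METADATA = ".asterix_root_metadata"; // name from the log

    // Step 1: the state depends only on whether a checkpoint file exists.
    static SystemState getSystemState(Path dir) {
        return Files.exists(dir.resolve(CHECKPOINT))
                ? SystemState.NOT_NEW_UNIVERSE
                : SystemState.NEW_UNIVERSE;
    }

    // Step 4: create the root metadata file in a new universe, otherwise read it.
    static void initializeRepository(Path dir, boolean newUniverse) throws IOException {
        Path root = dir.resolve(ROOT_METADATA);
        if (newUniverse) {
            Files.createFile(root);
        } else {
            // Throws FileNotFoundException if the file was never created.
            try (FileInputStream in = new FileInputStream(root.toFile())) {
                in.read();
            }
        }
    }

    // One node start. If crashAfterCheckpoint is true, stop right after step 2.
    static void start(Path dir, boolean crashAfterCheckpoint) throws IOException {
        SystemState state = getSystemState(dir);          // step 1
        if (state == SystemState.NEW_UNIVERSE) {
            Files.createFile(dir.resolve(CHECKPOINT));    // step 2
        }
        if (crashAfterCheckpoint) {
            return;                                       // simulated crash
        }
        initializeRepository(dir, state == SystemState.NEW_UNIVERSE); // step 4
    }

    // Crash between steps 2 and 4, then restart: returns true if the restart fails.
    static boolean restartFailsAfterCrash() {
        try {
            Path dir = Files.createTempDirectory("bootstrap-sketch");
            start(dir, true);
            start(dir, false);
            return false;
        } catch (FileNotFoundException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Uninterrupted first start followed by a restart: returns true if both succeed.
    static boolean cleanBootstrapSucceeds() {
        try {
            Path dir = Files.createTempDirectory("bootstrap-sketch");
            start(dir, false);
            start(dir, false);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("restart fails after crash: " + restartFailsAfterCrash());
        System.out.println("clean bootstrap succeeds: " + cleanBootstrapSucceeds());
    }
}
```

In this model the restart after the simulated crash fails with a FileNotFoundException, while an uninterrupted first start followed by a restart succeeds. That matches the observation that the existence of a checkpoint file alone does not imply that the root metadata file exists.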

Original issue reported on code.google.com by kiss...@gmail.com on 23 Jul 2014 at 5:40

GoogleCodeExporter commented 9 years ago

Original comment by kiss...@gmail.com on 23 Jul 2014 at 5:40

GoogleCodeExporter commented 9 years ago

Original comment by kiss...@gmail.com on 25 Jul 2014 at 8:21