orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0

No proper High availability on a distributed setup #9311

Closed bulanan closed 3 years ago

bulanan commented 4 years ago

OrientDB Version: 3.0.30

Java Version: 1.8

OS: official Docker containers running on CentOS Linux

Expected behavior

I run a 3-node master-master setup with newNodeStrategy set to dynamic, and our application uses this database. When I stop one of the containers (the one with the most connections) and bring it back up, I expect the application to keep working normally and the live query connections to remain open.

Actual behavior

The application gets many "live query disconnection" exceptions and cannot recover from them. I also get the impression that the data is inconsistent.

I'd like to know whether something is wrong with the configuration or whether this is a bug. If it is neither, how would you define "high availability"?

Steps to reproduce

  1. Set up 3 nodes (configuration and docker-compose attached below).
  2. Run an app (using the Java API) that connects to the cluster, loads data, and performs live queries on it.
  3. Stop one node.
  4. Start the node again.

[default-distributed-db-config.txt.txt](https://github.com/orientechnologies/orientdb/files/4825707/default-distributed-db-config.txt.txt)

docker-compose.zip

andrii0lomakin commented 4 years ago

Hi @bulanan, thank you for the report. Could you provide the Java application you used to test the system?

bulanan commented 4 years ago

Hi @laa, the application code is very large and complicated, so I can't provide it here. It is a web application that basically does the following:

  1. Inserts/updates/deletes records every second.
  2. Opens live query sessions with queries over the objects in the system (there can be several live queries at a time, around 20) and observes them for as long as a user is visiting a screen.
  3. If a user leaves the screen, unsubscribes that screen's live query.
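
The subscribe-while-visible pattern described above could be tracked roughly as follows. This is only a sketch under assumptions: the class name, the screen identifiers, and the use of plain `Runnable` cancel handles are hypothetical and are not taken from the application. With the OrientDB client, the handle would presumably be something like `monitor::unSubscribe` on the `OLiveQueryMonitor` returned by `session.live(...)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: screen ids and cancel handles are hypothetical, not OrientDB types. */
final class ScreenSubscriptionsSketch {
    private final Map<String, List<Runnable>> cancelHandlesByScreen = new HashMap<>();

    /** Remembers how to cancel a live query that was opened for the given screen. */
    synchronized void register(String screenId, Runnable cancelHandle) {
        cancelHandlesByScreen.computeIfAbsent(screenId, id -> new ArrayList<>()).add(cancelHandle);
    }

    /** Cancels every live query of the screen and returns how many were cancelled. */
    int leave(String screenId) {
        List<Runnable> handles;
        synchronized (this) {
            handles = cancelHandlesByScreen.remove(screenId);
        }
        if (handles == null) {
            return 0;
        }
        handles.forEach(Runnable::run); // run outside the lock; handles may do network I/O
        return handles.size();
    }

    /** Number of live queries that are still registered across all screens. */
    synchronized int openCount() {
        return cancelHandlesByScreen.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        ScreenSubscriptionsSketch subs = new ScreenSubscriptionsSketch();
        subs.register("dashboard", () -> System.out.println("unsubscribed dashboard query"));
        subs.register("settings", () -> System.out.println("unsubscribed settings query"));
        System.out.println("cancelled: " + subs.leave("dashboard"));
        System.out.println("still open: " + subs.openCount());
    }
}
```

Keeping the handles per screen means a reconnect routine could also walk the same registry to find out which live queries need to be reopened.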

I can try to create a simpler test to reproduce it.

bulanan commented 4 years ago

Hi @laa,

The test below reproduces the problem. A short description: it runs for around 15 minutes and creates a new element in the MyClass table every second. At the same time there is a live query on this table, and an observable (we use reactive Java, RxJava 2) that emits each time the live query emits a value. The record creation is wrapped in a try-catch.

While the test was running, I ran "docker-compose stop odb1" followed by "docker-compose start odb1"; I chose the node with the most connections. The output after restarting the node is shown further below.

import com.orientechnologies.common.exception.OException;
import com.orientechnologies.orient.core.db.*;
import com.orientechnologies.orient.core.db.document.ODatabaseDocument;
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.orientechnologies.orient.core.record.OElement;
import com.orientechnologies.orient.core.sql.executor.OResult;
import io.reactivex.Completable;
import io.reactivex.Observable;
import io.reactivex.observers.BaseTestConsumer;
import io.reactivex.observers.TestObserver;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// The original post omits the class declaration and the fields below.
// The class name and the URL are placeholders; "testDb" is the name shown in the log output.
public class LiveQueryRestartTest {
    private static final String dbUrlRemote = "remote:localhost"; // placeholder, not from the post
    private static final String dbName = "testDb";
    private OrientDB dbClient;

    @Before
    public void setUp() {
        if (dbClient != null) {
            dbClient.close();
        }
        dbClient = new OrientDB(dbUrlRemote, "root", "root", OrientDBConfig.defaultConfig());
        dbClient.createIfNotExists(dbName, ODatabaseType.PLOCAL);
    }

    @After
    public void tearDown() {
        if (dbClient != null) {
            System.out.println("dropping db: " + dbName);
            dbClient.drop(dbName);
            dbClient.close();
            dbClient = null;
        }
    }

    @Test
    public void testLiveQuery() throws Exception {
        try (ODatabaseSession session = dbClient.open(dbName, "admin", "admin")) {
            OClass oClass = session.createClass("MyClass");
            oClass.createProperty("num", OType.LONG);

            // Bridge the live query callbacks into an RxJava Observable.
            Observable<OElement> elements = Observable.create(emitter -> {
                OLiveQueryMonitor monitor = session.live("select from MyClass", new OLiveQueryResultListener() {
                    @Override
                    public void onCreate(ODatabaseDocument database, OResult data) {
                        emitter.onNext(data.toElement());
                    }

                    @Override
                    public void onUpdate(ODatabaseDocument database, OResult before, OResult after) {
                        emitter.onNext(after.toElement());
                    }

                    @Override
                    public void onDelete(ODatabaseDocument database, OResult data) {
                        // Deletions are not part of this test.
                    }

                    @Override
                    public void onError(ODatabaseDocument database, OException exception) {
                        emitter.onError(exception);
                    }

                    @Override
                    public void onEnd(ODatabaseDocument database) {
                        emitter.onComplete();
                    }
                });
                // Unsubscribe the live query when the observer is disposed.
                emitter.setCancellable(monitor::unSubscribe);
            });
            TestObserver<OElement> testObserver = elements
                    .doOnNext(x -> System.out.println("live query emitted value: " + x))
                    .doOnError(x -> System.out.println("got error in live query: " + x))
                    .test();

            session.begin();
            // Create exactly 1000 records, one per second, to match the assertion below.
            for (int i = 1; i <= 1000; i++) {
                try {
                    OElement element = session.newElement("MyClass");
                    element.setProperty("num", (long) i);
                    element.save();
                    session.commit();
                    Thread.sleep(1000);
                    System.out.println("saved element");
                } catch (Exception e) {
                    // A failed save is only logged; the loop continues with the next record.
                    System.out.println(e.getMessage());
                    System.out.println(e.getCause());
                }
            }

            // Wait up to 15 minutes for all 1000 live query events.
            testObserver.awaitCount(1000, BaseTestConsumer.TestWaitStrategy.SLEEP_10MS, 1000 * 60 * 15)
                    .assertValueCount(1000)
                    .assertValueAt(0, e -> e.getProperty("num").equals(1L));

            Completable.fromAction(testObserver::dispose)
                    .blockingAwait();
        }
    }
}
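
Note that in the test above a save that throws is only printed, so that record is never written and the final count can fall short of 1000. One way to make the writer more tolerant of a node restart might be a bounded retry around the save. The helper below is a minimal sketch using only the JDK; `RetrySketch`, `retry`, and the attempt and delay values are hypothetical and not part of the OrientDB API.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Sketch only: a generic bounded retry, not an OrientDB facility. */
final class RetrySketch {

    /**
     * Calls the action until it succeeds or maxAttempts calls have failed,
     * sleeping delayMillis between attempts. Rethrows the last failure.
     */
    static <T> T retry(Callable<T> action, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // give the cluster time to recover
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated save: fails twice with an I/O error, then succeeds.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Error on connecting");
            }
            return "saved element";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A blind retry is only safe if the failed attempt did not reach the server; otherwise the same record could be written twice, so a real application would likely need an idempotency check as well.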

Output before restarting the node:

saved element
live query emitted value: MyClass#28:15{num:46} v1
saved element
live query emitted value: MyClass#30:15{num:47} v1
saved element
live query emitted value: MyClass#31:15{num:48} v1
saved element
live query emitted value: MyClass#28:16{num:49} v1
saved element
live query emitted value: MyClass#30:16{num:50} v1
saved element
live query emitted value: MyClass#31:16{num:51} v1
saved element
live query emitted value: MyClass#28:17{num:52} v1
saved element
live query emitted value: MyClass#30:17{num:53} v1

Output after restarting the node. Saving a record failed, and the live query stopped emitting values:

Jun 28, 2020 11:19:55 AM com.orientechnologies.common.log.OLogManager log
INFO: Caught Network I/O errors on 10.55.136.177:2424/testDb, trying an automatic reconnection... (error: null)
Error during saving of record with rid #-1:-1
    DB name="testDb"
com.orientechnologies.common.io.OIOException
got error in live query: com.orientechnologies.orient.core.exception.ODatabaseException: Live query disconnection 
Error during saving of record with rid #-1:-1
    DB name="testDb"
com.orientechnologies.common.io.OIOException: Error on connecting to 172.25.0.2:2424/testDb
saved element
saved element
saved element
saved element
saved element
saved element
saved element
saved element
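
The log above shows the live query terminating with "Live query disconnection" while saves resume after the automatic reconnection. A possible client-side mitigation, sketched here under assumptions, is to open a new subscription whenever the error callback fires. `LiveSource`, `Resubscriber`, and `FakeSource` are hypothetical stand-ins so that the example runs with the JDK alone; they are not OrientDB or RxJava types, and the sketch says nothing about how the OrientDB client itself should behave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch only: none of these types belong to OrientDB or RxJava. */
final class LiveQueryResubscribeSketch {

    /** Hypothetical stand-in for whatever opens a live query. */
    interface LiveSource {
        /** Opens a subscription; the returned Runnable cancels it. */
        Runnable subscribe(Consumer<String> onNext, Consumer<Exception> onError);
    }

    /** Keeps a subscription alive by opening a new one after each error. */
    static final class Resubscriber {
        private final LiveSource source;
        private final Consumer<String> onNext;
        private final int maxSubscriptions;
        private int subscriptions = 0;
        private Runnable cancel = () -> { };

        Resubscriber(LiveSource source, Consumer<String> onNext, int maxSubscriptions) {
            this.source = source;
            this.onNext = onNext;
            this.maxSubscriptions = maxSubscriptions;
        }

        void start() {
            subscriptions++;
            cancel = source.subscribe(onNext, this::onError);
        }

        private void onError(Exception e) {
            cancel.run(); // drop the dead subscription
            if (subscriptions < maxSubscriptions) {
                start(); // a real client would wait (back off) before retrying
            }
        }

        int subscriptionCount() {
            return subscriptions;
        }
    }

    /** In-memory source used to simulate a node restart. */
    static final class FakeSource implements LiveSource {
        private Consumer<String> onNext;
        private Consumer<Exception> onError;

        @Override
        public Runnable subscribe(Consumer<String> onNext, Consumer<Exception> onError) {
            this.onNext = onNext;
            this.onError = onError;
            return () -> { };
        }

        void emit(String value) {
            onNext.accept(value);
        }

        void disconnect() {
            onError.accept(new RuntimeException("Live query disconnection"));
        }
    }

    public static void main(String[] args) {
        FakeSource source = new FakeSource();
        List<String> received = new ArrayList<>();
        Resubscriber resubscriber = new Resubscriber(source, received::add, 3);
        resubscriber.start();
        source.emit("num:1");
        source.disconnect(); // simulated node restart
        source.emit("num:2"); // delivered through the second subscription
        System.out.println(received + " after " + resubscriber.subscriptionCount() + " subscriptions");
    }
}
```

Resubscribing does not recover events that were emitted while the client was disconnected, so an application that needs every change would probably also have to re-run a regular query after each reconnect.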
bulanan commented 4 years ago

Hi @laa, any update on this?