orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0
4.73k stars 869 forks source link

ODB2.2.20 NullPointerException during graph.shutdown #7576

Closed pinakdeshpande closed 3 years ago

pinakdeshpande commented 7 years ago

OrientDB Version: 2.2.20

Java Version: 1.8.131

OS: Linux

Expected behavior

Should be able to commit and shutdown the OrientGraph after creating two vertex and an edge between it.

Actual behavior

Getting null pointer exception during shutdown

Thread 4 " java.lang.NullPointerException
        at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.makeActive(OrientBaseGraph.java:362)
        at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.shutdown(OrientBaseGraph.java:1145)
        at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.shutdown(OrientBaseGraph.java:1138)
        at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.shutdown(OrientBaseGraph.java:1131)

Just to report this nullPointerException ODB took 30+ mins and soon after that started getting below error

        java.lang.IllegalStateException: Pool is closed
        at com.orientechnologies.orient.core.db.OPartitionedDatabasePool.checkForClose(OPartitionedDatabasePool.java:370)
        at com.orientechnologies.orient.core.db.OPartitionedDatabasePool.acquire(OPartitionedDatabasePool.java:176)
        at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.<init>(OrientBaseGraph.java:143)
        at com.tinkerpop.blueprints.impls.orient.OrientTransactionalGraph.<init>(OrientTransactionalGraph.java:77)
        at com.tinkerpop.blueprints.impls.orient.OrientGraph.<init>(OrientGraph.java:135)
        at com.tinkerpop.blueprints.impls.orient.OrientGraphFactory$1.getGraph(OrientGraphFactory.java:84)
        at com.tinkerpop.blueprints.impls.orient.OrientGraphFactory.getTx(OrientGraphFactory.java:221)
        at com.gc.nmxstore.storage.orientdb.OrientDBGraphConnector.getConnection(OrientDBGraphConnector.java:61

Steps to reproduce

Create two vertex Service and ServiceStatistics Service has ServiceId(long) and ServiceName(String) properties ServiceStatistics also has ServiceId(long) and ServiceName(String) properties plus 15 properties for storing actual statistics. Service Id & Name is unique. Gather stats for 100K services every 5 mins and store them in ServiceStatistics vertex creating an edge to unique Service to which it belongs to. One transaction does create ServiceStatistics, Service(if it is not yet created) and a edge between them. The transaction is committed and graph is shutDown.

robfrank commented 7 years ago

Can you upgrade to 2.2.24 and re-test? Moreover, to reproduce the problem we need the code. Not your production code, but some code that mimics the production code. The best way is to provide a self-contained main or, better, a test case. Thanks in advance.

pinakdeshpande commented 7 years ago

I will upgrade and let you know

pinakdeshpande commented 7 years ago

Here is the sample code, which inserts around 100K service statistics every 5 mins and creates an Edge towards unique Service vertex to which that stats belong to.

public class OrientDBGraphTestClient {

    private static Log log = new Log("OrientDBTestClient");

    public static Map<String, String> serviceRidMap = new HashMap<String, String>();

    public OrientDBGraphTestClient(){
    }

    public static AtomicInteger atomicCount = new AtomicInteger(1);

    public static void main(String[] args) throws Exception{

        int threadCnt = Integer.parseInt(args[0]);
        OrientDBGraphTestClient client = new OrientDBGraphTestClient();

        System.out.println(Thread.currentThread().getName() + " started @ " + new java.util.Date());
        System.out.println("");
        try {

            Thread[] threads = new Thread[threadCnt];
            for (int i = 0; i < threadCnt; i++) {
                Thread loadStatsThread = new Thread(client.new LoadTask(Integer.parseInt(args[1])));
                loadStatsThread.setName("LoadTask:" + (i + 1));
                loadStatsThread.start();
                threads[i] = loadStatsThread;
            }
        }
        catch(Exception e){
            e.printStackTrace();
        }
    }

    private class LoadTask implements Runnable{

        public int count = 0;
        OrientGraph graph;
        OrientGraphFactory factory;

        public LoadTask(int count){

            factory = new OrientGraphFactory("remote:<hostname>:2424/<dbName>").setupPool(1,2);
            graph = factory.getTx();            

            this.count = count;
        }

        public void run(){
            long start = System.currentTimeMillis();
            try{
                for(int i = 0; i < count; ++ i){
                    storeStatsInDB(graph, atomicCount.incrementAndGet() +"");
                }
            }
            catch(Exception e){
                log.println("Error in LoadTask : " + e.getMessage());
                e.printStackTrace();
            }
            finally {
                graph.shutdown();

                System.out.println(Thread.currentThread().getName() + " loaded: " + count + " services in: "  + (System.currentTimeMillis() - start) + "ms, thread ended @ " + new java.util.Date());
                System.out.println("");
            }
        }
    }

    public Iterable<Vertex> executeQuery(OrientGraph graph, String query) throws Exception{

        OSQLSynchQuery<OrientVertex> qr = new OSQLSynchQuery<OrientVertex>(query);
        return graph.command(qr).execute();
    }

    public String getVertexId(OrientGraph graph, String className, String propertyName, Object propertyValue) throws Exception{

        long starttime = System.currentTimeMillis();

        String query = "select from " + className + " where " + propertyName + " = '" + propertyValue + "'";
        Iterable<com.tinkerpop.blueprints.Vertex> vertices = executeQuery(graph, query);

        if(vertices != null){
            for(Vertex v: vertices){
                if(v.getProperty(propertyName) == null)
                    continue;

                if(propertyValue.toString().equals(v.getProperty(propertyName).toString())){
                    //log.println(className + " already exists with " + propertyName + ": " + propertyValue);
                    //log.println("Exiting getVertexId " + (System.currentTimeMillis() - starttime) + " ms");
                    return v.getId().toString();
                }
            }
        }

        //log.println("Exiting getVertexId " + (System.currentTimeMillis() - starttime) + " ms");
        return null;

    }

    public Vertex getVertex(OrientGraph graph, String rid) throws Exception{

        long starttime = System.currentTimeMillis();

        String query = "select from " + rid;
        Iterable<com.tinkerpop.blueprints.Vertex> vertices = executeQuery(graph, query);

        if(vertices != null){
            for(Vertex v: vertices){
                //log.println("Exiting getVertexByRid " + (System.currentTimeMillis() - starttime) + " ms");
                return v;
            }
        }

        //log.println("Exiting getVertexByRid " + (System.currentTimeMillis() - starttime) + " ms");
        return null;

    }

    private String getServiceVertexRid(OrientGraph graph, String serviceName) throws Exception{

        String serviceRid = getVertexId(graph, "ServiceVertex", "serviceName",serviceName);

        String rid = serviceRidMap.get(serviceName);

        if(serviceRid != null && rid == null){
            serviceRidMap.put(serviceName, serviceRid);         
        }

        return serviceRid;
    }

    public void storeStatsInDB(OrientGraph graph, String id) throws Exception{

        try{
            long start = System.currentTimeMillis();    

            OrientVertex vertex = graph.addVertex("class:ServiceStatisticsVertex");

            Map<String, Object> propertyMap = new HashMap<String, Object>();

            propertyMap.put("serviceName", "Service:" + id);
            propertyMap.put("serviceId", id);
            propertyMap.put("timestamp", System.currentTimeMillis());

            propertyMap.put("prop1", "1000");
            propertyMap.put("prop2", "2000");
            propertyMap.put("prop3", "3000");
            propertyMap.put("prop4", "4000");
            propertyMap.put("prop5", "5000");
            propertyMap.put("prop6", "6000");
            propertyMap.put("prop7", "7000");
            propertyMap.put("prop8", "8000");
            propertyMap.put("prop9", "9000");
            propertyMap.put("prop10", "10000");

            vertex.setProperties(propertyMap);

            OrientVertex serviceV = null;

            String rid = getServiceVertexRid(graph, "Service:" + id);
            if(rid != null)
                serviceV = (OrientVertex)getVertex(graph, rid);

            if(serviceV == null) {
                serviceV = graph.addVertex("class:ServiceVertex");

                Map<String, Object> props = new HashMap<String, Object>();

                props.put("serviceName", "Service:" + id);
                props.put("serviceId", id + "");

                serviceV.setProperties(props);
            }

            OrientEdge e = graph.addEdge("class:" + "ServiceStatsEdge", vertex, serviceV, "ServiceStatsEdge");
            e.setProperty("timestamp", System.currentTimeMillis());

            graph.commit();
        }
        catch(Exception e){
            log.println("Exception :" + e.getMessage());
            e.printStackTrace();
        }
    }
}
pinakdeshpande commented 7 years ago

Our PROD (which we migrated to 2.2.20 last week) ODB crashed which had around 350GB of real time stats data, There were no logs why it crashed, we could start the server but were not able to connect to the DB. So we had to lose entire data and had to migrate to the latest version 2.2.24. I will watch the system for few weeks and see if we are getting same problems reported in #7576 and #7577.

pinakdeshpande commented 7 years ago

We just started getting issue again this time in checkLowDiskSpaceFullCheckpointRequestsAndBackgroundDataFlushExceptions()

Here are the ODB config params

/bin/java -d64 -server -Xms2G -Xmx2G -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Dstorage.diskCache.bufferSize=26624 -Djava.util.logging.config.file=/opt/orientdb/config/orientdb-server-log.properties -Dorientdb.config.file=/opt/orientdb/config/orientdb-server-config.xml -Dorientdb.www.path=/opt/orientdb/www -Dorientdb.build.number=2.2.x@rd3309b6660d38e14390344a1fa98a43ade2e2073; 2017-07-20 15:46:56+0000 -cp /opt/orientdb/lib/orientdb-server-2.2.24.jar:/opt/orientdb/lib/*:/opt/orientdb/plugins/* com.orientechnologies.orient.server.OServerMain

Stack Trace

[ORecordSerializerBinary]Internal server error:

java.lang.NullPointerException
        at com.orientechnologies.orient.server.network.protocol.http.command.post.OServerCommandPostKillDbConnection.doPost(OServerCommandPostKillDbConnection.java:51)
        at com.orientechnologies.orient.server.network.protocol.http.command.post.OServerCommandPostKillDbConnection.execute(OServerCommandPostKillDbConnection.java:38)
        at com.orientechnologies.orient.server.network.protocol.http.ONetworkProtocolHttpAbstract.service(ONetworkProtocolHttpAbstract.java:169)
        at com.orientechnologies.orient.server.network.protocol.http.ONetworkProtocolHttpAbstract.execute(ONetworkProtocolHttpAbstract.java:621)
        at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)
Exception during data flush
java.lang.OutOfMemoryError: GC overhead limit exceeded
$ANSI{green {db=<dbName>}} Exception `1EE8F413` in storage `<dbName>`
com.orientechnologies.orient.core.exception.OStorageException: Error in data flush background thread, please restart database and send full stack trace inside of bug report
        DB name="<dbName>"
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.checkLowDiskSpaceFullCheckpointRequestsAndBackgroundDataFlushExceptions(OAbstractPaginatedStorage.java:4863)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commit(OAbstractPaginatedStorage.java:1568)
        at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:541)
        at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:99)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2891)
        at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2860)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.commit(ONetworkProtocolBinary.java:1399)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:650)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:384)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:207)
        at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
$ANSI{green {db=<dbName>}} Exception `5929AE1F` in storage `<dbName>`
:
pinakdeshpande commented 7 years ago

Attaching jstack output when this error came jstack.txt

andrii0lomakin commented 7 years ago

From the stack trace, it is possible to conclude that it is not low disk space, according to stack trace java.lang.OutOfMemoryError was thrown. Please send generated heap dump for analysis.

pinakdeshpande commented 7 years ago

Attaching the jvm heap dumps before we restarted the DB. The errors shows that the GC is not able recover the objects from heap as its GC Overhead limit exceeded error in logs jstack_19924.txt jmap_heap_19924.txt jmap_histo_19924.txt

Just to reiterate when this issue (#7576 was reported) we were on 2.2.20 but we have already upgraded to 2.2.24 and I'm reporting issues found on 2.2.24 from last 2-3 days.

pinakdeshpande commented 7 years ago

@laa Did you make any analysis on the heap dumps posted before? Do you need any more information?

An update, our server (2.2.24) was pretty unstable with the load that is mentioned in this thread. We worked on multiple options trying to make it stable. To start with we started monitoring the server with JVisualVM with remote JMX connection and just within 1hour or so it failed with with java.lang.OutOfMemoryError: GC overhead limit exceeded. Then we disabled the WAL as mentioned in Performance Tuning section in Orient DB Docs, which claimed that it will improve the write speed by lot of margin. But we didnt observe that, hence enabled it again.

Following is our current configs

  1. Increased heap from 2GB to 4GB. (with 26GB as disk space)
  2. WAL is enabled
  3. GC type was ParallelGC before, changed it to G1GC.
  4. Increased a RAM on the server machine to give sufficient memory for OS and other processes to run normally. With these configs, it is better than previous but still the write speed is fluctuating a lot. For 100K service statistics mentioned in the sample code above, its taking anywhere between 1:30 mins to 8 mins and for our requirements anything more than 2.30 or 3 mins is going to be problem.

Could you please suggest any other configs which help stabilize the server reads and writes.

pinakdeshpande commented 7 years ago

Would also really like to know what was the issue with 2.2.20 version and why it was failing with "NullPointerException during graph.shutdown", which is still unknown. What was the issue which caused this, and was it fixed in 2.2.24, if yes, can you please point us to that BUG?

andrii0lomakin commented 7 years ago

Hi @pinakdeshpande , I have looked at your heap dump report and found that all your memory is filled by character objects. Could you find gc roots of this character objects , you can use visualvm for that https://cdivilly.wordpress.com/tag/visualvm/ or you can send heap dump to us so we will analyze it ourselves. I suppose that problem with your performance is big GC overhead caused by big consumption of memory during query processing but we need to check that.

pinakdeshpande commented 7 years ago

I dont have a jvm core dump of the server when we took heap dump which was attached above. Here are today's reports. I also have taken core dump (.hprof) but that file is almost 3GB. Let me know how to share that file with you. I tried to open that with jVisialVM but it didnt open in it. I have MAT tool to analyze the core dump. Attaching some screen shots from it. Let me know what else would you need.

jmap_histo_23266.txt jstack_23266.txt jmap_heap_23266.txt histogram directbytebuffer dominator_tree char_array

taburet commented 7 years ago

Hi @pinakdeshpande, you may use Google Drive to upload the heap dump.

taburet commented 7 years ago

@pinakdeshpande I have ran the provided sample code for about 1.5 hours with 32 threads and 1GB of the heap for server. No memory or CPU consumption anomalies found. Seems like the problem is in some combination of different factors in your setup.

I will provide you with instructions on how to profile your setup soon. In meanwhile, upload your existing heap dump to Google Drive, maybe it will shed some light on the situation.

taburet commented 7 years ago

I think we should stick to this plan:

  1. Analyse the existing heap dump.

  2. If 1 wont help, monitor GC/memory stats and take a new heap dump when the heap is almost full.

  3. If 2 wont help, enable and collect the GC logs as described here and here.

  4. If 3 wont help or if it's obvious that problem is not in GC, profile execution with Honest Profiler.

pinakdeshpande commented 7 years ago

Sure, please share the correct server configs to tune it better. I have uploaded the core dump on google drive and shared you a link.

taburet commented 7 years ago

The heap dump contains an extremely long string of 538 968 062 chars in length, this string along occupies 1GB+ of the heap. Unfortunately, the string has no strong references, so I cannot track its origin. It starts with "service_state[SPTimerV4]", seems like the rest of it is garbage.

Please inspect your code base and data sources for occurrences of this string. I suspect it either parsed incorrectly in the code or the data source contains some invalid inputs. Another possibility is that it was incorrectly stored/loaded from the database, but let's find the place from where the string is came from before.

pinakdeshpande commented 7 years ago

We had some issue with the server and we had to restart it. I have taken a core dump before the restart (pid: 23266) and uploaded on the drive. There is one more strange thing observed, every time after DB restart, within first 5 mins itself the server throws OOM Heap Space error and generates heap dump, I have also uploaded that (pid:22699) and the GC logs which were already enabled.
I will check if I get any references to "service_state" string you were mentioning.

taburet commented 7 years ago

Analysed the latest heap dump (pid:22699). Same story, but there are many such extremely long strings this time and luckily there are strong references to them. All of the strings are came from the network layer. The common pattern is "some meaningful prefix \n many megabytes of garbage (\uFFFFs)". Some examples: "vfs.fs.discovery\n 1GB of garbage", "agent.hostname\n 0.5GB of garbage", "kernel.maxproc\n 0.3GB of garbage", etc.

Please check the strings being sent to the server from your side, we will check the receiving part on our side.

taburet commented 7 years ago

BTW, the strings came by the HTTP connections.

taburet commented 7 years ago

@pinakdeshpande are you using HTTP connections to the OrientDB server?

pinakdeshpande commented 7 years ago

Yes for writing we use java APIs and for reading the data I'm firing queries from the REST HTTP URL Connection. But I dont find those characters in any of my query strings. All those strings points me to Zabbix monitoring system which is enabled on PROD environment, I will check with our IT Team and get back to you.

I will also post the sample query that I'm using, even those queries are not responding consistently.

select from (traverse in(ServiceStatsEdge) from (select from Service where serviceId >= 100 and serviceId <= 200)) where @class = 'ServiceStatisticsVertex' and timeStamp >= ?

here timeStamp = sysDate() - 15 mins

We have a NON_UNIQUE composite index on serviceId and timeStamp.

pinakdeshpande commented 7 years ago

I could also see those strings in the dump. The Leak Analysis Report says following

The thread com.orientechnologies.orient.server.network.protocol.http.ONetworkProtocolHttpDb @ 0x6cc0332b0 OrientDB HTTP Connection /<host1>:2480<-/<host2>:55121 keeps local variables with total size 1,077,941,344 (29.38%) bytes. The memory is accumulated in one instance of "char[]" loaded by "<system class loader>". The stacktrace of this Thread is available. See stacktrace.

Does that mean ONetworkProtocolHttpDb class is not cleaning up those strings properly and keep accumulating them locally?

andrii0lomakin commented 7 years ago

Sorry we are storage engine team not http protocol team, it was clear from begging that it is not related to storage but we decided to help to find a cause, assigned to @maggiolo00

pinakdeshpande commented 7 years ago

@laa , @taburet Thanks a lot for helping us here. :)

We have disabled the external monitoring now. It has released all those char[] in the heap.

I'm still seeing following in the heap dump. Is this expected? "com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache" loaded by "sun.misc.Launcher$AppClassLoader @ 0x6c0002300" occupies 222,567,512 (11.24%) bytes. The memory is accumulated in one instance of "com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache" loaded by "sun.misc.Launcher$AppClassLoader @ 0x6c0002300".

andrii0lomakin commented 7 years ago

Hi @pinakdeshpande , that is @taburet who helped you )) thank you for feedback. Yes, havingO2Cache which occupies 11.24% is totally OK. In future releases, we decrease those numbers.

andrii0lomakin commented 7 years ago

@pinakdeshpande do you have any issues at the moment?

pinakdeshpande commented 7 years ago

No on the storage side, i guess. But the query responses are not very stable and consistent and would really need some help on that. Should I post my questions here on Git?

wolf4ood commented 6 years ago

hi @pinakdeshpande

do you still need help on this?