@rdelangh I suppose the problem is not the size of the cluster but the overall size of the database. What is the size of the whole database in GB, and what is the size of the RAM used for the disk cache?
Next question: do you load data into clusters that already have their indexes, or do you load it separately, without indexes? One more question: what is your load pattern, do you load one cluster at a time, or are many clusters loaded at the same time?
@laa
Q1: size of this database is more or less 500 GB
Q2: size of RAM: 128 GB
Q3: size of disk cache: if you mean the parameter "storage.diskCache.bufferSize", it is set to "32000"
Q4: data is being loaded while the indexes exist, because there is also a unique index which must prevent us from loading duplicates
Q5: we load data into the class, not into specific clusters; so I assume that ODB spreads the data automagically across the 6 clusters of the class
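For reference, a minimal sketch of the two insert styles in the OrientDB 2.2 Java API (database URL, credentials and field values are placeholders; the cluster name assumes the naming convention used in this thread): saving through the class lets ODB pick a cluster via its selection strategy (round-robin by default), while naming a cluster pins the record:

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class InsertStyles {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // class-level save: ODB picks one of the class's 6 clusters (round-robin by default)
      db.save(new ODocument("cdr_eno_20170122").field("ENV", "PROD"));
      // cluster-level save: the record is pinned to the named cluster
      db.save(new ODocument("cdr_eno_20170122").field("ENV", "PROD"), "cdr_eno_20170122_2");
    } finally {
      db.close();
    }
  }
}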
We are now rebuilding the unique index of this class, because we did a lot of deletes during the past days (issues with ODB meant that data loading hit many timeouts after partial load jobs, so new loads failed because of duplicates that first had to be deleted). I suspect that the unique index file (the one of about 40 GB) has many 'gaps' in it, which would explain the very slow loading of new records now.
Hi, as a first step I suggest the following: increase the parameter "storage.diskCache.bufferSize" from 32000 to 82000. I think you will see a big difference, provided of course you run on a dedicated server.
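For a server installation this setting is normally passed as a JVM flag in server.sh (-Dstorage.diskCache.bufferSize=82000, as seen later in this thread); in embedded mode the same key can be set programmatically. A minimal sketch, assuming the value is in megabytes and is applied before any storage is opened:

import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public class DiskCacheConfig {
  public static void main(String[] args) {
    // "storage.diskCache.bufferSize" is expressed in MB; it must be set
    // before any storage is opened, otherwise the running cache keeps its old size
    OGlobalConfiguration.DISK_CACHE_SIZE.setValue(82000);
  }
}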
Anyway, what was the reason for setting the buffer size manually?
@laa thx for your comments.
The buffer size was already adjusted during our first (and failed) attempts to get such a big database running with ODB, as per comments from your colleagues ;-)
The servers are dedicated, but we need to reserve a good chunk of their RAM for the caching done by the ZFS filesystem manager that we have running. We can raise the ODB buffer size further to, say, 45% of total RAM and leave another 45% for the ZFS cache.
The index rebuild is now running; we will first wait for it to complete, to see if it has any effect on performance.
hmm ... :
...
2017-02-01 12:26:16:622 INFO {db=mobile} --> 73.90% progress, 94,155,597 indexed so far (46 items/sec) [OIndexRebuildOutputListener]
2017-02-01 12:26:26:636 INFO {db=mobile} --> 73.90% progress, 94,156,110 indexed so far (51 items/sec) [OIndexRebuildOutputListener]
% ls -lh idx*20170122*
-rw-rw-r-- 1 orientdb orientdb 1.0K Jan 31 13:03 idx_cdr_eno_20170122_0.nbt
-rw-rw-r-- 1 orientdb orientdb 27G Feb 1 12:28 idx_cdr_eno_20170122_0.sbt
-> what's going on?
This way, index rebuilding is worthless. So I stop it, and will trash the index altogether, then build it new. See how that goes...
Grr... unpredictable program.
@rdelangh please do not do that, just send me a thread dump. I suppose you have simply reached the free-space limit of the program,
or the WAL limit.
@rdelangh could you also do me a favour and run the same test on the ycsb_freeze branch? Maybe all the issues which we are discussing are already fixed there, and we just need to move the fixes to the main branch.
@laa
@rdelangh I suppose you meant a memory dump, which may be a few GB. Thread dumps usually consume a few KB at most. A thread dump is a set of text lines which contain information about the current execution position of each thread. Could you take it by running the jstack <pid> command?
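Besides jstack, the same kind of dump can be produced from inside any JVM process with the standard management API; a minimal sketch, with no OrientDB specifics assumed:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public class ThreadDump {
  public static void main(String[] args) {
    // one ThreadInfo per live thread: name, state and stack frames,
    // including which monitors/locks each thread holds or waits on
    for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
      System.out.print(info);
    }
  }
}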
ok, thread dump attached: odb-stackdump.txt
@rdelangh From what I see, you use only two threads to build the index; the rest of the threads are in "wait" state. You could increase the load speed by building independent clusters in parallel. From the thread dump all seems OK: threads are running and handling requests. Is the rebuild still halted? The only thing I see is that the index is always working against the disk, which is why the rebuild speed is so low. What is the size of the index?
Oh, I see, sorry, I missed it. So the size of the index is 27 GB, but your disk cache size is about 91 GB of RAM. I see now. We could create a command to put all (or almost all) of the index content into RAM in order to rebuild it faster.
@rdelangh so I propose the following solution: we could give the index higher priority in the cache for the time while it is being rebuilt. Would it be OK for you to create the clusters first and then rebuild the indexes? Did you try to create clusters without indexes? What was the load speed in that case?
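A minimal sketch of the parallel per-cluster load suggested above: one thread per cluster, each with its own database instance, since ODatabaseDocumentTx is not thread-safe (URL, credentials, field values and the cluster naming are placeholders/assumptions):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelClusterLoad {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(6);
    for (int i = 0; i < 6; i++) {
      final String cluster = "cdr_eno_20170122_" + i; // assumed cluster naming
      pool.submit(() -> {
        // one connection per thread: ODatabaseDocumentTx instances are not thread-safe
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
        try {
          for (int n = 0; n < 1000; n++) {
            // save directly into this thread's cluster so the 6 loads never contend
            db.save(new ODocument("cdr_eno_20170122").field("SEQUENCENUMBER", cluster + "-" + n), cluster);
          }
        } finally {
          db.close();
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}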
@rdelangh btw, what is the type of the index keys?
@laa thx for your follow-up
What do you mean with the remark "we could give the index higher priority in the cache for the time while it is being rebuilt"?
The keys of this unique index: ENV, ORIGTERMMSCID, CALLTRANSACTIONTYPE, SERVEDIMSI, PRT_DATE, OTHERPARTYNUMBER, CALLDURATION, EXCHANGEID, OTHERPARTYSMS, ACCESSORYNUMBER, SEQUENCENUMBER. All of these are STRING, except PRT_DATE, which is a DATETIME.
@rdelangh I see. The index rebuild is so slow because pages are fetched randomly from the index file, so the cache treats them as cold (each page is fetched only a few times) and does not want to keep them in memory. But I need to understand everything completely.
Is it correct that you increased the size of the disk cache but did not try a new load, and instead tried to rebuild an index?
Did you rebuild the index right after the server was started, or did you first perform several other operations on the index and then start the rebuild?
@rdelangh one more question: what alphabet do you use for your keys, is it the Latin alphabet?
And one more question: do you use tree indexes or hash indexes? Do you need to sort your keys?
@laa
Q1: Is it correct that you increased the size of the disk cache but did not try a new load, and instead tried to rebuild an index?
A1: no, the disk cache size was set like this during our installation of the software, and has not been changed since
Q2: Did you rebuild the index right after the server was started, or after several other operations on the index?
A2: the latter: many (very fast) load jobs were done, many records were deleted, then again (very slow) load jobs; then we stopped all activity and launched the "rebuild" operation from the web GUI
Q3: what alphabet do you use for your keys, is it the Latin alphabet?
A3: yes, Latin, normal ASCII, no special characters (such as "é" or "ç")
Q4: do you use tree indexes or hash indexes? Do you need to sort your keys?
A4: the unique index that we are trying to rebuild now is an SBTREE UNIQUE index:
CREATE INDEX idx_cdr_eno_20170122_0 ON cdr_eno_20170122(ENV, ORIGTERMMSCID, CALLTRANSACTIONTYPE, SERVEDIMSI, PRT_DATE, OTHERPARTYNUMBER, CALLDURATION, EXCHANGEID, OTHERPARTYSMS, ACCESSORYNUMBER, SEQUENCENUMBER) UNIQUE
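For comparison, a hedged sketch of the hash-based alternative discussed just below: the same key fields declared as UNIQUE_HASH_INDEX (the index name is hypothetical), which gives cheaper point lookups but no range scans:

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.sql.OCommandSQL;

public class HashIndexVariant {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // same composite key, but hash-based: fast point lookups, no ordered scans
      db.command(new OCommandSQL(
          "CREATE INDEX idx_cdr_eno_20170122_0_hash ON cdr_eno_20170122"
          + "(ENV, ORIGTERMMSCID, CALLTRANSACTIONTYPE, SERVEDIMSI, PRT_DATE, OTHERPARTYNUMBER,"
          + " CALLDURATION, EXCHANGEID, OTHERPARTYSMS, ACCESSORYNUMBER, SEQUENCENUMBER)"
          + " UNIQUE_HASH_INDEX")).execute();
    } finally {
      db.close();
    }
  }
}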
@rdelangh if your value for "storage.diskCache.bufferSize" equals 32000, why do you think that the disk cache consumes 92 GB? What is your value for the heap memory? The second question: why do you not use hash indexes, do you need to perform range queries?
@laa I did not write that the 92 GB is "disk cache"; I wrote that it is "cache". At this moment it is some 48 GB:
% vmstat 4
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 2 302036 42384032 11068 49863356 0 0 234 105 2 1 1 0 97 2 0
0 2 302036 42395128 11068 49852360 0 0 16532 924 2280 7057 0 0 96 4 0
...
Yes, range queries are needed, hence the SBTREE index.
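To illustrate, a sketch of the kind of query that needs the SBTREE: equality on the leading composite-key fields plus a range on PRT_DATE (all values are hypothetical). A hash index cannot serve the BETWEEN part because it keeps no key order:

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import java.util.List;

public class RangeQuery {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // equality on the leading composite-key fields, then a range on PRT_DATE;
      // the SBTREE keeps keys ordered, so BETWEEN becomes an index range scan
      List<ODocument> rows = db.command(new OCommandSQL(
          "SELECT FROM cdr_eno_20170122 WHERE ENV = ? AND ORIGTERMMSCID = ?"
          + " AND CALLTRANSACTIONTYPE = ? AND SERVEDIMSI = ?"
          + " AND PRT_DATE BETWEEN '2017-01-22 00:00:00' AND '2017-01-23 00:00:00'"))
        .execute("PROD", "MSC01", "MO", "206011234567890");
      System.out.println(rows.size() + " records in range");
    } finally {
      db.close();
    }
  }
}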
@rdelangh what are your settings for heap memory then?
@rdelangh I suggest you increase the disk cache and decrease the amount of heap memory accordingly. Also, we will provide you with an implementation of optional compression of index keys, so you will be able to cache more data in memory. About the slow index rebuild: you performed many actions, so the disk cache has already gathered statistics about hot pages. Because the index pages you are rebuilding are accessed randomly and very rarely, the disk cache treats them as cold pages and does not want to cache them. A bigger amount of memory for the disk cache may also help in this situation. So, in general, I wait for your feedback after you increase the disk cache size to 45%. Meanwhile, once we start to work on this issue, we will implement string compression for you, so the indexes will consume less space and will be cached more often.
@laa ok, thx for the feedback. I changed the params as follows: -Xms1G -Xmx8G -Dstorage.diskCache.bufferSize=48000
Stopped the server, then started it again. The logs now say this:
2017-02-01 16:42:41:900 INFO [OVariableParser.resolveVariables] Error on resolving property: distributed [orientechnologies]
2017-02-01 16:42:41:903 WARNI Authenticated clients can execute any kind of code into the server by using the following allowed languages: [sql] [OServerSideScriptInterpreter]
2017-02-01 16:42:41:905 INFO OrientDB Studio available at http://10.49.69.190:2480/studio/index.html [OServer]
2017-02-01 16:42:41:905 INFO OrientDB Server is active v2.2.15 (build 2.2.x@rebf7b80c08613e871633edeb39c5c985e26b176c; 2017-01-19 07:39:00+0000). [OServer]
2017-02-01 16:43:05:182 WARNI {db=mobile} Storage 'mobile' was not closed properly. Will try to recover from write ahead log [OLocalPaginatedStorage]
2017-02-01 16:43:05:182 INFO {db=mobile} Looking for last checkpoint... [OLocalPaginatedStorage]
2017-02-01 16:43:05:183 INFO {db=mobile} FULL checkpoint found. [OLocalPaginatedStorage]
2017-02-01 16:43:05:183 WARNI {db=mobile} FULL checkpoint has not completed. [OLocalPaginatedStorage]
2017-02-01 16:43:05:184 WARNI {db=mobile} Restore will start from the previous checkpoint. [OLocalPaginatedStorage]
2017-02-01 16:43:05:185 INFO {db=mobile} Data restore procedure from FUZZY checkpoint is started. [OLocalPaginatedStorage]
2017-02-01 16:43:05:188 INFO {db=mobile} 1 operations were processed, current LSN is LSN{segment=31008, position=16} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
2017-02-01 16:43:05:189 WARNI {db=mobile} Non tx operation was used during data modification we will need index rebuild. [OLocalPaginatedStorage]
2017-02-01 16:43:35:194 INFO {db=mobile} 9795 operations were processed, current LSN is LSN{segment=31008, position=2640565} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
...
2017-02-01 17:51:50:444 INFO {db=mobile} 2110000 operations were processed, current LSN is LSN{segment=31012, position=28131746} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
2017-02-01 17:51:57:233 WARNI {db=mobile} Record OFuzzyCheckpointStartRecord{lsn=LSN{segment=31012, position=29736269}} com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=LSN{segment=31011, position=97153164}} will be skipped during data restore [OLocalPaginatedStorage]
2017-02-01 17:51:57:404 WARNI {db=mobile} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=31012, position=29841942}} will be skipped during data restore [OLocalPaginatedStorage]
2017-02-01 17:52:02:118 INFO {db=mobile} 2120000 operations were processed, current LSN is LSN{segment=31012, position=30845999} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
...
(still busy, will report again when completed)
update:
...
2017-02-01 18:38:46:953 INFO {db=mobile} 8390000 operations were processed, current LSN is LSN{segment=31024, position=83146427} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
2017-02-01 18:38:48:670 SEVER {db=mobile} Invalid WAL record type was passed class com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFileTruncatedWALRecord. Given record will be skipped. [OLocalPaginatedStorage]
2017-02-01 18:38:48:670 SEVER {db=mobile} Invalid WAL record type was passed class com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFileTruncatedWALRecord. Given record will be skipped. [OLocalPaginatedStorage]
2017-02-01 18:38:48:671 WARNI {db=mobile} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFullCheckpointStartRecord{lsn=LSN{segment=31024, position=85521573}, previousCheckpoint=LSN{segment=31024, position=77850119}} will be skipped during data restore [OLocalPaginatedStorage]
2017-02-01 18:38:48:671 INFO {db=mobile} Storage data recover was completed [OLocalPaginatedStorage]
2017-02-01 19:06:25:810 INFO {db=mobile} Wait till indexes restore after crash was finished. [OIndexManagerShared]
2017-02-01 19:06:25:948 INFO {db=mobile} Start creation of index 'OFunction.name' [OIndexManagerShared$RecreateIndexesTask]
...
and
% grep 'rebuild is going' /opt/orientdb/orientdb-community-2.2.15/log/orient-server.log.0
2017-02-01 19:06:26:056 INFO {db=mobile} Index 'OFunction.name' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-01 19:06:32:482 INFO {db=mobile} Index 'idx_cdr_eno_20161218_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-01 22:07:21:218 INFO {db=mobile} Index 'ORole.name' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-01 22:09:29:476 INFO {db=mobile} Index 'idx_cdr_eno_20161225_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 00:31:20:772 INFO {db=mobile} Index 'idx_cdr_eno_20170108_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 05:20:46:894 INFO {db=mobile} Index 'idx_cdr_eno_20170108_1' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 06:44:31:174 INFO {db=mobile} Index 'idx_cdr_eno_20161218_2' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 09:40:33:389 INFO {db=mobile} Index 'idx_cdr_eno_20170101_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
Now it is rebuilding ALL of the indexes of ALL databases, apparently!! This will take days!
Done. Attempting restart...
-> so somehow, something inside ODB gets saturated, or accumulates into a state where ODB can no longer process input data. -> how can this be identified?
And now we seem to have one more problem in this never-ending saga:
trying to query an index directly gives an error:
select from INDEX:idx_cdr_eno_20170122_1
Gives "java.lang.NullPointerException"
In the server logs:
Internal server error:
java.lang.NullPointerException
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.fetchEntriesFromIndexCursor(OCommandExecutorSQLSelect.java:2466)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.searchInIndex(OCommandExecutorSQLSelect.java:2780)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.assignTarget(OCommandExecutorSQLSelect.java:532)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.executeSearch(OCommandExecutorSQLSelect.java:512)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.execute(OCommandExecutorSQLSelect.java:488)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:74)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:2592)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:2538)
at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:69)
at com.orientechnologies.orient.server.network.protocol.http.command.post.OServerCommandPostCommand.execute(OServerCommandPostCommand.java:106)
at com.orientechnologies.orient.graph.server.command.OServerCommandPostCommandGraph.execute(OServerCommandPostCommandGraph.java:37)
at com.orientechnologies.orient.server.network.protocol.http.ONetworkProtocolHttpAbstract.service(ONetworkProtocolHttpAbstract.java:169)
at com.orientechnologies.orient.server.network.protocol.http.ONetworkProtocolHttpAbstract.execute(ONetworkProtocolHttpAbstract.java:622)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)
Indexes corrupted, again??
This failing index is a LUCENE index; querying another SBTREE UNIQUE index of the same class works fine.
@rdelangh could you create a separate issue for the Lucene index? About the saturation: it can be identified by taking a thread dump while it is happening. Could you take one?
@rdelangh about this saturation: I suppose it is already fixed in the new version, but let us check it.
@rdelangh about this
select from INDEX:idx_cdr_eno_20170122_1
a plain scan is not supported by the Lucene index. You can use it with a filter or a count:
select from INDEX:idx_cdr_eno_20170122_1 where key ..
select count(*) from INDEX:idx_cdr_eno_20170122_1
Since it's an inverted index, the index iterator has not been implemented
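For completeness, a sketch of the two supported forms called from the Java API (the key value is hypothetical):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import java.util.List;

public class LuceneIndexQueries {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // count is supported on the Lucene index ...
      List<?> count = db.command(new OCommandSQL(
          "select count(*) from INDEX:idx_cdr_eno_20170122_1")).execute();
      System.out.println(count);
      // ... and so is a key lookup; a bare "select from INDEX:..." is not,
      // because the inverted index exposes no iterator
      List<?> hits = db.command(new OCommandSQL(
          "select from INDEX:idx_cdr_eno_20170122_1 where key = ?")).execute("32470123456");
      System.out.println(hits.size() + " matching entries");
    } finally {
      db.close();
    }
  }
}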
@maggiolo00 I think we should provide a meaningful exception instead of an NPE. Could you implement it as a separate issue?
The program which is loading data into dbase "mobile" suffers from these timeouts again.
So I attempted a stack trace:
# /usr/lib/jvm/jdk1.8.0_92/bin/jstack 29671 > /tmp/29671-jstack.txt
29671: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
# /usr/lib/jvm/jdk1.8.0_92/bin/jstack -F 29671 > /tmp/29671-jstack.txt
File attached: 29671-jstack.txt
That data loading into dbase "mobile" is now blocking entirely, while another data loading into dbase "cdrarch" is happily proceeding, so I tried another stack trace:
# /usr/lib/jvm/jdk1.8.0_92/bin/jstack -F 29671 > /tmp/29671-b-jstack.txt
java.lang.NullPointerException
at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:88)
at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:45)
at sun.jvm.hotspot.tools.JStack.run(JStack.java:66)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
at sun.tools.jstack.JStack.main(JStack.java:106)
-> fairly stuck ?
meanwhile, the system is not out of memory:
# free -m
total used free shared buff/cache available
Mem: 128809 71604 54121 25 3082 55714
Swap: 130943 0 130943
# vmstat 4
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 2 0 55387960 86404 3089596 0 0 908 18942 5330 10927 1 1 94 4 0
1 3 0 55434460 86404 3089500 0 0 856 15413 4692 7787 1 1 94 4 0
1 2 0 55444076 86404 3067696 0 0 902 13750 6073 7681 1 1 94 4 0
0 2 0 55442068 86404 3068956 0 0 1199 6135 3593 4776 1 0 95 4 0
# ps -aefl
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S orientdb 29671 1 61 80 0 - 38739983 futex_ 13:04 pts/4 01:36:31 java -server -Xms1G -Xmx8G -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Dstorage.openFiles.limit=1024 -Denvironment.dumpCfgAtStartup=true -Dquery.parallelAuto=true -Dquery.parallelMinimumRecords=50000 -Dstorage.wal.syncOnPageFlush=false -Dstorage.diskCache.bufferSize=54000 -Dmemory.chunk.size=33554432 -Djava.util.logging.config.file=/opt/orientdb/orientdb-community-2.2.15/config/orientdb-server-log.properties -Dorientdb.config.file=/opt/orientdb/orientdb-community-2.2.15/config/orientdb-server-config.xml -Dorientdb.www.path=/opt/orientdb/orientdb-community-2.2.15/www -Dorientdb.build.number=2.2.x@rebf7b80c08613e871633edeb39c5c985e26b176c; 2017-01-19 07:39:00+0000 -cp /opt/orientdb/orientdb-community-2.2.15/lib/orientdb-server-2.2.15.jar:/opt/orientdb/orientdb-community-2.2.15/lib/*:/opt/orientdb/orientdb-community-2.2.15/plugins/* com.orientechnologies.orient.server.OServerMain
Quite some open files:
# lsof -p 29671| grep /opt |wc -l
4315
@rdelangh could you print and send me the list of open files? Also, according to the stack trace which you sent me, you hit the size limit; could you send me your current WAL size? Are you able to take a thread dump now?
@laa I have set "dumpCfgAtStartup = true", so at startup I get the following info about "wal":
Not sure if this is the information about the WAL that you want; if you need other information, please add the instructions that I must use to obtain it.
Thread dump: likely impossible to take because the ODB server process has a huge memory footprint (see above, approx. 35 GB)
And the open-files list is attached: odb-29671-lsof.txt.gz
@rdelangh thank you for the list of open files. A thread dump is not a heap dump. A thread dump is a list of stack traces, each of which points to the execution position of the program in a single thread. You have already taken a thread dump with a stack. I already looked at your thread dump: the first time, OrientDB paused because you reached the limit of the WAL size and free space, which is why I asked about the WAL limits. I suppose that you use the default settings, in which the WAL max size is only 4 GB; once it reaches this size, a synchronous flush of data is performed, and as a result you see pauses in execution. We already changed the flush mechanics to avoid those pauses. It is fixed in the https://github.com/orientechnologies/orientdb/tree/ycsb_freeze branch, but you need to wait for 3.0 to load-test it. I can create a build of the ycsb_freeze branch for you if you wish to check it. Meanwhile, to prevent pauses during load, you may increase the max size of the WAL (the storage.wal.maxSize parameter defines the size of the WAL in megabytes) to a reasonable value, say 20-50 GB, depending on your disk size. Putting the pauses aside, how did the load performance change after you increased the disk cache size?
@laa The config dump at startup lists these WAL settings (see above), which I have not changed in the "server.sh" script; I notice indeed this line:
storage.wal.maxSize = 4096
@rdelangh I updated the comment above. What is the current status of the load, do you still see pauses on the frozen client?
@laa I only keep the job stream running which sends data to dbase "cdrarch", with no concurrency (only one program is sending data); I have temporarily disabled the parallel job streams that send data to dbase "mobile"
-> I will wait till that first stream (the most important one) has recovered from its backlog; then I will make new attempts with that (blocking) second stream with parallel jobs
@laa thx for the follow-up!
@rdelangh I suppose a good value for you is about 16 GB, or a bit more, 20 GB. What is your disk size? What is your DB size?
@laa
@rdelangh the WAL is placed on a file system. In that case I suggest you set the WAL size to 50 GB, which equals "storage.wal.maxSize = 51200"; it will substantially decrease the pauses during data load.
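On the server this is one more -D flag in server.sh; in embedded mode the equivalent would be a sketch like:

import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public class WalConfig {
  public static void main(String[] args) {
    // "storage.wal.maxSize" is in MB: 51200 MB = 50 GB;
    // the server-side equivalent is -Dstorage.wal.maxSize=51200 in server.sh
    OGlobalConfiguration.WAL_MAX_SIZE.setValue(51200);
  }
}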
We create classes per week of data, with 6 pre-created clusters per class. We store some 120M documents per such class, which translates into cluster files of about 20 GB each on the server. For example, for the class of the week starting on Jan-22 2017:
Similarly, the index files are obviously also fairly big:
Compared to the files for the class whose data started only 2 days ago, on Jan-29 2017:
Loading extra documents into such a class, with its fairly large cluster and index files, goes an order of magnitude slower than loading documents into a class which does not hold many documents yet and whose files are consequently still much smaller:
Anything that can be tuned on this?
Obviously, the cluster files can be reduced in size by creating more clusters per class, so that there is less data per cluster; see the sketch below. This would break our cluster-sharding logic and would imply a lot of changes, but it is not impossible. As for the index files: that is internal ODB business, where a single file contains an entire index and hence will always grow very big.
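If we did go the route of more clusters per class, it could look like the sketch below (ALTER CLASS ... ADDCLUSTER is standard OrientDB SQL; the names and the target count of 12 are made up):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.sql.OCommandSQL;

public class AddClusters {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // grow the class from 6 to 12 clusters, so each cluster file stays smaller
      for (int i = 6; i < 12; i++) {
        db.command(new OCommandSQL(
            "ALTER CLASS cdr_eno_20170122 ADDCLUSTER cdr_eno_20170122_" + i)).execute();
      }
    } finally {
      db.close();
    }
  }
}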
All suggestions are very welcome!