@rdelangh I suppose the problem is not the size of the cluster but the overall size of the database. What is the size of the whole database in GB, and what is the size of the RAM used for the disk cache?
Next question: do you load data into clusters that already have their indexes, or do you load it separately, without indexes? One more question: what is your load pattern, do you load one cluster at a time, or are many clusters loaded at the same time?
@laa
Q1: size of this database is more or less 500 GB
Q2: size of RAM: 128 GB
Q3: size of disk cache: if you mean the parameter "storage.diskCache.bufferSize", it is set to "32000"
Q4: data is being loaded while the indexes exist, because there is also a unique index which must prevent us from loading duplicates
Q5: we load data into the class, not into specific clusters; so I assume that ODB spreads the data automagically across the 6 clusters of the class
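For reference, a minimal sketch of the two insert styles in the OrientDB 2.2 Java API (database URL, credentials and field values are placeholders; the cluster name assumes the naming convention used in this thread): saving through the class lets ODB pick a cluster via its selection strategy (round-robin by default), while naming a cluster pins the record:

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class InsertStyles {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // class-level save: ODB picks one of the class's 6 clusters (round-robin by default)
      db.save(new ODocument("cdr_eno_20170122").field("ENV", "PROD"));
      // cluster-level save: the record is pinned to the named cluster
      db.save(new ODocument("cdr_eno_20170122").field("ENV", "PROD"), "cdr_eno_20170122_2");
    } finally {
      db.close();
    }
  }
}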
We are now rebuilding the unique index of this class, because we did a lot of deletes during the past days (issues with ODB meant that data loading hit many timeouts after partial load jobs, so new loads failed because of duplicates that first had to be deleted). I suspect that the unique index file (the one of about 40 GB) has many 'gaps' in it, which would explain the very slow loading of new records now.
Hi, as a first step I suggest the following: increase the parameter "storage.diskCache.bufferSize" from 32000 to 82000. I think you will see a big difference, provided of course you run on a dedicated server.
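For a server installation this setting is normally passed as a JVM flag in server.sh (-Dstorage.diskCache.bufferSize=82000, as seen later in this thread); in embedded mode the same key can be set programmatically. A minimal sketch, assuming the value is in megabytes and is applied before any storage is opened:

import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public class DiskCacheConfig {
  public static void main(String[] args) {
    // "storage.diskCache.bufferSize" is expressed in MB; it must be set
    // before any storage is opened, otherwise the running cache keeps its old size
    OGlobalConfiguration.DISK_CACHE_SIZE.setValue(82000);
  }
}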
Anyway, what was the reason for setting the buffer size manually?
@laa thx for your comments.
The buffer size was already adjusted during our first (and failed) attempts to get such a big database running with ODB, as per comments from your colleagues ;-)
The servers are dedicated, but we need to reserve a good chunk of their RAM for the caching done by the ZFS filesystem manager that we have running. We can raise the ODB buffer size further to, say, 45% of total RAM and leave another 45% for the ZFS cache.
The index rebuild is now running; we will first wait for it to complete, to see if it has any effect on performance.
hmm ... :
...
2017-02-01 12:26:16:622 INFO {db=mobile} --> 73.90% progress, 94,155,597 indexed so far (46 items/sec) [OIndexRebuildOutputListener]
2017-02-01 12:26:26:636 INFO {db=mobile} --> 73.90% progress, 94,156,110 indexed so far (51 items/sec) [OIndexRebuildOutputListener]
% ls -lh idx*20170122*
-rw-rw-r-- 1 orientdb orientdb 1.0K Jan 31 13:03 idx_cdr_eno_20170122_0.nbt
-rw-rw-r-- 1 orientdb orientdb 27G Feb 1 12:28 idx_cdr_eno_20170122_0.sbt
-> what's going on?
This way, index rebuilding is worthless. So I stop it, and will trash the index altogether, then build it new. See how that goes...
Grr... unpredictable program.
@rdelangh please do not do that, just send me a thread dump. I suppose you have simply reached the free-space limit of the program,
or the WAL limit.
@rdelangh could you also do me a favour and run the same test on the ycsb_freeze branch? Maybe all the issues which we are discussing are already fixed there, and we just need to move the fixes to the main branch.
@laa
@rdelangh I suppose you meant a memory dump, which may be a few GB. Thread dumps usually consume a few KB at most. A thread dump is a set of text lines which contain information about the current execution position of each thread. Could you take it by running the jstack <pid> command?
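Besides jstack, the same kind of dump can be produced from inside any JVM process with the standard management API; a minimal sketch, with no OrientDB specifics assumed:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public class ThreadDump {
  public static void main(String[] args) {
    // one ThreadInfo per live thread: name, state and stack frames,
    // including which monitors/locks each thread holds or waits on
    for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
      System.out.print(info);
    }
  }
}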
ok, thread dump attached: odb-stackdump.txt
@rdelangh From what I see, you use only two threads to build the index; the rest of the threads are in "wait" state. You could increase the load speed by building independent clusters in parallel. From the thread dump all seems OK: threads are running and handling requests. Is the rebuild still halted? The only thing I see is that the index is always working against the disk, which is why the rebuild speed is so low. What is the size of the index?
Oh, I see, sorry, I missed it. So the size of the index is 27 GB, but your disk cache size is about 91 GB of RAM. I see now. We could create a command to put all (or almost all) of the index content into RAM in order to rebuild it faster.
@rdelangh so I propose the following solution: we could give the index higher priority in the cache for the time while it is being rebuilt. Would it be OK for you to create the clusters first and then rebuild the indexes? Did you try to create clusters without indexes? What was the load speed in that case?
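A minimal sketch of the parallel per-cluster load suggested above: one thread per cluster, each with its own database instance, since ODatabaseDocumentTx is not thread-safe (URL, credentials, field values and the cluster naming are placeholders/assumptions):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelClusterLoad {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(6);
    for (int i = 0; i < 6; i++) {
      final String cluster = "cdr_eno_20170122_" + i; // assumed cluster naming
      pool.submit(() -> {
        // one connection per thread: ODatabaseDocumentTx instances are not thread-safe
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
        try {
          for (int n = 0; n < 1000; n++) {
            // save directly into this thread's cluster so the 6 loads never contend
            db.save(new ODocument("cdr_eno_20170122").field("SEQUENCENUMBER", cluster + "-" + n), cluster);
          }
        } finally {
          db.close();
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}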
@rdelangh btw, what is the type of the index keys?
@laa thx for your follow-up
What do you mean with the remark "we could give the index higher priority in the cache for the time while it is being rebuilt"?
The keys of this unique index: ENV, ORIGTERMMSCID, CALLTRANSACTIONTYPE, SERVEDIMSI, PRT_DATE, OTHERPARTYNUMBER, CALLDURATION, EXCHANGEID, OTHERPARTYSMS, ACCESSORYNUMBER, SEQUENCENUMBER. All of these are STRING, except PRT_DATE, which is a DATETIME.
@rdelangh I see. The index rebuild is so slow because pages are fetched randomly from the index file, so the cache treats them as cold (each page is fetched only a few times) and does not want to keep them in memory. But I need to understand everything completely.
Is it correct that you increased the size of the disk cache but did not try a new load, and instead tried to rebuild an index?
Did you rebuild the index right after the server was started, or did you first perform several other operations on the index and then start the rebuild?
@rdelangh one more question: what alphabet do you use for your keys, is it the Latin alphabet?
And one more question: do you use tree indexes or hash indexes? Do you need to sort your keys?
@laa
Q1: Is it correct that you increased the size of the disk cache but did not try a new load, and instead tried to rebuild an index?
A1: no, the disk cache size was set like this during our installation of the software, and has not been changed since
Q2: Did you rebuild the index right after the server was started, or after several other operations on the index?
A2: the latter: many (very fast) load jobs were done, many records were deleted, then again (very slow) load jobs; then we stopped all activity and launched the "rebuild" operation from the web GUI
Q3: what alphabet do you use for your keys, is it the Latin alphabet?
A3: yes, Latin, normal ASCII, no special characters (such as "é" or "ç")
Q4: do you use tree indexes or hash indexes? Do you need to sort your keys?
A4: the unique index that we are trying to rebuild now is an SBTREE UNIQUE index:
CREATE INDEX idx_cdr_eno_20170122_0 ON cdr_eno_20170122(ENV, ORIGTERMMSCID, CALLTRANSACTIONTYPE, SERVEDIMSI, PRT_DATE, OTHERPARTYNUMBER, CALLDURATION, EXCHANGEID, OTHERPARTYSMS, ACCESSORYNUMBER, SEQUENCENUMBER) UNIQUE
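For comparison, a hedged sketch of the hash-based alternative discussed just below: the same key fields declared as UNIQUE_HASH_INDEX (the index name is hypothetical), which gives cheaper point lookups but no range scans:

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.sql.OCommandSQL;

public class HashIndexVariant {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // same composite key, but hash-based: fast point lookups, no ordered scans
      db.command(new OCommandSQL(
          "CREATE INDEX idx_cdr_eno_20170122_0_hash ON cdr_eno_20170122"
          + "(ENV, ORIGTERMMSCID, CALLTRANSACTIONTYPE, SERVEDIMSI, PRT_DATE, OTHERPARTYNUMBER,"
          + " CALLDURATION, EXCHANGEID, OTHERPARTYSMS, ACCESSORYNUMBER, SEQUENCENUMBER)"
          + " UNIQUE_HASH_INDEX")).execute();
    } finally {
      db.close();
    }
  }
}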
@rdelangh if your value for "storage.diskCache.bufferSize" equals 32000, why do you think that the disk cache consumes 92 GB? What is your value for the heap memory? The second question: why do you not use hash indexes, do you need to perform range queries?
@laa I did not write that the 92 GB is "disk cache"; I wrote that it is "cache". At this moment it is some 48 GB:
% vmstat 4
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 2 302036 42384032 11068 49863356 0 0 234 105 2 1 1 0 97 2 0
0 2 302036 42395128 11068 49852360 0 0 16532 924 2280 7057 0 0 96 4 0
...
Yes, range queries are needed, hence the SBTREE index.
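To illustrate, a sketch of the kind of query that needs the SBTREE: equality on the leading composite-key fields plus a range on PRT_DATE (all values are hypothetical). A hash index cannot serve the BETWEEN part because it keeps no key order:

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import java.util.List;

public class RangeQuery {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // equality on the leading composite-key fields, then a range on PRT_DATE;
      // the SBTREE keeps keys ordered, so BETWEEN becomes an index range scan
      List<ODocument> rows = db.command(new OCommandSQL(
          "SELECT FROM cdr_eno_20170122 WHERE ENV = ? AND ORIGTERMMSCID = ?"
          + " AND CALLTRANSACTIONTYPE = ? AND SERVEDIMSI = ?"
          + " AND PRT_DATE BETWEEN '2017-01-22 00:00:00' AND '2017-01-23 00:00:00'"))
        .execute("PROD", "MSC01", "MO", "206011234567890");
      System.out.println(rows.size() + " records in range");
    } finally {
      db.close();
    }
  }
}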
@rdelangh what are your settings for heap memory then?
@rdelangh I suggest you increase the disk cache and decrease the amount of heap memory accordingly. Also, we will provide you with an implementation of optional compression of index keys, so you will be able to cache more data in memory. About the slow index rebuild: you performed many actions, so the disk cache has already gathered statistics about hot pages. Because the index pages you are rebuilding are accessed randomly and very rarely, the disk cache treats them as cold pages and does not want to cache them. A bigger amount of memory for the disk cache may also help in this situation. So, in general, I wait for your feedback after you increase the disk cache size to 45%. Meanwhile, once we start to work on this issue, we will implement string compression for you, so the indexes will consume less space and will be cached more often.
@laa ok, thx for the feedback. I changed the params as follows: -Xms1G -Xmx8G -Dstorage.diskCache.bufferSize=48000
Stopped the server, then started it again. The logs now say this:
2017-02-01 16:42:41:900 INFO [OVariableParser.resolveVariables] Error on resolving property: distributed [orientechnologies]
2017-02-01 16:42:41:903 WARNI Authenticated clients can execute any kind of code into the server by using the following allowed languages: [sql] [OServerSideScriptInterpreter]
2017-02-01 16:42:41:905 INFO OrientDB Studio available at http://10.49.69.190:2480/studio/index.html [OServer]
2017-02-01 16:42:41:905 INFO OrientDB Server is active v2.2.15 (build 2.2.x@rebf7b80c08613e871633edeb39c5c985e26b176c; 2017-01-19 07:39:00+0000). [OServer]
2017-02-01 16:43:05:182 WARNI {db=mobile} Storage 'mobile' was not closed properly. Will try to recover from write ahead log [OLocalPaginatedStorage]
2017-02-01 16:43:05:182 INFO {db=mobile} Looking for last checkpoint... [OLocalPaginatedStorage]
2017-02-01 16:43:05:183 INFO {db=mobile} FULL checkpoint found. [OLocalPaginatedStorage]
2017-02-01 16:43:05:183 WARNI {db=mobile} FULL checkpoint has not completed. [OLocalPaginatedStorage]
2017-02-01 16:43:05:184 WARNI {db=mobile} Restore will start from the previous checkpoint. [OLocalPaginatedStorage]
2017-02-01 16:43:05:185 INFO {db=mobile} Data restore procedure from FUZZY checkpoint is started. [OLocalPaginatedStorage]
2017-02-01 16:43:05:188 INFO {db=mobile} 1 operations were processed, current LSN is LSN{segment=31008, position=16} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
2017-02-01 16:43:05:189 WARNI {db=mobile} Non tx operation was used during data modification we will need index rebuild. [OLocalPaginatedStorage]
2017-02-01 16:43:35:194 INFO {db=mobile} 9795 operations were processed, current LSN is LSN{segment=31008, position=2640565} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
...
2017-02-01 17:51:50:444 INFO {db=mobile} 2110000 operations were processed, current LSN is LSN{segment=31012, position=28131746} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
2017-02-01 17:51:57:233 WARNI {db=mobile} Record OFuzzyCheckpointStartRecord{lsn=LSN{segment=31012, position=29736269}} com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=LSN{segment=31011, position=97153164}} will be skipped during data restore [OLocalPaginatedStorage]
2017-02-01 17:51:57:404 WARNI {db=mobile} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=31012, position=29841942}} will be skipped during data restore [OLocalPaginatedStorage]
2017-02-01 17:52:02:118 INFO {db=mobile} 2120000 operations were processed, current LSN is LSN{segment=31012, position=30845999} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
...
(still busy, will report again when completed)
update:
...
2017-02-01 18:38:46:953 INFO {db=mobile} 8390000 operations were processed, current LSN is LSN{segment=31024, position=83146427} last LSN is LSN{segment=31024, position=85521596} [OLocalPaginatedStorage]
2017-02-01 18:38:48:670 SEVER {db=mobile} Invalid WAL record type was passed class com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFileTruncatedWALRecord. Given record will be skipped. [OLocalPaginatedStorage]
2017-02-01 18:38:48:670 SEVER {db=mobile} Invalid WAL record type was passed class com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFileTruncatedWALRecord. Given record will be skipped. [OLocalPaginatedStorage]
2017-02-01 18:38:48:671 WARNI {db=mobile} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFullCheckpointStartRecord{lsn=LSN{segment=31024, position=85521573}, previousCheckpoint=LSN{segment=31024, position=77850119}} will be skipped during data restore [OLocalPaginatedStorage]
2017-02-01 18:38:48:671 INFO {db=mobile} Storage data recover was completed [OLocalPaginatedStorage]
2017-02-01 19:06:25:810 INFO {db=mobile} Wait till indexes restore after crash was finished. [OIndexManagerShared]
2017-02-01 19:06:25:948 INFO {db=mobile} Start creation of index 'OFunction.name' [OIndexManagerShared$RecreateIndexesTask]
...
and
% grep 'rebuild is going' /opt/orientdb/orientdb-community-2.2.15/log/orient-server.log.0
2017-02-01 19:06:26:056 INFO {db=mobile} Index 'OFunction.name' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-01 19:06:32:482 INFO {db=mobile} Index 'idx_cdr_eno_20161218_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-01 22:07:21:218 INFO {db=mobile} Index 'ORole.name' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-01 22:09:29:476 INFO {db=mobile} Index 'idx_cdr_eno_20161225_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 00:31:20:772 INFO {db=mobile} Index 'idx_cdr_eno_20170108_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 05:20:46:894 INFO {db=mobile} Index 'idx_cdr_eno_20170108_1' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 06:44:31:174 INFO {db=mobile} Index 'idx_cdr_eno_20161218_2' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
2017-02-02 09:40:33:389 INFO {db=mobile} Index 'idx_cdr_eno_20170101_0' was successfully created and rebuild is going to be started [OIndexManagerShared$RecreateIndexesTask]
Now it is rebuilding ALL of the indexes of ALL databases, apparently!! This will take days!
Done. Attempting restart...
-> so somehow, something inside ODB gets saturated, or accumulates into a state where ODB can no longer process input data. -> how can this be identified?
And now we seem to have one more problem in this never-ending saga:
trying to query an index directly gives an error:
select from INDEX:idx_cdr_eno_20170122_1
Gives "java.lang.NullPointerException"
In the server logs:
Internal server error:
java.lang.NullPointerException
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.fetchEntriesFromIndexCursor(OCommandExecutorSQLSelect.java:2466)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.searchInIndex(OCommandExecutorSQLSelect.java:2780)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.assignTarget(OCommandExecutorSQLSelect.java:532)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.executeSearch(OCommandExecutorSQLSelect.java:512)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLSelect.execute(OCommandExecutorSQLSelect.java:488)
at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:74)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:2592)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:2538)
at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:69)
at com.orientechnologies.orient.server.network.protocol.http.command.post.OServerCommandPostCommand.execute(OServerCommandPostCommand.java:106)
at com.orientechnologies.orient.graph.server.command.OServerCommandPostCommandGraph.execute(OServerCommandPostCommandGraph.java:37)
at com.orientechnologies.orient.server.network.protocol.http.ONetworkProtocolHttpAbstract.service(ONetworkProtocolHttpAbstract.java:169)
at com.orientechnologies.orient.server.network.protocol.http.ONetworkProtocolHttpAbstract.execute(ONetworkProtocolHttpAbstract.java:622)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)
Indexes corrupted, again??
This failing index is a LUCENE index; querying another SBTREE UNIQUE index of the same class works fine.
@rdelangh could you create a separate issue for the Lucene index? About the saturation: it can be identified by taking a thread dump while it is happening. Could you take one?
@rdelangh about this saturation: I suppose it is already fixed in the new version, but let us check it.
@rdelangh about this
select from INDEX:idx_cdr_eno_20170122_1
a plain scan is not supported by the Lucene index. You can use it with a filter or a count:
select from INDEX:idx_cdr_eno_20170122_1 where key ..
select count(*) from INDEX:idx_cdr_eno_20170122_1
Since it's an inverted index, the index iterator has not been implemented
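For completeness, a sketch of the two supported forms called from the Java API (the key value is hypothetical):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import java.util.List;

public class LuceneIndexQueries {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // count is supported on the Lucene index ...
      List<?> count = db.command(new OCommandSQL(
          "select count(*) from INDEX:idx_cdr_eno_20170122_1")).execute();
      System.out.println(count);
      // ... and so is a key lookup; a bare "select from INDEX:..." is not,
      // because the inverted index exposes no iterator
      List<?> hits = db.command(new OCommandSQL(
          "select from INDEX:idx_cdr_eno_20170122_1 where key = ?")).execute("32470123456");
      System.out.println(hits.size() + " matching entries");
    } finally {
      db.close();
    }
  }
}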
@maggiolo00 I think we should provide a meaningful exception instead of an NPE. Could you implement it as a separate issue?
The program which is loading data into dbase "mobile" suffers from these timeouts again.
So I attempted a stack trace:
# /usr/lib/jvm/jdk1.8.0_92/bin/jstack 29671 > /tmp/29671-jstack.txt
29671: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
# /usr/lib/jvm/jdk1.8.0_92/bin/jstack -F 29671 > /tmp/29671-jstack.txt
File attached: 29671-jstack.txt
That data loading into dbase "mobile" is now blocking entirely, while another data loading into dbase "cdrarch" is happily proceeding, so I tried another stack trace:
# /usr/lib/jvm/jdk1.8.0_92/bin/jstack -F 29671 > /tmp/29671-b-jstack.txt
java.lang.NullPointerException
at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:88)
at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:45)
at sun.jvm.hotspot.tools.JStack.run(JStack.java:66)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
at sun.tools.jstack.JStack.main(JStack.java:106)
-> fairly stuck ?
meanwhile, the system is not out of memory:
# free -m
total used free shared buff/cache available
Mem: 128809 71604 54121 25 3082 55714
Swap: 130943 0 130943
# vmstat 4
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 2 0 55387960 86404 3089596 0 0 908 18942 5330 10927 1 1 94 4 0
1 3 0 55434460 86404 3089500 0 0 856 15413 4692 7787 1 1 94 4 0
1 2 0 55444076 86404 3067696 0 0 902 13750 6073 7681 1 1 94 4 0
0 2 0 55442068 86404 3068956 0 0 1199 6135 3593 4776 1 0 95 4 0
# ps -aefl
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S orientdb 29671 1 61 80 0 - 38739983 futex_ 13:04 pts/4 01:36:31 java -server -Xms1G -Xmx8G -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=512g -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9 -Dstorage.openFiles.limit=1024 -Denvironment.dumpCfgAtStartup=true -Dquery.parallelAuto=true -Dquery.parallelMinimumRecords=50000 -Dstorage.wal.syncOnPageFlush=false -Dstorage.diskCache.bufferSize=54000 -Dmemory.chunk.size=33554432 -Djava.util.logging.config.file=/opt/orientdb/orientdb-community-2.2.15/config/orientdb-server-log.properties -Dorientdb.config.file=/opt/orientdb/orientdb-community-2.2.15/config/orientdb-server-config.xml -Dorientdb.www.path=/opt/orientdb/orientdb-community-2.2.15/www -Dorientdb.build.number=2.2.x@rebf7b80c08613e871633edeb39c5c985e26b176c; 2017-01-19 07:39:00+0000 -cp /opt/orientdb/orientdb-community-2.2.15/lib/orientdb-server-2.2.15.jar:/opt/orientdb/orientdb-community-2.2.15/lib/*:/opt/orientdb/orientdb-community-2.2.15/plugins/* com.orientechnologies.orient.server.OServerMain
Quite some open files:
# lsof -p 29671| grep /opt |wc -l
4315
@rdelangh could you print and send me the list of open files? Also, according to the stack trace which you sent me, you hit the size limit; could you send me your current WAL size? Are you able to take a thread dump now?
@laa I have set "dumpCfgAtStartup = true", so at startup I get the following info about "wal":
Not sure if this is the information about the WAL that you want; if you need other information, please add the instructions that I must use to obtain it.
Thread dump: likely impossible to take because the ODB server process has a huge memory footprint (see above, approx. 35 GB)
And the open-files list is attached: odb-29671-lsof.txt.gz
@rdelangh thank you for the list of open files. A thread dump is not a heap dump. A thread dump is a list of stack traces, each of which points to the execution position of the program in a single thread. You have already taken a thread dump with a stack. I already looked at your thread dump: the first time, OrientDB paused because you reached the limit of the WAL size and free space, which is why I asked about the WAL limits. I suppose that you use the default settings, in which the WAL max size is only 4 GB; once it reaches this size, a synchronous flush of data is performed, and as a result you see pauses in execution. We already changed the flush mechanics to avoid those pauses. It is fixed in the https://github.com/orientechnologies/orientdb/tree/ycsb_freeze branch, but you need to wait for 3.0 to load-test it. I can create a build of the ycsb_freeze branch for you if you wish to check it. Meanwhile, to prevent pauses during load, you may increase the max size of the WAL (the storage.wal.maxSize parameter defines the size of the WAL in megabytes) to a reasonable value, say 20-50 GB, depending on your disk size. Putting the pauses aside, how did the load performance change after you increased the disk cache size?
@laa The config dump at startup lists these WAL settings (see above), which I have not changed in the "server.sh" script; I notice indeed this line:
storage.wal.maxSize = 4096
@rdelangh I updated the comment above. What is the current status of the load, do you still see pauses on the frozen client?
@laa I only keep the job stream running which sends data to dbase "cdrarch", with no concurrency (only one program is sending data); I have temporarily disabled the parallel job streams that send data to dbase "mobile"
-> I will wait till that first stream (the most important one) has recovered from its backlog; then I will make new attempts with that (blocking) second stream with parallel jobs
@laa thx for the follow-up!
@rdelangh I suppose a good value for you is about 16 GB, or a bit more, 20 GB. What is your disk size? What is your DB size?
@laa
@rdelangh the WAL is placed on a file system. In that case I suggest you set the WAL size to 50 GB, which equals "storage.wal.maxSize = 51200"; it will substantially decrease the pauses during data load.
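On the server this is one more -D flag in server.sh; in embedded mode the equivalent would be a sketch like:

import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public class WalConfig {
  public static void main(String[] args) {
    // "storage.wal.maxSize" is in MB: 51200 MB = 50 GB;
    // the server-side equivalent is -Dstorage.wal.maxSize=51200 in server.sh
    OGlobalConfiguration.WAL_MAX_SIZE.setValue(51200);
  }
}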
We create classes per week of data, with 6 pre-created clusters per class. We store some 120M documents per such class, which translates into cluster files of about 20 GB each on the server. For example, for the class of the week starting on Jan-22 2017:
Similarly, the index files are obviously also fairly big:
Compared to the files for the class whose data started only 2 days ago, on Jan-29 2017:
Loading extra documents into such a class, with its fairly large cluster and index files, goes an order of magnitude slower than loading documents into a class which does not hold many documents yet and whose files are consequently still much smaller:
Anything that can be tuned on this?
Obviously, the cluster files can be reduced in size by creating more clusters per class, so that there is less data per cluster; see the sketch below. This would break our cluster-sharding logic and would imply a lot of changes, but it is not impossible. As for the index files: that is internal ODB business, where a single file contains an entire index and hence will always grow very big.
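If we did go the route of more clusters per class, it could look like the sketch below (ALTER CLASS ... ADDCLUSTER is standard OrientDB SQL; the names and the target count of 12 are made up):

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.sql.OCommandSQL;

public class AddClusters {
  public static void main(String[] args) {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mobile").open("admin", "admin");
    try {
      // grow the class from 6 to 12 clusters, so each cluster file stays smaller
      for (int i = 6; i < 12; i++) {
        db.command(new OCommandSQL(
            "ALTER CLASS cdr_eno_20170122 ADDCLUSTER cdr_eno_20170122_" + i)).execute();
      }
    } finally {
      db.close();
    }
  }
}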
All suggestions are very welcome!