Closed Bankde closed 7 years ago
Hi @Bankde
First of all, @rid
can never be null, so the results are supposed to be the same.
On the SQL executor there is an optimization so that select count(*)
is calculated using a schema API, OClass.count()
, while select count(@rid)
is performed as a manual count. My guess is that you have some problems in the db (eg. wrong metadata on the class) but it's hard to say it without some debugging.
Do you think you can share a DB that has this problem? This will help a lot to find the root cause
Thanks
Luigi
We could share a DB. However the DB is 1G in size, which method of file transfer do you prefer ?
By the way, do you think we can continue using this database ? and could you suggest a way to hotfix a DB ?
Thanks
Hi @Bankde
If you could send a link with Dropbox or GoogleDrive it would be awesome. It's hard to say if the DB is corrupted and if it can be repaired, but we'll do what we can ;-)
Thanks
Luigi
Sure I will send you link when we're ready. It will take a while because database is not on my own computer.
I'm also concerned how could you access the database because we have password on it. Is there anyway to remove password from database ?
Ok we have found that one of our test database also has this issues, it is much smaller. (but still we don't know exactly what happen; thus we can't reproduce it) Could you send me your mail address or any method that I can send you a link to GoogleDrive ?
Hi @Bankde
Please send it to l.dellaquila -at- orientdb.com
Thanks
Luigi
DB, log, along with more details sent. Thank you.
Hi @Bankde
From what I see from the db, you have a problem with a cluster: the cluster metadata report the wrong size, but the data contained in it are correct.
The fix requires a little low-level modification of the cluster, there is no public API for this, so OrientDB team has to do a little change to the code to expose this functionality.
In the meantime, just for the record, did this problem happen on a single stand-alone node or are you running distributed?
Thanks
Luigi
HI, @Bankde are you absolutely sure that you did not have issues with the database before the update to 2.2.23? Before 2.2.15 versions we had a bug when some pages are only partially flushed, your issues looks exactly the same for the first glance?
It's single stand-alone node. I don't think my team ever touch any distribute feature.
You might want to see those logs too (e.g. orientdb.err). There are some of these errors:
- expression item 'userID' cannot be resolved because current record is NULL
- Error on execution of command (I have seen 2 error stacks: paginatedStorage and OSBTree)
I'm not sure if these are related issues, it would be nice for us to get more information whether data still behave correctly.
So far we have found more strange thing like this (you can try on that DB)
select * from (select expand(out()) from data where dataID='UserProfile#uniq_usr_profile_member31@webautomate.com_1503300262863') where dataID='User#kim3@webautomate.com_1504677378263'
The above one returns 1 record. However if I switch between two data, switch expand 'out' to 'in', this one returns 0 record.
select * from (select expand(in()) from data where dataID='User#kim3@webautomate.com_1504677378263') where dataID='UserProfile#uniq_usr_profile_member31@webautomate.com_1503300262863'
@laa I couldn't say for sure. However, we did play 2.2.4 for a year and never have these kind of issues until we upgrade to 2.2.23 (unless those bugs were hidden in data, I don't know) then these happen to 2 different servers within 1-2 weeks.
@laa Since it already past our working time, I will check more with production team tomorrow. Hopefully I could give you more accurate on ordb version use when these issue occur. Right now, I can confirm that at least 1 of 2 server, data is still ok (at least from user perspective) before upgrade to 2.23
@laa We have gone through logs and backups. I can assure that the data on one server (the one I sent database to you) is still ok after having already upgraded to v2.23 for almost a whole month. The chaos seems to start at 1 Sept. (orient-server.log.7 I have sent you)
Unfortunately another server is lacking log and backup so we can't confirm on that one.
@Bankde do you have a backup of the same database just before you update to 2.2.23 or just after you update to 2.2.23 but before the error happens. In other words most earlier backup with 2.2.23 and latest backup with 2.2.4? Could you send them to me to check data structure?
@laa I will try to get the one suit your needs. Do you want me to send link via orientdb mail or gmail ? If ordb mail, what's your address ?
@Bankde my email a.lomakin@orientdb.com could you send DBs there?
DBs sent. Don't hesitate to ask me if you need more information.
Any update on these issues ?
Hi, @Bankde I do not feel quite well, I will continue in the middle of this week, or earlier if things will get better. Sorry for the delay.
No problem. Thank for the update 👍 Wish you get well soon.
@Bankde could you while I am looking at your databases to do the following:
I am really curious whether is it possible to reproduce this issue again?
@Bankde as additional precaution could you also run file system checks and disk SMART tests?
Ok.
@Bankde I also noticed that you have super small disk cache size 16 megabytes, that probably causes concurrency issues in 2.2.23 . I am going to check that too on our benchmark server, but it is busy now with other tests. Once it will be free I will do it for sure.
@Bankde I am still investigating your issue though.
I'm not sure what you mean by small disk cache. Could you please point it out ?
@Bankde I mean following. In your logs I see OrientDB auto-config DISKCACHE=16MB (heap=989MB direct=524,288MB os=3,054MB)
so your disk cache size is only 16MB. We need to have enough cache to keep pages of all simultaneously running transactions. But probably in your case it may cause a problem. I will start long running tests next week to be sure. Did you replace again database by 2.2.3 backup but with 2.2.23 libraries, do you see any issues?
Already tried that for 2 days and no issue yet. I didn't expect to encounter any issues though. It took us 3 weeks running 2.2.23 before we got this issue the first time.
By the way, could you suggest a proper value of DiskCache ?
@Bankde I think that 256M would be enough for you. But let's try to reproduce issue first if it will be possible.
@Bankde BTW did you check disks?
Already check before testing, 0 error found.
@Bankde I am still investigating your DBs what I actually noticed when I opened DB first is following:
Connecting to database [plocal:./cw_database] with user 'admin'...
2017-09-19 15:58:59:345 WARNI {db=cw_database} Storage 'cw_database' was not closed properly. Will try to recover from write ahead log... [OLocalPaginatedStorage]
2017-09-19 15:58:59:353 WARNI {db=cw_database} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=2, position=93381418}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:354 WARNI {db=cw_database} Record OFuzzyCheckpointStartRecord{lsn=LSN{segment=2, position=93381425}} com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=LSN{segment=2, position=93381378}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:354 WARNI {db=cw_database} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=2, position=93381465}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:354 WARNI {db=cw_database} Record OFuzzyCheckpointStartRecord{lsn=LSN{segment=2, position=93381472}} com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=LSN{segment=2, position=93381425}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:355 WARNI {db=cw_database} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=2, position=93381512}} will be skipped during data restore [OLocalPaginatedStorage]OK
It seems like you copied database without stopping the server. Is it true? Did you perform similar actions on a database located on your server just before those problems are started? Did you experience server crash may be?
Only time I can think of doing that is when I copied the database and sent to you (I might forgot that time, deeply sorry). No transaction or any query during that time though.
Other than that, we do vm snapshot but it saves a complete state of machine (means every process including ordb will still be running if you restore) so I don't think that's a case. The issues also occur on another non-vm machine (the one that lack of backup and snapshot since we don't do any actions like copying files; thus we can't confirm that issue happens after v2.23 on that machine. However, due to similarities of errors, we believe it's the same one) so I don't think running on vm is a cause either.
Server crash, as far as I can remember there's no blackout during that period. Do you think stopping ordb a hard way could also be a cause ? (e.g. shutdown the server (via command line), or kill -9 ordb when it doesn't stop after 30min, etc) I will go check more on these.
Hi, @Bankde, your issue is very strange, all previous backups completely normal but the last one which you sent to me is broken. Just in case if that is possible, could you send me broken database but after server shutdown, because if you just copy the database on the running server it will be completely broken does not matter are there any running transactions or not, and there will be no much use of it. So no much use to inspect it. For now, I consider two possibilities:
I suppose for now we can assume that database was broken but would be really good to have the instance of the database which is copied after server shutdown.
because if you just copy the database on the running server it will be completely broken does not matter are there any running transactions or not
I think I did that to some of previous backup, good that they're still normal. Not sure of the latest one, I will see if I can still get that DBs and send to you soon.
Checked on another server that also has same issue but no backup, it has 2048 MB disk cache size. So disk cache might not be a cause.
2017-09-18 13:16:36:238 INFO OrientDB auto-config DISKCACHE=2,048MB (heap=1,963MB direct=2,048MB os=8,086MB) [orientechnologies]
Personally I go with 2nd possibility. Gathering information on that right now.
What do you mean by log of data restore ? If it's orient log, I have sent you all (that's 1-2 days before issues occur) and those are all we have.
@Bankde could you send me logs from other sever too. Also, I am waiting for DB backup in a broken state.
Database sent.
Sorry, since we couldn't do anything more with existing data on that other server, my team had decided to purge those corrupted data, continue testing on new one and we don't have any backup / snapshot on that server, there is no log left.
That latest DB is the best I could do here.
Hi, @Bankde, I have run tests which update records on 16 MB of disk cache and as I suppose you guessed tests passed without problems. Just in case I started them again, but they will run in the much longer interval of 2.5 days. I do not expect any problems though. As an additional experiment, I will create and fill data in the database under 2.2.4 version and then will run tests under 2.2.23 version. Maybe it will reveal some problems. Also, I carefully investigated your database. I have the impression that part of the files was not flushed during the close. But if you look at the log, logging is stopped once server receives SIGTERM signal. So even if some exceptions happen, we will never see them. I have fixed this problem and now even if the server receives SIGTERM signal logging proceeds. I have two questions for you:
@Bankde , good, it will be really helpful if we encounter this issue again if you perform version update. So I encourage you to do that.
@Bankde , I have run tests under the version of 2.2.24 on a database which was created under the version 2.2.3 , but as I suppose you guessed did not find any problems. Anyway if you still do not update to 2.2.28 version, I do suggest to update to version 2.2.29 because it contains fix of an issue critical for database consistency. I will leave the issue open for about a month and then will close it if you will not experience this problem anymore.
Haha, already upgraded to 2.2.28, can I proceed to 2.2.29 straight away ? btw, the changelog is still drafted. https://github.com/orientechnologies/orientdb/wiki/OrientDB-2.2-Release-Notes
I will keep you update if anything goes wrong.
Hi @Bankde
Yes, absolutely, I suggest you to upgrade to 2.2.29. Sorry for the release notes, we had a bit of trouble during the release and just forgot to close it, now it's OK
Thanks
Luigi
Hi, few question here.
I have tried "CHECK DATABASE" on corrupted database from this issue and it returns
Check of storage completed in 200ms. without errors.
Is there any other API to check database consistency ? How can I make sure that our production databases are still ok without having to rely on those count queries ? (which probably still look ok even databases have gone corrupted right ?)
Thanks
@Bankde check database
also should check database consistency if you use a distribution from an official site. The output should be like following:
Scanning 8049 edges (skipEdges=0)...
Scanning edges completed
Scanning 809 vertices...
Scanning vertices completed
Check of graph 'plocal:../databases/GratefulDeadConcerts' completed in 0 secs
scannedEdges.....: 8049
edgesToRemove....: 0
scannedVertices..: 809
scannedLinks.....: 16098
linksToRemove....: 0
verticesToRepair.: 0
Checking storage.Check of storage is started...Check of storage completed in 23ms. without errors.
Checking indexes.
Checking cluster ofunction for index OFunction.name
cluster ofunction (6) |********************| 100%
Total errors found on indexes: 0
Just realized I called check database in 2.2.23 and it only checks for storage. Try again in 2.2.29 and it now works really well.
Thank you.
Close issue because it is not reproducible anymore.
OrientDB Version: 2.23
Java Version: 1.8.0
OS: freeBSD 10.3
I have encountered strange result during "count" query. Im not sure if this is actual bug or just my misunderstanding.
Expected behavior
count * and count rid should have equal number.
Actual behavior
These queries are in 2.2.23.
I thought rid can never be null ? I even check if there is any record that rid is NULL but there is none.
Then I realized we just upgraded ordb from 2.2.4 to 2.2.23. So I went back to query in 2.2.4 if it works correctly.
Notice count(rid) result from v2.2.4 is not even equal to result from v2.2.23.
--
Steps to reproduce
I don't know how this happen.