Incorrect result from count query

Bankde commented 7 years ago

OrientDB Version: 2.23

Java Version: 1.8.0

OS: freeBSD 10.3

I have encountered strange result during "count" query. Im not sure if this is actual bug or just my misunderstanding.

Expected behavior

count * and count rid should have equal number.

Actual behavior

These queries are in 2.2.23.

select count(@rid) from transactions  -  792743
select count(*) from transactions  -  1276958

I thought rid can never be null ? I even check if there is any record that rid is NULL but there is none.

select * from transactions where @rid is NULL  -  nothing

Then I realized we just upgraded ordb from 2.2.4 to 2.2.23. So I went back to query in 2.2.4 if it works correctly.

select count(*) from transactions  -  1276958
select count(@rid) from transactions  -  810532

Notice count(rid) result from v2.2.4 is not even equal to result from v2.2.23.

--

Steps to reproduce

I don't know how this happen.

luigidellaquila commented 7 years ago

Hi @Bankde

First of all, @rid can never be null, so the results are supposed to be the same.

On the SQL executor there is an optimization so that select count(*) is calculated using a schema API, OClass.count(), while select count(@rid) is performed as a manual count. My guess is that you have some problems in the db (eg. wrong metadata on the class) but it's hard to say it without some debugging.

Do you think you can share a DB that has this problem? This will help a lot to find the root cause

Thanks

Luigi

Bankde commented 7 years ago

We could share a DB. However the DB is 1G in size, which method of file transfer do you prefer ?

By the way, do you think we can continue using this database ? and could you suggest a way to hotfix a DB ?

Thanks

luigidellaquila commented 7 years ago

Hi @Bankde

If you could send a link with Dropbox or GoogleDrive it would be awesome. It's hard to say if the DB is corrupted and if it can be repaired, but we'll do what we can ;-)

Thanks

Luigi

Bankde commented 7 years ago

Sure I will send you link when we're ready. It will take a while because database is not on my own computer.

I'm also concerned how could you access the database because we have password on it. Is there anyway to remove password from database ?

Bankde commented 7 years ago

Ok we have found that one of our test database also has this issues, it is much smaller. (but still we don't know exactly what happen; thus we can't reproduce it) Could you send me your mail address or any method that I can send you a link to GoogleDrive ?

luigidellaquila commented 7 years ago

Hi @Bankde

Please send it to l.dellaquila -at- orientdb.com

Thanks

Luigi

Bankde commented 7 years ago

DB, log, along with more details sent. Thank you.

luigidellaquila commented 7 years ago

Hi @Bankde

From what I see from the db, you have a problem with a cluster: the cluster metadata report the wrong size, but the data contained in it are correct.

The fix requires a little low-level modification of the cluster, there is no public API for this, so OrientDB team has to do a little change to the code to expose this functionality.

In the meantime, just for the record, did this problem happen on a single stand-alone node or are you running distributed?

Thanks

Luigi

andrii0lomakin commented 7 years ago

HI, @Bankde are you absolutely sure that you did not have issues with the database before the update to 2.2.23? Before 2.2.15 versions we had a bug when some pages are only partially flushed, your issues looks exactly the same for the first glance?

Bankde commented 7 years ago

It's single stand-alone node. I don't think my team ever touch any distribute feature.

You might want to see those logs too (e.g. orientdb.err). There are some of these errors:

- expression item 'userID' cannot be resolved because current record is NULL
- Error on execution of command (I have seen 2 error stacks: paginatedStorage and OSBTree)

I'm not sure if these are related issues, it would be nice for us to get more information whether data still behave correctly.

So far we have found more strange thing like this (you can try on that DB)

select * from (select expand(out()) from data where dataID='UserProfile#uniq_usr_profile_member31@webautomate.com_1503300262863') where dataID='User#kim3@webautomate.com_1504677378263'

The above one returns 1 record. However if I switch between two data, switch expand 'out' to 'in', this one returns 0 record.

select * from (select expand(in()) from data where dataID='User#kim3@webautomate.com_1504677378263') where dataID='UserProfile#uniq_usr_profile_member31@webautomate.com_1503300262863'

@laa I couldn't say for sure. However, we did play 2.2.4 for a year and never have these kind of issues until we upgrade to 2.2.23 (unless those bugs were hidden in data, I don't know) then these happen to 2 different servers within 1-2 weeks.

Bankde commented 7 years ago

@laa Since it already past our working time, I will check more with production team tomorrow. Hopefully I could give you more accurate on ordb version use when these issue occur. Right now, I can confirm that at least 1 of 2 server, data is still ok (at least from user perspective) before upgrade to 2.23

Bankde commented 7 years ago

@laa We have gone through logs and backups. I can assure that the data on one server (the one I sent database to you) is still ok after having already upgraded to v2.23 for almost a whole month. The chaos seems to start at 1 Sept. (orient-server.log.7 I have sent you)

Unfortunately another server is lacking log and backup so we can't confirm on that one.

andrii0lomakin commented 7 years ago

@Bankde do you have a backup of the same database just before you update to 2.2.23 or just after you update to 2.2.23 but before the error happens. In other words most earlier backup with 2.2.23 and latest backup with 2.2.4? Could you send them to me to check data structure?

Bankde commented 7 years ago

@laa I will try to get the one suit your needs. Do you want me to send link via orientdb mail or gmail ? If ordb mail, what's your address ?

andrii0lomakin commented 7 years ago

@Bankde my email a.lomakin@orientdb.com could you send DBs there?

Bankde commented 7 years ago

DBs sent. Don't hesitate to ask me if you need more information.

Bankde commented 7 years ago

Any update on these issues ?

andrii0lomakin commented 7 years ago

Hi, @Bankde I do not feel quite well, I will continue in the middle of this week, or earlier if things will get better. Sorry for the delay.

Bankde commented 7 years ago

No problem. Thank for the update 👍 Wish you get well soon.

andrii0lomakin commented 7 years ago

@Bankde could you while I am looking at your databases to do the following:

Take the latest 2.2.3 backup.
Replace your database with the latest 2.2.3 backup.
Run your tests on the server again.

I am really curious whether is it possible to reproduce this issue again?

andrii0lomakin commented 7 years ago

@Bankde as additional precaution could you also run file system checks and disk SMART tests?

Bankde commented 7 years ago

Ok.

andrii0lomakin commented 7 years ago

@Bankde I also noticed that you have super small disk cache size 16 megabytes, that probably causes concurrency issues in 2.2.23 . I am going to check that too on our benchmark server, but it is busy now with other tests. Once it will be free I will do it for sure.

andrii0lomakin commented 7 years ago

@Bankde I am still investigating your issue though.

Bankde commented 7 years ago

I'm not sure what you mean by small disk cache. Could you please point it out ?

andrii0lomakin commented 7 years ago

@Bankde I mean following. In your logs I see OrientDB auto-config DISKCACHE=16MB (heap=989MB direct=524,288MB os=3,054MB) so your disk cache size is only 16MB. We need to have enough cache to keep pages of all simultaneously running transactions. But probably in your case it may cause a problem. I will start long running tests next week to be sure. Did you replace again database by 2.2.3 backup but with 2.2.23 libraries, do you see any issues?

Bankde commented 7 years ago

Already tried that for 2 days and no issue yet. I didn't expect to encounter any issues though. It took us 3 weeks running 2.2.23 before we got this issue the first time.

By the way, could you suggest a proper value of DiskCache ?

andrii0lomakin commented 7 years ago

@Bankde I think that 256M would be enough for you. But let's try to reproduce issue first if it will be possible.

andrii0lomakin commented 7 years ago

@Bankde BTW did you check disks?

Bankde commented 7 years ago

Already check before testing, 0 error found.

andrii0lomakin commented 7 years ago

@Bankde I am still investigating your DBs what I actually noticed when I opened DB first is following:


Connecting to database [plocal:./cw_database] with user 'admin'...
2017-09-19 15:58:59:345 WARNI {db=cw_database} Storage 'cw_database' was not closed properly. Will try to recover from write ahead log... [OLocalPaginatedStorage]
2017-09-19 15:58:59:353 WARNI {db=cw_database} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=2, position=93381418}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:354 WARNI {db=cw_database} Record OFuzzyCheckpointStartRecord{lsn=LSN{segment=2, position=93381425}} com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=LSN{segment=2, position=93381378}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:354 WARNI {db=cw_database} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=2, position=93381465}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:354 WARNI {db=cw_database} Record OFuzzyCheckpointStartRecord{lsn=LSN{segment=2, position=93381472}} com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=LSN{segment=2, position=93381425}} will be skipped during data restore [OLocalPaginatedStorage]
2017-09-19 15:58:59:355 WARNI {db=cw_database} Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=LSN{segment=2, position=93381512}} will be skipped during data restore [OLocalPaginatedStorage]OK

It seems like you copied database without stopping the server. Is it true? Did you perform similar actions on a database located on your server just before those problems are started? Did you experience server crash may be?

Bankde commented 7 years ago

Only time I can think of doing that is when I copied the database and sent to you (I might forgot that time, deeply sorry). No transaction or any query during that time though.
Other than that, we do vm snapshot but it saves a complete state of machine (means every process including ordb will still be running if you restore) so I don't think that's a case. The issues also occur on another non-vm machine (the one that lack of backup and snapshot since we don't do any actions like copying files; thus we can't confirm that issue happens after v2.23 on that machine. However, due to similarities of errors, we believe it's the same one) so I don't think running on vm is a cause either.
Server crash, as far as I can remember there's no blackout during that period. Do you think stopping ordb a hard way could also be a cause ? (e.g. shutdown the server (via command line), or kill -9 ordb when it doesn't stop after 30min, etc) I will go check more on these.

andrii0lomakin commented 7 years ago

Hi, @Bankde, your issue is very strange, all previous backups completely normal but the last one which you sent to me is broken. Just in case if that is possible, could you send me broken database but after server shutdown, because if you just copy the database on the running server it will be completely broken does not matter are there any running transactions or not, and there will be no much use of it. So no much use to inspect it. For now, I consider two possibilities:

Very few disk cache size. I am running tests now it will take some time.
The server was killed/crashed but data restore did not go well, in such case I need to see a log of data restore.

I suppose for now we can assume that database was broken but would be really good to have the instance of the database which is copied after server shutdown.

Bankde commented 7 years ago

because if you just copy the database on the running server it will be completely broken does not matter are there any running transactions or not

I think I did that to some of previous backup, good that they're still normal. Not sure of the latest one, I will see if I can still get that DBs and send to you soon.

Checked on another server that also has same issue but no backup, it has 2048 MB disk cache size. So disk cache might not be a cause. 2017-09-18 13:16:36:238 INFO OrientDB auto-config DISKCACHE=2,048MB (heap=1,963MB direct=2,048MB os=8,086MB) [orientechnologies]

Personally I go with 2nd possibility. Gathering information on that right now.

What do you mean by log of data restore ? If it's orient log, I have sent you all (that's 1-2 days before issues occur) and those are all we have.

andrii0lomakin commented 7 years ago

@Bankde could you send me logs from other sever too. Also, I am waiting for DB backup in a broken state.

Bankde commented 7 years ago

Database sent.

Sorry, since we couldn't do anything more with existing data on that other server, my team had decided to purge those corrupted data, continue testing on new one and we don't have any backup / snapshot on that server, there is no log left.

That latest DB is the best I could do here.

andrii0lomakin commented 7 years ago

Hi, @Bankde, I have run tests which update records on 16 MB of disk cache and as I suppose you guessed tests passed without problems. Just in case I started them again, but they will run in the much longer interval of 2.5 days. I do not expect any problems though. As an additional experiment, I will create and fill data in the database under 2.2.4 version and then will run tests under 2.2.23 version. Maybe it will reveal some problems. Also, I carefully investigated your database. I have the impression that part of the files was not flushed during the close. But if you look at the log, logging is stopped once server receives SIGTERM signal. So even if some exceptions happen, we will never see them. I have fixed this problem and now even if the server receives SIGTERM signal logging proceeds. I have two questions for you:

Could you update to 2.2.28 ASAP so if you have some problems during server shutdown, we will see them?
Did you have any new problems with database once we resolved all your previous issues related to SQL and you created a new database?

Bankde commented 7 years ago

Might take a while, we normally have checklist test on new ordb version.
We haven't found any new problems yet. I have also tested on killing / crashing server and running old db from 2.2.4 in 2.2.23, all dbs still look fine.

andrii0lomakin commented 7 years ago

@Bankde , good, it will be really helpful if we encounter this issue again if you perform version update. So I encourage you to do that.

andrii0lomakin commented 7 years ago

@Bankde , I have run tests under the version of 2.2.24 on a database which was created under the version 2.2.3 , but as I suppose you guessed did not find any problems. Anyway if you still do not update to 2.2.28 version, I do suggest to update to version 2.2.29 because it contains fix of an issue critical for database consistency. I will leave the issue open for about a month and then will close it if you will not experience this problem anymore.

Bankde commented 7 years ago

Haha, already upgraded to 2.2.28, can I proceed to 2.2.29 straight away ? btw, the changelog is still drafted. https://github.com/orientechnologies/orientdb/wiki/OrientDB-2.2-Release-Notes

I will keep you update if anything goes wrong.

luigidellaquila commented 7 years ago

Hi @Bankde

Yes, absolutely, I suggest you to upgrade to 2.2.29. Sorry for the release notes, we had a bit of trouble during the release and just forgot to close it, now it's OK

Thanks

Luigi

Bankde commented 7 years ago

Hi, few question here. I have tried "CHECK DATABASE" on corrupted database from this issue and it returns Check of storage completed in 200ms. without errors. Is there any other API to check database consistency ? How can I make sure that our production databases are still ok without having to rely on those count queries ? (which probably still look ok even databases have gone corrupted right ?)

Thanks

andrii0lomakin commented 7 years ago

@Bankde check database also should check database consistency if you use a distribution from an official site. The output should be like following:

Scanning 8049 edges (skipEdges=0)...
Scanning edges completed
Scanning 809 vertices...
Scanning vertices completed
Check of graph 'plocal:../databases/GratefulDeadConcerts' completed in 0 secs
 scannedEdges.....: 8049
 edgesToRemove....: 0
 scannedVertices..: 809
 scannedLinks.....: 16098
 linksToRemove....: 0
 verticesToRepair.: 0

Checking storage.Check of storage is started...Check of storage completed in 23ms.  without errors.
Checking indexes.
Checking cluster ofunction  for index OFunction.name
cluster ofunction (6) |********************| 100%
Total errors found on indexes: 0

Bankde commented 7 years ago

Just realized I called check database in 2.2.23 and it only checks for storage. Try again in 2.2.29 and it now works really well.

Thank you.

andrii0lomakin commented 7 years ago

Close issue because it is not reproducible anymore.

orientechnologies / orientdb