Can't create indexes on some classes.

anber500 commented 7 years ago

OrientDB Version: 2.2.17

Java Version: 1.8.0_111

OS: Ubuntu 14.04.4

Expected behavior

I should be able to create or delete indexes. The following command should not generate an error.

CREATE INDEX Users.network_id ON Users (network_id) UNIQUE_HASH_INDEX

Actual behavior

When I try to create an index on my class, it generate the following error:

2017-04-06 04:39:38:398 INFO  {db=MyDB} --> OK, indexed 10,615 items in 478 ms [OIndexRebuildOutputListener]$ANSI{green {db=MyDB}} Exception during index 'Users.network_id' creation
com.orientechnologies.orient.core.exception.OStorageException: Error during local hash table creation
        DB name="MyDB"
        at com.orientechnologies.orient.core.index.hashindex.local.OLocalHashTable.create(OLocalHashTable.java:195)
        at com.orientechnologies.orient.core.index.engine.OHashTableIndexEngine.create(OHashTableIndexEngine.java:94)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.addIndexEngine(OAbstractPaginatedStorage.java:1607)
        at com.orientechnologies.orient.core.index.OIndexAbstract.create(OIndexAbstract.java:247)
        at com.orientechnologies.orient.core.index.OIndexOneValue.create(OIndexOneValue.java:129)
        at com.orientechnologies.orient.core.index.OIndexOneValue.create(OIndexOneValue.java:40)
        at com.orientechnologies.orient.core.index.OIndexManagerShared.createIndex(OIndexManagerShared.java:167)
        at com.orientechnologies.orient.core.index.OIndexManagerProxy.createIndex(OIndexManagerProxy.java:87)
        at com.orientechnologies.orient.core.metadata.schema.OClassImpl.createIndex(OClassImpl.java:1922)
        at com.orientechnologies.orient.core.sql.OCommandExecutorSQLCreateIndex.execute(OCommandExecutorSQLCreateIndex.java:277)
        at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:74)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:2624)
        at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:2570)
        at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:69)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.command(ONetworkProtocolBinary.java:1473)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:597)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:336)
        at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:200)
        at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)
Caused by: com.orientechnologies.orient.core.exception.OStorageException: File with given name already exists but has different id 185 vs. proposed 2666873368795415589
        DB name="MyDB"
        at com.orientechnologies.orient.core.storage.cache.local.OWOWCache.addFile(OWOWCache.java:488)
        at com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache.addFile(O2QCache.java:187)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperation.commitChanges(OAtomicOperation.java:415)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.endAtomicOperation(OAtomicOperationsManager.java:468)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.endAtomicOperation(OAtomicOperationsManager.java:412)
        at com.orientechnologies.orient.core.storage.impl.local.paginated.base.ODurableComponent.endAtomicOperation(ODurableComponent.java:116)
        at com.orientechnologies.orient.core.index.hashindex.local.OLocalHashTable.create(OLocalHashTable.java:189)
        ... 18 more

This only happens on my Users class. Other indexes still work fine. The index does not show up in the studio or in the console. I tried removing the index but it doesn't look like it exist. I can't create it and it keeps giving me the error.

The following also doesn't work:

CREATE INDEX Users.network_id ON Users (network_id) UNIQUE

Is there a way to get rid of this error?

andrii0lomakin commented 7 years ago

@anber500 could you send me this database ?

anber500 commented 7 years ago

I can't. It's a production DB and it's over 10 gig in size. Which is part of the problem. Without the indexes, the DB is unusable. All our tasks timeout because it takes too long to run basic queries.

andrii0lomakin commented 7 years ago

Why you can not create index like CREATE INDEX Users_network_id ON Users (network_id) UNIQUE_HASH_INDEX what stops you to do that ?

anber500 commented 7 years ago

Because that is not what the documentation recommends?

Your suggestion works as a possible workaround but doesn't really fix the problem. I've found 4 indexes in our DB that has the same issue.

If there not a way to manually delete these indexes created on disk?

If I create the indexes with a name without the period, will it still function as expected?

andrii0lomakin commented 7 years ago

Indexes can have arbitrary names it does not affect performance or stability of queries. I understand that it is the only workaround but it will fix your main problem. I want to ask quick question did you migrate from 2.1 or older version ?

anber500 commented 7 years ago

We have been keeping up with each new version as it came out. I'm not sure on which version of OrientDB the current DB was created but I'm sure it was on 2.2.x. The current DB was created about 6-9 months ago on what was the latest version at the time.

andrii0lomakin commented 7 years ago

@anber500 Could you send me "name_id_map.cm" file and list of names of indexes which have the same problem ?

andrii0lomakin commented 7 years ago

Could you also check that file with name "Users.network_id" exists on file system ?

anber500 commented 7 years ago

Sure, hope it helps. name_id_map.cm.zip

I could also not create an index on my Hashes.

CREATE INDEX Hashes.tag ON Hashes (tag) UNIQUE_HASH_INDEX

However, this worked:

CREATE INDEX Hashes_tag ON Hashes (tag) UNIQUE_HASH_INDEX

Here is the related files that exist on the disk:

Hashes_tag.hib Hashes.tag.hib Hashes_tag.him Hashes.tag.him Hashes_tag.hit Hashes.tag.hit Hashes_tag.hnb Hashes.tag.hnb Hashes.tag.nbt Hashes.tag.sbt

The Hashes.tag index did exist at some point but somehow got corrupted. However, it's invisible to the DB and can't be removed via the studio or the console.sh

andrii0lomakin commented 7 years ago

@anber500 one more questions is it correct that you have files inside file system but you can not create indexes because of exception above? And one more question what do you mean by "The Hashes.tag index did exist at some point but somehow got corrupted" how did you understand that index was corrupted? Sorry for the delay in answers but we work on several tasks at once and seems like we found workaround for you so that is not first priority issue but still very important for us.

andrii0lomakin commented 7 years ago

@anber500 looking forward to your response :-)

anber500 commented 7 years ago

I have also been pretty busy. It's panic stations over here because our entire DB got ruined because of an unrelated, or related, Indexing bug.

Your assumption is correct. We have the files in ours file system but we can't see, delete or create any index with the same name.

We had an index called "Hashes.tag". Our DB got slower and slower and after a few days of debugging, out of desperation, I decided to rebuild all the indexes. This is quite a process because it renders our DB useless for over 2 hours! It turns out that our problem was with the indexes.

While rebuilding the indexes, OrientDB started complaining about indexes that it could not create. There was about 4 of them. I noticed that in the studio that the indexes no longer existed. I tried to recreate the indexes but got the error I described.

The files exist on the drive, but neither the console nor the studio can access or remove those indexes.

andrii0lomakin commented 7 years ago

I will provide a fix for you till Wednesday, meanwhile, could you explain what happened. I mean what the "It's panic stations over here because our entire DB got ruined because of an unrelated, or related, Indexing bug" in details. We are here to help so I am looking forward to solving your problems.

anber500 commented 7 years ago

I don't know if I should log a separate issue. It's related to the Indexes but I assume that it's not related to this specific defect. There is no way to replicate the bug so I'm not sure how to continue. Please advise how we should proceed.

andrii0lomakin commented 7 years ago

@anber500 I was quite sure that it is unrelated. So I would appreciate if you create a separate issue.

andrii0lomakin commented 7 years ago

@anber500 I have a question for before we start to develop. Are you able to stop database to fix this issue ?

anber500 commented 7 years ago

I'm not sure what you're asking. Can I stop the database to fix the issue? Yes, I can stop the database to fix the issue.

Can I actually stop the database? No, it often locks up and just hangs for hours before I'm forced to do a hard shut down. I know that we should not use kill -9 but I can not have our database offline for over 2 hours.

andrii0lomakin commented 7 years ago

Hi @anber500, we discussed several approaches how to fix your issue with files. First, of the approaches create command which you will execute as SQL command to remove a file from the database. The second method is following, even right now when you start storage, the system checks a list of files present in database folder and a list of files registered inside the database itself, and if they do not match, then a list of files registered inside of database is updated. Because you need to stop database in both cases (in the first case you need to update server library), I think would be simpler for you do not update server libraries but remove files from file system, the have the same name is an index name. My point is that it is a one-time problem and because you need to stop server anyway, I suppose would be simpler and faster if you remove related files from file system. What do you think? About your next problem, that you wait several hours till a server shutdown. I believe we should fix it too. If you have the same issue next time could you after some reasonable time, 20 min I suppose will be enough, to take several thread dumps of a server using jstack command and send the results to us? Could you also describe which environment do you have? I mean: do you use SSD or HDD, what is your RAM size, what is the size of your database on disk, do you set specific settings or use default ones? Please do not hesitate to create issues if you have ones. We are here to help you.

anber500 commented 7 years ago

Removing the files would be faster I suppose.

We had 3 servers on AWS running in distributed mode. We are connecting to OrientDB via pyorient on port 2424. Our data is captured on another AWS instance and the data goes into a message queue. The message queue is handled by rabbitmq and celery.

Each message is send to the database synchronously. We process the messages "synchronously" because OrientDB had issues with concurrency when using Celery.

Each AWS instance is setup as follows:

instance type: m3.2xlarge
processors: 8
ram: 30 gig
network performance: High
os: Ubuntu 14.04.4 LTS
Java: 1.8.0_111
drives: os - 1x SSD 15 gig / data - 1xSSD 100 gig

We have made some drastic changes over the past week. Having one single DB with all our clients in one database have proven to be a terrible idea. With over 24 million edges, the database has become slow and unstable. Rebuilding our indexes takes 2-3 hours, backup takes 3 hours, queries are slow and sometimes data just disappears between the various nodes.

We have taken OrientDB out of distributed mode and have spawned separate databases for each of our clients. This makes the individual databases much smaller and easier to backup. It simplifies queries and if the indexes breaks again, only a single client will be affected (That's the theory).

Another benefit of NOT running in distributed mode is that we don't seem to run into the latency issues we had before.

We had to implement a broker to manage the DB connections from our code. Depending on the client, it will open a connection to the correct DB and server. This makes it easier for us to scale our server architecture horizontally. Still would have been nice to have everything in one DB ;)

andrii0lomakin commented 7 years ago

@anber500 I see we are working literally day and night so you will be able to run on distributed storage without issues too. What is your experience with ODB on you new architecture? Do you have any issues so far? Did you manage to remove broken index files?

anber500 commented 7 years ago

We have been trying to get distributed mode working for us for over a year now. That ship has sailed and we are no longer interested in entertaining the hope that distributed mode will work for us.

We make use of Django so we are fairly familiar with ORM. If that is related to ODB?

We are having some issues. We are getting 10X much better performance now than before. However, our servers have been randomly been running out of memory and crashing.

We have over 30 gigs of ram and our entire DB is less than 13 gigs in size so it doesn't make any sense. We have tried pushing our heap and stack to 6 gig each but our DB still crashes after a few hours.

We have not bothered removing the old indexes as we still don't know how. But at this stage it doesn't bother us because the old DB was running in distributed mode and the new DB's don't. The performance is better and even though we still get random crashes, at least our entire topology aren't affected.

andrii0lomakin commented 7 years ago

@anber500 we found a fix for your issue yesterday https://github.com/orientechnologies/orientdb/issues/7390 it will be in 2.2.20. I will provide you the date of release soon. Once you update it. I am looking forward to your new feedback because our target is to make your experience absolutely smooth. Again we will appreciate all feedback which you will provide to us. In general, if you have an issue please provide us feedback so we will fix it for you ASAP.

anber500 commented 7 years ago

Goody, looking forward to the fix.

anber500 commented 7 years ago

Our DB is falling over daily. It's really annoying. If there is a fix for the memory leak, is there a snapshot available? I have to restart the DB daily and our indexes keep screwing up. The problem now is that we have dozens of DBs and we have to test each of them before we can find a DB with some bad indexes.

andrii0lomakin commented 7 years ago

@anber500 I will send you new build in few hours. It should fix your issue.

andrii0lomakin commented 7 years ago

@anber500 there is new build https://drive.google.com/file/d/0B2oZq2xVp841T2diVGtTcmZ5OTQ/view?usp=sharing

anber500 commented 7 years ago

Thanks, I deployed it and will monitor it over the weekend. I had some weird line-ending errors on the bash scripts that I had to fix before I could deploy it on Ubuntu.

I had to remove the ^M control chars before I could deploy. We haven't had this issue before so it must have been introduced recently.

andrii0lomakin commented 7 years ago

@anber500 I suppose that is because it is Windows build. The official build should not have given problems.

anber500 commented 7 years ago

The servers have been running stable for the last few days. We have 3 server running. It's the first time in months that all three our servers have been running stable without at least one of them failing or running out of memory. The memory consumption has been stable for a few days and our server seems happy. I know it's unrelated but if our servers don't die, our indexes won't be affected so in a way the fix you mentioned works great.

andrii0lomakin commented 7 years ago

Hi @anber500 thank you very much for your patience and persistence. Looking forward to your feedback.

andrii0lomakin commented 7 years ago

Hi, @anber500 do you have any feedback about your issues.

Which issues do you still have? What is the stability of a system by now?

anber500 commented 7 years ago

Our system seems much more stable. None of our DBs have crashed in the last few days. The memory-leak fix have solved most of our issues that we've had over the last year.

We are still experiencing some odd issues with the indexes. Especially with DBs that previously crashed.

The big problem is that indexes totally fuck up when the server crashes. If you have a DB with millions of records, it doesn't only slow down the DB, it also messes with the uniqueness of records.

Let's assume I need a unique index for a record. If I add an index to enforce the uniqueness of the record and the index fails, OrientDB does not let me know about the error. Instead it keeps working and the only way I can tell that something in wrong is when the DB becomes super slow after a few thousand records. If I then try to rebuild the index, OrientDB will let me know that there are duplicate records and that I can't restore the index.

By that time, I can no longer rebuild the index because hundreds of duplicate records have been added to the DB.

This, together with other index irregularities, makes is extremely labour intensive to remove the duplicates.

andrii0lomakin commented 7 years ago

@anber500 how do you perform modification operations, do you use transactions or you work in nontransactional mode?

andrii0lomakin commented 7 years ago

@anber500 I will create a tool for you which will allow rebuilding unique indexes and will provide you a collection of RIDs of duplicated records which are not included in the index but exist in the database . Will it make your work of removing of duplications much easier for you?

anber500 commented 7 years ago

We don't make use of transactions.

andrii0lomakin commented 7 years ago

@anber500 if you use transactions you will not have problems with duplicates because in such case both index and document update will be single atomic operation. What about a tool which I described above. Do you need one?

anber500 commented 7 years ago

Thank you. Such a tool would be useful. I have my own tool that detects and remove duplicate records but I'm having issues with the indexes. Consider the following structure:

CREATE CLASS Posts_base EXTENDS V
CREATE PROPERTY Posts_base.body STRING
CREATE PROPERTY Posts_base.comment_id STRING
CREATE INDEX Posts_base.comment_id ON Posts_base (comment_id) NOTUNIQUE

CREATE CLASS Posts EXTENDS Posts_base
CREATE INDEX Posts.comment_id ON Posts (comment_id) UNIQUE_HASH_INDEX

CREATE CLASS Comments EXTENDS Posts_base
CREATE INDEX Comments.comment_id ON Comments (comment_id) UNIQUE_HASH_INDEX

If the indexes gets messed up and OrientDB doesn't let me know, we get duplicate records as I explained. Assume we have 2 records inserted into Posts with comment_id="123". If I try to rebuild the index, it will obviously fail because of the duplicates.

However, if I look for the duplicates with the following code, I won't see any records returned.

SELECT FROM (SELECT count(*) as num,comment_id FROM Posts group by comment_id) WHERE num>1

If I deleted the indexes for Posts, I will also not find the duplicate records.

What is even weirder, is that if I query Posts with with the following query, I also get no records returned:

SELECT FROM Posts WHERE comment_id='123'

If I remove the NOTUNIQUE index for Posts_base, then that query will return the duplicate records in Posts.

This is very odd behaviour and makes duplicates very hard to find. If seems like although the Posts.comment_id index doesn't exist, if I query Posts, it doesn't find the records in the available indexes or clusters. Instead, it seems like it's getting data from some phantom index which is suppose to be invalid.

It also doesn't make sense that removing the indexes on Posts_base will make queries on Posts work as expected. Something weird is going on.

anber500 commented 7 years ago

We never used transactions because we were running in distributed mode and were having massive issues with latency. We have to do a lot of read/write commands and can't do many batch operations that could validate the use of transactions. We have considered it a while ago but using transactions is not going to be a simple task because of the nature of our data.

lvca commented 7 years ago

@anber500 using a transaction to batch operations is actually a way to improve performance running distributed, because a server propagates 1 task with 10 operations inside, instead of, let's say 10 tasks.

anber500 commented 7 years ago

@laa has been amazing in helping us and don't want this index-related bug report to change into a debate whether or not we should use transaction or not. We have been down this path and transactions is not going to work for us. Please can we remain focused on the issue at hand.

andrii0lomakin commented 7 years ago

Hi, @anber500 sorry for long silence too much of other activities. But I wanted to ask you do you still need a tool which will rebuild your index and report all rids of duplicated records into the log so you may use this information to drop all duplicated records from the database ?

Could you also list all issues which you still have with ODB, so we will fix it for you?

andrii0lomakin commented 7 years ago

Hi @anber500 could you kindly provide what is your current situation with OrientDB?

anber500 commented 7 years ago

Hi, @laa I still have the issues I mentioned before. Since your memory leak fix, the DB has been very stable but we are still getting the weird issue I outlined before: comment

It's very weird and the issue only exist on the DB that had it's indexes corrupted. Databases that were created after the regular OrientDB crashes, doesn't seem to have this issue. Which makes sense because their indexes are still in tact.

Even if we remove the indexes on that DB, and then remove all the duplicates and then add the indexes back, the issue always seems to return. We get duplicate records because the indexes don't seem to work properly for that database.

andrii0lomakin commented 7 years ago

Hi @anber500 , I see. Thank you for feedback. Could you clarify, do you perform a shutdown of this database during the time when there are active connections to this database?

andrii0lomakin commented 7 years ago

I am asking because such approach can lead to duplicates in 2.2.x version. It is fixed in 3.0 though

anber500 commented 7 years ago

Nope, ideally I should never have to shut-down the database but the duplicates are caused while the server is up and running.

andrii0lomakin commented 7 years ago

@anber500 is it possible for you provide me:

The class definition for the class where you have duplicates.
What operations do you perform when you insert/update a new record in class which suffers from duplication issue.

I will try to create test and reproduce it on our side

andrii0lomakin commented 7 years ago

Hi, @anber500 is it possible to provide information which we requested?

andrii0lomakin commented 7 years ago

@anber500 could you inspect your log, do you have any other exceptions on your problematic node, except duplicated exception? How long you need to wait before this exception happen?

andrii0lomakin commented 7 years ago

@anber500 could you kindly share what is a type of property which you use in the unique index?

orientechnologies / orientdb