[Bug]: Datahike implements `clojure.lang.Counted` in linear time

phronmophobic commented 2 years ago

What version of Datahike are you using?

0.5.1504

What version of Java are you using?

openjdk version "17" 2021-09-14 OpenJDK Runtime Environment Temurin-17+35 (build 17+35) OpenJDK 64-Bit Server VM Temurin-17+35 (build 17+35, mixed mode)

What operating system are you using?

MacOS 12.3.1

What database EDN configuration are you using?

{:store {:backend :mem}} and {:store {:backend :file}}

Describe the bug

The Datahike Db implements clojure.lang.Counted in linear time. However, "A class that implements Counted promises that it is a collection that implements a constant-time count".

I use a generic data inspector that relies on counted? to display data summaries in constant time and the db's implementation of counted makes the inspector unresponsive.

> (counted? mydb)
true
> (time (count mydb))
"Elapsed time: 2927.155375 msecs"
439010

What is the expected behaviour?

I would expect the datahike db to either implement clojure.lang.Counted in constant time or not implement that interface. The count function will still work even if clojure.lang.Counted is unimplemented. count will fall back to calling seq and counting the collection in linear time.

How can the behaviour be reproduced?

It's noticeable with any decently sized db using the following code.

> (counted? mydb)
true
> (time (count mydb))
"Elapsed time: 2927.155375 msecs"
439010

jsmassa commented 2 years ago

You are correct, this requirement isn't fulfilled anymore in all cases. Sorry that it messes up your program now. I guess, we really should remove this implementation as long as we are allowing index structure without a Counted implementation.

The count function of a Datahike database directly translates to the count function of the index structure being used. Originally that used to be the persistent sorted set which implements Counted in constant time. I guess, that is why the interface is implemented in the first place. However, for the new index structure, the hitchhiker-tree, an implementation in constant time isn't possible due to its nature of having hanging operations that are only realized when the respective parts are touched again - for example when iterating over its elements. The peculiarities of this structure has caused us some pain as well, so in time we will do an evaluation on if it makes sense to keep it at all and think about different structures we can use.

For now, we are heavily working on enabling the file-backend for use with the persistent-set index and on migration tooling, so that ideally the switch will be possible without much effort.

For in-memory databases you can already use the persistent-set index by adding {:index :datahike.index/persistent-set} to the database configuration. At this stage, I cannot recommend using it though as it hadn't been touched in a long time in the current release and most of the current tests were not run for this index. As long as you are not using history functions, it should be fine to use, but there are no guarantees. Better wait until the new durability changes are in as this will also include proper testing.

phronmophobic commented 2 years ago

Sounds good. I found a workaround for my particular issue.

jsmassa commented 2 years ago

Wonderful :)

replikativ / datahike