replikativ / datahike

A fast, immutable, distributed & compositional Datalog engine for everyone.
https://datahike.io
Eclipse Public License 1.0
1.6k stars 95 forks source link

No implementation of method: :-resolve-chan of protocol: #'hitchhiker.tree.node/IAddress found for class: clojure.lang.PersistentArrayMap #285

Open livtanong opened 3 years ago

livtanong commented 3 years ago

I've been working on a toy discord bot app, and the following error has appeared twice. I will be elaborating on each instance. Each instance was preceded by some kind of OOM error, but such errors occurred before, and the app was fine afterwards. It was only after a few days of operation (and during this time, around one restart a day because of the OOM) that this error appeared. At this point, any attempt to transact gives The Error.

The Error:

Error during transaction java.lang.IllegalArgumentException: No implementation of method: :-resolve-chan of protocol: #'hitchhiker.tree.node/IAddress found for class: clojure.lang.PersistentArrayMap

The first instance

The second instance

TimoKramer commented 3 years ago

Hi @levitanong. Can you please share your current dependencies of your project? I would try some things to replicate when I get a sense of where the problem might come from.

livtanong commented 3 years ago

here you go!

org.clojars.mihaelkonjevic/konserve-pg {:mvn/version "0.1.2"}
io.replikativ/datahike {:mvn/version "0.3.3"}
lambdaisland/uri {:mvn/version "1.4.54"}
tick/tick {:mvn/version "0.4.27-alpha"}
org.suskalo/discljord {:mvn/version "1.2.0"}
cprop/cprop {:mvn/version "0.1.17"}
mount/mount {:mvn/version "0.1.16"}
com.opentable.components/otj-pg-embedded {:mvn/version "0.13.3"}
ch.gluet/couplet {:mvn/version "0.2.1"}
com.taoensso/timbre {:mvn/version "5.1.1"}
io.replikativ/datahike-jdbc {:mvn/version "0.1.2-SNAPSHOT"}
whilo commented 3 years ago

I guess reproducing the OOM error is not easy, but if it would be possible to have a minimal reproducible example that would be helpful. I think we have not dealt with OOM errors before, it is possible that there is some serialization error when writing the hitchhiker-tree in this case. OOMs are tricky because runtime behaviour after them is undefined, nonetheless we need to guard against corruption that would pop up after runtime restarts.

TimoKramer commented 3 years ago

Hi @levitanong . Can you remove the dependency on konserve-pg? That should not be needed here.

TimoKramer commented 3 years ago

Did the error first appear when you upgraded from 0.3.2 to 0.3.3?

livtanong commented 3 years ago

Ah, konserve-pg was a holdover from when I was experimenting with Crux. I'll take that out.

As for the version, I never used 0.3.2. It was 0.3.3 from the beginning.

livtanong commented 3 years ago

Ah no, correction, konserve-pg was there because I was trying to figure out a separate bug. 😅 Still, removed.

TimoKramer commented 3 years ago

So, disregarding the OOM for now I hope that the original error about the IllegalArgumentException disappears when the old konserve-dependency is removed. Is this the case?

livtanong commented 3 years ago

With that dependency removed, the error is still there, and I suspect the only way for me to "recover" is by doing the whole export-import trick. As for whether or not removing the dependency will prevent the IllegalArgumentException from happening, I suppose only time will tell. I think it would be impractical to keep this open for X amount of time, so I think I'll close this first, then reopen when it happens again :) Thank you, I really appreciate the support and effort you're putting into this.

livtanong commented 3 years ago

I regret to inform you that it's still happening 😆

2021-02-19T15:06:02.648018+00:00 app[worker.1]: 2021-02-19T15:06:02.647Z 5b18e6e1-9087-4423-8c9d-16215a1a2e00 ERROR [datahike.db:1065] - Nothing found for entity id  ELIDED {:error :entity-id/missing, :entity-id ELIDED}
2021-02-19T15:06:02.648708+00:00 app[worker.1]: 2021-02-19T15:06:02.648Z 5b18e6e1-9087-4423-8c9d-16215a1a2e00 ERROR [datahike.connector:74] - Error during transaction clojure.lang.ExceptionInfo: Nothing found for entity id ELIDED {:error :entity-id/missing, :entity-id ELIDED}
2021-02-19T16:58:38.268692+00:00 app[worker.1]: 2021-02-19T16:58:38.267Z 5b18e6e1-9087-4423-8c9d-16215a1a2e00 ERROR [datahike.connector:74] - Error during transaction java.lang.IllegalArgumentException: No implementation of method: :-resolve-chan of protocol: #'hitchhiker.tree.node/IAddress found for class: clojure.lang.PersistentArrayMap
livtanong commented 3 years ago

I'm not sure if this is related, but the heap space errors I'm getting that usually precedes IllegalArgumentException involves FressianReader. I did some searching and Konserve uses FressianReader. I have no way of knowing if it's Konserve exactly though, or if it's something else that uses Fressian.

2021-02-22T07:38:49.948896+00:00 app[worker.1]: Exception in thread "async-thread-macro-36" java.lang.OutOfMemoryError: Java heap space
2021-02-22T07:38:49.949056+00:00 app[worker.1]: at clojure.lang.PersistentVector$TransientVector.editableTail(PersistentVector.java:574)
2021-02-22T07:38:49.949086+00:00 app[worker.1]: at clojure.lang.PersistentVector$TransientVector.<init>(PersistentVector.java:534)
2021-02-22T07:38:49.949128+00:00 app[worker.1]: at clojure.lang.PersistentVector.asTransient(PersistentVector.java:139)
2021-02-22T07:38:49.949155+00:00 app[worker.1]: at clojure.lang.PersistentVector.create(PersistentVector.java:108)
2021-02-22T07:38:49.949178+00:00 app[worker.1]: at clojure.lang.LazilyPersistentVector.create(LazilyPersistentVector.java:46)
2021-02-22T07:38:49.949205+00:00 app[worker.1]: at clojure.core$vec.invokeStatic(core.clj:377)
2021-02-22T07:38:49.949236+00:00 app[worker.1]: at clojure.core$vec.invoke(core.clj:367)
2021-02-22T07:38:49.949263+00:00 app[worker.1]: at clojure.core$map$fn__5866.invoke(core.clj:2753)
2021-02-22T07:38:49.949286+00:00 app[worker.1]: at clojure.lang.LazySeq.sval(LazySeq.java:42)
2021-02-22T07:38:49.949311+00:00 app[worker.1]: at clojure.lang.LazySeq.seq(LazySeq.java:51)
2021-02-22T07:38:49.949351+00:00 app[worker.1]: at clojure.lang.RT.seq(RT.java:535)
2021-02-22T07:38:49.949371+00:00 app[worker.1]: at clojure.core$seq__5402.invokeStatic(core.clj:137)
2021-02-22T07:38:49.949420+00:00 app[worker.1]: at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24)
2021-02-22T07:38:49.949445+00:00 app[worker.1]: at clojure.core.protocols$fn__8146.invokeStatic(protocols.clj:75)
2021-02-22T07:38:49.949482+00:00 app[worker.1]: at clojure.core.protocols$fn__8146.invoke(protocols.clj:75)
2021-02-22T07:38:49.949501+00:00 app[worker.1]: at clojure.core.protocols$fn__8088$G__8083__8101.invoke(protocols.clj:13)
2021-02-22T07:38:49.949537+00:00 app[worker.1]: at clojure.core$reduce.invokeStatic(core.clj:6828)
2021-02-22T07:38:49.949555+00:00 app[worker.1]: at clojure.core$into.invokeStatic(core.clj:6895)
2021-02-22T07:38:49.949602+00:00 app[worker.1]: at clojure.core$into.invoke(core.clj:6887)
2021-02-22T07:38:49.949643+00:00 app[worker.1]: at incognito.fressian$record_reader$reify__7959.read(fressian.clj:16)
2021-02-22T07:38:49.949690+00:00 app[worker.1]: at org.fressian.FressianReader.handleStruct(FressianReader.java:695)
2021-02-22T07:38:49.949735+00:00 app[worker.1]: at org.fressian.FressianReader.read(FressianReader.java:501)
2021-02-22T07:38:49.949806+00:00 app[worker.1]: at org.fressian.FressianReader.readObject(FressianReader.java:274)
2021-02-22T07:38:49.949856+00:00 app[worker.1]: at incognito.fressian$reify__7986$fn__7987.invoke(fressian.clj:54)
2021-02-22T07:38:49.949921+00:00 app[worker.1]: at clojure.core$map$fn__5866.invoke(core.clj:2753)
2021-02-22T07:38:49.950009+00:00 app[worker.1]: at clojure.lang.LazySeq.sval(LazySeq.java:42)
2021-02-22T07:38:49.950091+00:00 app[worker.1]: at clojure.lang.LazySeq.seq(LazySeq.java:51)
2021-02-22T07:38:49.950138+00:00 app[worker.1]: at clojure.lang.ChunkedCons.chunkedNext(ChunkedCons.java:59)
2021-02-22T07:38:49.950181+00:00 app[worker.1]: at clojure.core$chunk_next.invokeStatic(core.clj:708)
2021-02-22T07:38:49.950250+00:00 app[worker.1]: at clojure.core.protocols$fn__8154.invokeStatic(protocols.clj:137)
2021-02-22T07:38:49.950319+00:00 app[worker.1]: at clojure.core.protocols$fn__8154.invoke(protocols.clj:124)
2021-02-22T07:38:49.950379+00:00 app[worker.1]: at clojure.core.protocols$fn__8114$G__8109__8123.invoke(protocols.clj:19)
TimoKramer commented 3 years ago

Hi @levitanong, I am trying to make a sense of the problem here. It seems like a dependency issue to me, at least for the IllegalArgumentException. Do you have a way to reproduce the problem? If yes, could you share some code on how to do that?

livtanong commented 3 years ago

There's no way to reproduce it reliably I'm afraid. The illegalargumentexception issue only happens once in a while in production on heroku.

whilo commented 3 years ago

@levitanong To agree what the expected behaviour should be: The JVM experiences an OOM error and the process is terminated or at least there are no further transactions executed. After an OOM error the behaviour of the JVM is undefined. It often keeps running fine after, but it can fail in a critical section. I think what can have happened is that not all index bits were written because of the OOM and then the error occurs during deserialization. I am thinking of how to make sure that Datahike is not transacting anymore after it experienced an OOM and whether there is an edge case where we still do. Is this scenario consistent with your mental model of the problem?

livtanong commented 3 years ago

I had no mental model of the root cause of the problem, (only that the two might be related) but your hypothesis sounds sensible! Sounds like it would be consistent with the fact that exporting seems to work fine, and that importing "fixes" the runtime, since I suppose indices are generated and have no need to be exported.

kordano commented 3 years ago

Could this be closed?

livtanong commented 3 years ago

I'm no longer using datahike, so it's very likely irrelevant at this point. I suppose this could simply be reopened next time someone encounters this again.

whilo commented 3 years ago

@levitanong Ok, no problem. Which feature or change would make Datahike interesting to you again in the future?

whilo commented 3 years ago

@kordano We have not done OOM tests yet. I think we need to setup a test case where we right to the database while filling up memory and then see whether we can recover after the OOM. I guess that @levitanong does not have a reproducible base case flying around though :smile: .

livtanong commented 3 years ago

@whilo Probably ClojureScript support. I'd be hoping writes would be more performant than in datascript. I've actually had to switch to a 2-store model (datascript, plain cljs maps) because it took a minute to transact 20k entries whose schema entity was particularly well connected. I'd also be interested in an async query model to allow for interop with IndexedDB. In general though, even if none of this happened I'd still be interested because everyone needs an open-sourced datomic alternative.

whilo commented 3 years ago

20k Datoms should take a few seconds max with the latest development branch (next release). What do you mean with entries?

livtanong commented 3 years ago

Ah, by that I mean data that corresponds to a row in a table. So I guess that would be an entity and all its data. It would be closer to 200k datoms in that case. I've asked around and done some exploration and it seems the index generation is really what slows things down because of all the relationships.