Support JSON serialization of persistent payload

devluencer-089 commented 9 years ago

We are using your mongo persistence plugin in a CQRS setup and so far it working smoothly. Great job!

One "complaint" though is, that PersistentRepr and its payload (events) are stored in binary format, e.g.

{
  "_id": ObjectId("54b846f5e50895665d308cc0"),
  "pid": "MyEntity-0815",
  "sn": NumberLong(2),
  "cs": [

  ],
  "dl": false,
  "pr": BinData(0, "CkgIARJErO0ABXNyABpjb20uZ2ltYi5jb21tb24uZXMuQ29uZmlybQlACa6R6c/CAgABSgAKZGVsaXZlcnlJZHhwAAAAAAAAAAEQAhotQ2hlY2tvdXQtZjk3NGU5OGMtMWNhZi00MWRjLWJhZTEtZGNhMmVkZWZhNzU3IAAwAEAAWmBha2thLnRjcDovL0NoZWNrb3V0QWN0b3JTeXN0ZW1AMTI3LjAuMC4xOjI1NTIvdXNlci9zaW5nbGV0b24vY2hlY2tvdXRUb3BpY0FnZ3JlZ2F0b3IjLTE0Nzc3MDU1NjI=")
}

I think it is a fundamental feature to be able to query the event store (during development and production) and have a JSON-representation of all events when using MongoDB.

Event Store (https://github.com/EventStore/EventStore.JVM) does not seem to be a very mature plugin, but it does provide support for custom JSON serialization. Martin Krassers Kafka Plugin too.

Any way you can provide support for JSON serialization?

devluencer-089 commented 9 years ago

PersistentRepr.sender seems to be scheduled for removal and I would not mind having a binary representation in the journal as long as the payload is JSON.

https://github.com/akka/akka/issues/16542

scullxbones commented 9 years ago

Hi and thanks for using the journal. Always great to hear about people actually using your code.

I'll check out how Martin is doing it. I had originally planned on using pure JSON serialization, but it has some significant implications for dealing with schema evolution (as far as object serialization) and long lived journals. All I control is the wrapper serialization format, not what the user uses for the payload, nor what the Akka core team requires for the core classes. I think this means that i'd have to implement my own serializers for the core persistence classes and expect that users be able to maintain backward compatible payloads.

What was the use case for the data being saved as JSON? Was it just that you want to be able to easily read the payloads in the journal? There may be other options if so.

devluencer-089 commented 9 years ago

I assume schema evolution in the context of Event Sourcing is the ability to "upcast" event objects without breaking serialization if events classes change over time. This is the very reason I use a custom JSON-payload serializer for my events. (PersistentRepr.payload)

I would argue that JSON is the most resilient way to serialize events. I could rename the event class (we use Java btw), rename all properties, add/remove properties and still be able to (de)serialize my objects. If we decide to change a String-property to another datatype, we would still be able to properly serialize. http://jackson.codehaus.org gives us complete control over how to serialize our events. The default protobuf-serializer would choke at some point. Greg Young also mentioned JSON as a good way to store events.

Our use case for using JSON is a) "to easily read the payloads in the journal" b) being able to query and datamine the journal

I am no expert when it comes to MongoDB and maybe it is just my personal notion that binary does not bode well with documents in a MongoDB.

If there are other options to achieve a) and b), please let us know. Any advice is appreciated.

jochenchrist commented 9 years ago

1

Agree, the payload should really be serialized as JSON / BSON in mongodb to query and view in any mongo ui.

This feature is elementary for any production deployment.

scullxbones commented 9 years ago

I think that directly coupling logic in an application to the data storage format of a library being used is not ideal. I've had first hand experience with this in the past, and at the time did what I could to remove that coupling. The issue is one of abstraction and encapsulation. Should the data storage format change (and the one with this library will soon, see open issue #5 ), then any reports and/or scripts that expect a certain format will break. This would cause a need to rewrite all of the tightly coupled reports against the new format should one want to move to the next version of the library. In addition, the coupling causes lock-in. For example, should your application get big enough to outgrow mongo and need something that scales better like cassandra, again you'd need to rewrite the reports to go against the new data format.

As I read it, every concern you've had so far can be addressed by the Q side of CQRS. Currently akka has the PersistentView concept which would give a replay of all events that hit the log. This adds no additional dependency surface other than Akka. Persistent Views can be used immediately with no changes to the library.

There are issues with views, that have been well covered on the akka-user list. So the second option, which would have a bit of lock in for now, but is the way forward per Roland Kuhn detailed in akka-user involves querying the journal using Akka streams. This must be supported by journal plugins, so it would really just be an early preview, not something custom i've dreamed up. This would be some additional work, but would provide a query API with which you could do whatever you needed with the payloads.

Third, i've considered in the past (and have an open ticket #11) which follows in Martin Krasser's footsteps on defining topic-based materialized (capped) collections. Given these are a user-defined thing, and not really the primary storage format, i'd feel a lot better about exposing an API to provide a custom serializer. This custom serializer would need to support a read/write interface that meshes with the particular driver that is being used, be it Casbah or ReactiveMongo. This is a 2-part API, one part that writes to a capped collection, and another that reads. The second (read) half of the API could be skipped if you just wanted to tail the capped collection.

Do any of those sound like they'll work for you?

devluencer-089 commented 9 years ago

Thank you for getting back to me on this so quickly and for your elaborated answer. I see you have put some thought into this issue.

First of all let me say, that this is a feature request and it is up to you as the plugin vendor to support it.

I am following the discussions on the user list very closely and I am looking forward to see the Q in CQRS put to more use with streams and other improvements. We are making heavy use of PersistentView in your application and I am aware of the current limitations.

On your statement about coupling application logic to the data storage format: I fail to see how this relates to my request of using JSON-serialization. Are you suggesting JSON introduces a coupling? The very opposite is the case. By using JSON as the serialization format I can decouple my domain objects from its persistent representation. There is no class or type information required to (de)serialize objects. Also, JSON is understood by every modern language. If your persistent model and your domain model diverge, at some point you will face problems. JSON is very resilient and so are protocol buffers. I would argue that JSON makes more sense for a document-based storage solution.

As far as vendor lock-in is concerned, I don't see how introducing JSON will make things worse. If I decide to switch our storage vendor (let's say to CouchDB), I will still have to do a migration, e.g. by replaying my events into the new storage. This is a problem independent from the serialization format.

If you read though the posts on CQRS-threads, you'll notice that it is perfectly fine to support (persistence) plugin-specific functionality on-top of the common journal API. Martin Krasser does it with KAFKA (publishing to user defined topics), so does EventStore (serialize in JSON). One plugin-specific thing I think is reasonable (or natural) to have for a MongoDB-plugin is to support JSON/BSON. You see a similar requests here: https://github.com/ironfish/akka-persistence-mongo/issues/88

I am aware that plugin specific extensions are a vendor lock-in, but I am more than willing to accept this.

Event streams on top of akka streams are a nice feature, but that does not solve my issue with a) and b). I simply want to open up my MongoDB client, make a query against the database and being able to see what's in there (events).

  "pr": BinData(0, "CkgIARJErO0ABXNyABpjb20uZ2ltYi5jb21tb24uZXMuQ29uZmlybQlACa6R6c/CAgABSgAKZGVsaXZlcnlJZHhwAAAAAAAAAAEQAhotQ2hlY2tvdXQtZjk3NGU5OGMtMWNhZi00MWRjLWJhZTEtZGNhMmVkZWZhNzU3IAAwAEAAWmBha2thLnRjcDovL0NoZWNrb3V0QWN0b3JTeXN0ZW1AMTI3LjAuMC4xOjI1NTIvdXNlci9zaW5nbGV0b24vY2hlY2tvdXRUb3BpY0FnZ3JlZ2F0b3IjLTE0Nzc3MDU1NjI=")

This is a big issue for us. We are not able to tell what that binary thing represents. We cannot query by event type or any other criteria. It's a black box. Implementing a persistent view is not a practicable workaround. I just want to fire up my Mongo Client and get an impression of what events have been written by PersistentActors.

Would you rethink providing a configuration hook for a custom JSON serializer or are you dismissing this idea?

devluencer-089 commented 9 years ago

Btw, your plugin information is outdated: http://akka.io/community/ I think more people would choose your plugin if the information was up to date.

devluencer-089 commented 9 years ago

To provide an example, I picture something like this:

{
  //  "_id": ObjectId("54b846f5e50895665d308cc0"),
  //  "pid": "MyEntity-0815",
  //  "sn": NumberLong(2),
  "cs": [
  ],
  "dl": false,
  "pr": {
    "processorId": "MyProcessor-123456",
    "persistenceId": "MyEntity-0815",
    "sequenceNr": 2,
    "deleted": false,
    "redeliveries": 3, //will be removed
    "getConfirms": [
      //will be removed
    ],
    "confirmable": false, //will be removed
    "confirmMessage": , //will be removed
    "confirmTarget": BinData(0, "CkgIARJEr) //will be removed
    "sender": BinData(0, "CkgIARJEr), //don't care, will probably be removed
    "redeliveries": 3,
    "payload": {
      "reserved": {
        "seatType": "FullConference",
        "quantity": "5"
        "sequenceNumber": 2
      }
    }
  }
}

The configuration should support this via: akka.contrib.persistence.mongodb.mongo.journal.json-serialization = "on" akka.contrib.persistence.mongodb.mongo.snaps.json-serialization = "off"

Sure this is a naive and incomplete solution, but you get the idea.

I already use a custom JSON-serializer for my payload btw. It's just not very useful atm.

    serializers {
      json = "com.xxx.es.serialization.JsonEventSerializer"
    }
    serialization-bindings {
      "com.xxx.es.Event" = json
    }

aisven commented 9 years ago

Given that events are persisted in MongoDB, it would be a great benefit to be able to use MongoDB interfaces and tooling to carefully observe and query, in a direct way, the events stored in the event store, including their names and attributes.

It would benefit our development right away.
It would also enable us to tailor and leverage solutions around MongoDB for production.
Note that write access via other channels than the akka-persistence-mongo plugin could be blocked by proper configuration of the MongoDB in a particular setup (regarding event store databases and collections).

adotor commented 9 years ago

+1 for JSON serialization

The binary format of the events is currently a big issue for my organization. I am just looking for a solution that allows me to query the events inside the MongoDB (e.g. with a Mongo client).

scullxbones commented 9 years ago

On your statement about coupling application logic to the data storage format: I fail to see how this relates to my request of using JSON-serialization. Are you suggesting JSON introduces a coupling?

No. I'm saying that the payload, the application content that you want to get to is wrapped in several other layers, the Akka layer, as well as the library's layer. These wrapping formats must be able to change over Akka's evolution, over this library's evolution. It's one thing to say that it would be nice to see the contents of the journal at some point in the development cycle. If on the other hand, reports are created, coupled to the format (i'm thinking of the data-mining use case now), that spells trouble when either of the wrapping layers' formats change.

As an example, for #5 in order to maintain atomicity in context of mongo's constraints each batch of data needs to be written as a single document, rather than N documents. This means that each mongo record now has a subdocument for each eventsourced event in that particular batch. If someone had written a report around the old format of one eventsourced event per mongo document, it would be quite broken by that change.

As far as vendor lock-in is concerned, I don't see how introducing JSON will make things worse. If I decide to switch our storage vendor (let's say to CouchDB), I will still have to do a migration, e.g. by replaying my events into the new storage. This is a problem independent from the serialization format.

I agree, but it's at least limited to data migration, which I agree is unavoidable. The lock-in I was referring to was written code, functionality that must have a parallel in the system being migrated to.

Event streams on top of akka streams are a nice feature, but that does not solve my issue with a) and b). I simply want to open up my MongoDB client, make a query against the database and being able to see what's in there (events).

I think they would solve both. If the streams are written to a mongo collection in a format controlled by the application, then it is totally in the application's control what they look like and are fully queryable by a mongo client. There's no danger of this library making a change that causes users to be unable to query their view of the journal, as long as the library respects the event stream API.

Would you rethink providing a configuration hook for a custom JSON serializer or are you dismissing this idea?

I haven't dismissed it, and from reading this thread, there's clearly a few people looking for the functionality. I haven't seen this level of activity on this project before. I'm just trying to talk it through, to see if there's another way to fulfill the use case. I see this as a slippery slope, and want to make sure there's no alternative.

aisven commented 9 years ago

It's one thing to say that it would be nice to see the contents of the journal at some point in the development cycle.

If on the other hand, reports are created, coupled to the format (i'm thinking of the data-mining use case now), that spells trouble when either of the wrapping layers' formats change.

People are certainly not interested in functional report generation, data mining or big data analysis via a direct access to the events in the event store, which is where we all agree in this thread I think. In CQRS+ES we always do such things on the read side, e.g. via event stream consumers and/or read models with projections.

However, people do see those technical use cases: in development, but also in operations regarding test systems and production (issue analysis, monitoring with functional aspects, analysis of event structure evolution over time...).

scullxbones commented 9 years ago

Ok. It's "on the list" - I should be able to get to it when I start versioning the documents, which I'm working on now.

jochenchrist commented 9 years ago

Still interested. Could you make any progress?

Also have a look at the discussion at https://github.com/ironfish/akka-persistence-mongo/issues/88

scullxbones commented 9 years ago

I've made some progress, but have stalled due to juggling two side projects. Let me see if I can scrape some time together this weekend.

alari commented 9 years ago

I've made a damn stupid solution.

The next thing I'm going to make is bson/play-json/case class serialization helper trait for persistent actors. Then I'll be able to store command objects in persistent log.

I think this is the case: get commands from end-users, validate them, and then store as-is.

toggm commented 9 years ago

I agree with @scullxbones that the journal event stored shouldn't be used for querying and data analysis directly. Using a persistentview isn't flexible enough due to the need to have a persistentview per aggregateroot and all aggregateroots writes into the same journal event store. I've found the following blog post which couple CQRS events logs written to cassandra with Apache Spark (http://www.cakesolutions.net/teamblogs/using-spark-to-analyse-akka-persistence-events-in-cassandra) which may resolve the needs to querying the event log.

On the other hand json based serialization would be great for simple debugging purposes. Today it's very hard to trace back event based errors. So I support this feature request as well.

alari commented 9 years ago

@toggm querying event logs is awful idea. But being able to read your events, or to simplify extending of event classes, is useful.

Now I generate a read model from my persistent actors as a (normalized) collection per domain. So events could be used to generate another read model, but not queried directly.

BTW I already use my PR in production.

scullxbones commented 9 years ago

Fixed with @alari 's pull request. Thanks @alari !

scullxbones / akka-persistence-mongo

Support JSON serialization of persistent payload #16