nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

Support for log compaction? #373

Open tchap opened 7 years ago

tchap commented 7 years ago

Hi,

I have stumbled across Apache Kafka's log compaction, which is really handy for Event Sourcing. I am wondering whether something like that will ever be possible with STAN since there is no notion of message key...

Cheers!

kozlovic commented 7 years ago

This starts to come up more and more, and we will be looking into it for sure! As you mentioned, NATS Streaming does not have a concept of a message key, and it would also need to be able to handle sequence gaps in a message log, which it does not at the moment. Keeping this open and adding the Enhancement Request label.

buyology commented 7 years ago

This would be very good to have, the primary use case being to control log/stream size growth when persisting data for the longer term, and to avoid reading (all too many) outdated records during replay.

Another thing that would be nice to have is the ability to delete a record by producing a message with the same key and an empty payload (as in Kafka), or alternatively with a flag that marks it for deletion.

tylertreat commented 7 years ago

One key use case Kafka has for compaction is storage of consumer offsets. In the old days, you had to store offsets in ZooKeeper. Today, they can be stored directly in Kafka, and the way Kafka does this internally is simply to store them in a topic and treat them like any other message. Key compaction means only the latest offset is stored for each consumer.

This goes into internal details of NATS Streaming, but I would love to be able to do something similar rather than having the server track all of this state for subscriptions. Not only would it remove a lot of server-side machinery, but if we require clients to track offsets, we can have client libraries periodically checkpoint their positions, which I think would improve performance quite a bit.
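For illustration only, here is a rough sketch of what that client-side checkpointing could look like with the stan.go client, assuming a hypothetical compacted channel that keeps only the newest checkpoint per client; the subject name, client ID, and interval below are made up:

```go
package main

import (
	"encoding/binary"
	"log"
	"time"

	"github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("test-cluster", "reader-1")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// lastSeq would be updated by the consumer as it processes messages
	// (the consuming side is omitted in this sketch).
	var lastSeq uint64

	// Periodically publish the latest processed sequence to a per-client
	// checkpoint subject; a compacted log would only need to retain the
	// newest of these messages for each client.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		buf := make([]byte, 8)
		binary.BigEndian.PutUint64(buf, lastSeq)
		if err := sc.Publish("checkpoints.reader-1", buf); err != nil {
			log.Println("checkpoint publish failed:", err)
		}
	}
}
```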

larskluge commented 6 years ago

Log compaction is by far my most desired feature. I've been using NATS Streaming in production for roughly a year now, and for our use case it works so much more beautifully than Kafka. Unfortunately, without log compaction the work spent on extra services that essentially just do "external log compaction" is quite significant. Wondering if there is any progress on this? Thank you!

savaki commented 6 years ago

We ran into this problem as well. What we're working on as an interim solution is to have a NATS consumer off to the side generate a key/value store. Periodically, the consumer takes a snapshot in Avro format and publishes it to a known location (S3). Anyone who wants to replay the compacted topic is required to first consume the most recent Avro file (which includes the sequence) and then connect to stan using the sequence number provided. If there's interest, I'd be happy to share this once it gets into a reasonable state.
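In case it helps the discussion, a very rough sketch of that side compactor idea using the stan.go client; the channel name, cluster ID, and key-extraction logic are placeholders, and the periodic Avro/S3 snapshot step is left out:

```go
package main

import (
	"log"
	"sync"

	"github.com/nats-io/stan.go"
)

// compactor keeps only the latest payload per key, plus the last sequence
// seen, so a snapshot can also record where replay should resume.
type compactor struct {
	mu      sync.Mutex
	latest  map[string][]byte
	lastSeq uint64
}

// keyFor is a placeholder: how a key is derived from a message (a header,
// an envelope field, part of the payload) is application-specific.
func keyFor(m *stan.Msg) string {
	return "" // replace with real key extraction from m.Data
}

func main() {
	sc, err := stan.Connect("test-cluster", "compactor-1")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	c := &compactor{latest: make(map[string][]byte)}

	// Replay the channel from the beginning and fold it into the map,
	// keeping only the newest message per key.
	_, err = sc.Subscribe("events", func(m *stan.Msg) {
		c.mu.Lock()
		c.latest[keyFor(m)] = m.Data
		c.lastSeq = m.Sequence
		c.mu.Unlock()
	}, stan.DeliverAllAvailable())
	if err != nil {
		log.Fatal(err)
	}

	// A real service would periodically snapshot c.latest and c.lastSeq
	// (e.g. as the Avro file mentioned above) instead of blocking here.
	select {}
}
```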

clintberry commented 4 years ago

@savaki - I know you posted this a long time ago, but I would definitely be interested in this if you still are willing to share.

nalinpai commented 2 years ago

@tylertreat I came across https://bravenewgeek.com/tag/building-a-distributed-log-from-scratch/ where you talk about carrying the message key as metadata in an envelope. Is this something that's already supported?

ealeykin commented 1 month ago

Let's say you put some identity into the subject name, e.g. my.event.1234, where 1234 is the key to use for 'compaction'. Then you can configure a stream with MaxMsgsPerSubject=1 and DiscardPolicy=DiscardOld.

With such a configuration you will keep only the latest message per subject.
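For reference, a minimal sketch of that configuration with the nats.go JetStream API; the stream name and subject prefix here are just examples:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Keep only the most recent message for each subject, e.g. my.event.1234.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:              "EVENTS", // example stream name
		Subjects:          []string{"my.event.>"},
		MaxMsgsPerSubject: 1,               // retain one message per subject (per key)
		Discard:           nats.DiscardOld, // discard older messages when limits are hit
	})
	if err != nil {
		log.Fatal(err)
	}
}
```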