thenativeweb / node-cqrs-domain

Node-cqrs-domain is a node.js module based on nodeEventStore. It can be very useful as a domain component if you work with (d)ddd, cqrs, eventdenormalizer, host, etc.
http://cqrs.js.org/pages/domain.html
MIT License

CQRS Question about bulk import #134

Closed blissi closed 5 years ago

blissi commented 5 years ago

Hello @adrai ,

sorry that this is not a completely specific question about node-cqrs-domain but partly a broader question about CQRS design... I think you are very experienced with CQRS, and I have a question about how to model something in CQRS that I have been thinking about the whole day without coming to a conclusion yet.

My problem is this: The user should have the possibility to import a CSV that contains "shipping numbers". To prevent duplicate imports, he must not be allowed to import a shipping twice. Later on, the shippings that are imported will be enhanced with tracking data from the carrier.

Problem 1: My first naive attempt was to have one aggregate per shipping. That is, if the CSV contains 1000 shippings, then 1000 "importShipping" commands would be issued. But then, how can I prevent duplicate shipping numbers? Would you place this in a pre-condition? If so, would the pre-condition query the events collection or the aggregates collection?

Problem 2: Another approach would be to have one aggregate that models the list of all imported shippings. That is really easy to check in JavaScript code: simply look in the aggregate's list of existing shippings for one with the new shipping number and reject it. The big problem here: we use MongoDB and we will have customers with lots of shippings, so MongoDB's 16 MB per-document limit would be exceeded. From the documentation I didn't see a possibility to "split" an aggregate across multiple documents, right? This would also have concurrency issues, right?

Another approach I was thinking of was to allow multiple aggregates with the same shipping number and do the de-duplication in the read model. Would that be a viable approach in your opinion?

Many thanks in advance, Steven

adrai commented 5 years ago

Hi Steven, all of these approaches can "work".

About the proposal of having only one aggregate: you could have one import command (containing all shippings) and then generate various shipping events. MongoDB will then store only the individual events, and the aggregate's business rules only need to verify everything in memory (no complete shipping list is persisted). To optimize the loading of past shipping events from the eventstore you can make use of the snapshot mechanism (by snapshotting a list of shipping numbers only); that would speed up the loading when handling a new import command.
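A rough sketch of what that could look like with node-cqrs-domain. The command, event and property names (importShippings, shippingImported, shippingNumbers) are made up for illustration, and each module.exports would live in its own file inside the aggregate folder:

```js
// Hypothetical command handler: one import command fans out into one event per shipping.
module.exports = require('cqrs-domain').defineCommand({
  name: 'importShippings'
}, function (data, aggregate) {
  var known = aggregate.get('shippingNumbers') || [];
  var seen = {};
  data.shippings.forEach(function (shipping) {
    if (known.indexOf(shipping.shippingNumber) >= 0) return; // already imported earlier
    if (seen[shipping.shippingNumber]) return;               // duplicate inside this CSV
    seen[shipping.shippingNumber] = true;
    aggregate.apply('shippingImported', shipping);
  });
});

// Hypothetical event handler: keep only the list of shipping numbers in the state,
// so the in-memory state (and any snapshot of it) stays small.
module.exports = require('cqrs-domain').defineEvent({
  name: 'shippingImported'
}, function (data, aggregate) {
  var known = aggregate.get('shippingNumbers') || [];
  known.push(data.shippingNumber);
  aggregate.set('shippingNumbers', known);
});

// Snapshot need: take a snapshot once enough events have piled up,
// so loading the aggregate for the next import command stays fast.
module.exports = require('cqrs-domain').defineSnapshotNeed(function (loadingTime, events, aggregateData) {
  return events.length >= 200;
});
```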

About the proposal of solving this outside of the domain (in the read model): if your system can live with a minimal risk of duplication (through concurrency), you can solve this outside of the domain. If you want to eliminate that risk as well, you can create a saga that checks, after the domain has processed the import, whether there are any duplicates, and fixes them by sending corrective commands.
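A very rough sketch, assuming the companion cqrs-saga module, of what such a compensating saga could look like. The event and command names (shippingImported, discardShipping) and the findDuplicate lookup are purely illustrative; the exact event and command structure depends on how it is defined in your domain:

```js
// Hypothetical compensating saga: react to every shippingImported event, look the
// shipping number up in a read-side collection and, if a duplicate slipped through
// because of concurrency, send a corrective command for the later aggregate.
module.exports = require('cqrs-saga').defineSaga({
  name: 'shippingImported'
}, function (evt, saga, callback) {
  // findDuplicate is a placeholder for your own read-model lookup
  findDuplicate(evt.payload.shippingNumber, evt.aggregate.id, function (err, duplicateId) {
    if (err) return callback(err);
    if (!duplicateId) return saga.commit(callback); // nothing to fix

    saga.addCommandToSend({
      name: 'discardShipping',                      // hypothetical fixing command
      aggregate: { name: 'shipping', id: duplicateId },
      payload: { reason: 'duplicate shipping number' }
    });
    saga.commit(callback);
  });
});
```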

And finally always remember: "You don't have to solve everything with CQRS." ;-)

I hope this helps a bit, Adriano

nanov commented 5 years ago

Hi,

you have some misconceptions about how ES (Event Sourcing) works. I suggest you dive a little deeper into the theory of Event Sourcing; there is plenty of information readily available on the web, presented in many different ways.

A few things to note:

  1. There is no collection of aggregates. Each aggregate has its own event stream, and this stream is replayed every time a new command arrives in order to build the current state of the aggregate.

  2. The state of the aggregate is modeled inside the event handlers; there you can call aggregate.set and affect the state. Again, those handlers are applied in their respective order each time a new command arrives in order to build the current state of the aggregate (see the sketch after this list).

  3. Pre-conditions do not query anything (in fact, no queries are done on the write side, i.e. the domain). Pre-conditions are executed after the state has been rebuilt; in them you can perform checks against the state of the aggregate (having the command and state data) and reject the command if some condition is not satisfied.

  4. Each event is stored in a separate record (in MongoDB's case, a separate document per event), which consists of the stream id (aggregate id), some generic event-store data, plus the specific event's data (payload, metadata etc.) and only that (no state information whatsoever). This means that MongoDB's (or any other DB's) document size limitations do not apply to the aggregate, but only at the event level (very unlikely to present a problem, and if it does, it is a sign of bad design).
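To make points 2 and 3 concrete, a minimal sketch in node-cqrs-domain terms (the names shippingImported/importShipping and the imported flag are illustrative; each handler would live in its own file inside the aggregate folder):

```js
// Event handler (point 2): the only place where the aggregate state is shaped.
module.exports = require('cqrs-domain').defineEvent({
  name: 'shippingImported'
}, function (data, aggregate) {
  aggregate.set('shippingNumber', data.shippingNumber);
  aggregate.set('imported', true);
});

// Pre-condition (point 3): runs after the state has been rebuilt from the event
// stream and only inspects that state plus the command data; it never queries a DB.
module.exports = require('cqrs-domain').definePreCondition({
  name: 'importShipping',
  description: 'shipping must not already be imported'
}, function (data, aggregate, callback) {
  if (aggregate.get('imported')) {
    return callback(new Error('shipping already imported'));
  }
  callback(null);
});
```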

With that being said, your specific question, as I understand it, has nothing to do with "bulk import", as the same logic applies to each individual shipping no matter whether it arrives through a bulk or a single operation. Whether you should model your domain with a shippings (many) or a shipping (one) aggregate is hard to say; it depends on what business logic you will perform with those afterwards. In both scenarios you can denormalize them into separate read models, so that shouldn't be a consideration.

I personally wouldn't choose the one-aggregate-for-all approach, as you would probably need to maintain some other state for each shipping (position, status, delivered, for example), and with such an approach this would be much more complicated and less performant, not to mention the scaling and distribution limitations.

nanov commented 5 years ago

@adrai : we were responding concurrently, apparently it took me more than 10min to write my response. :)

Another approach that could work, using an aggregate per shipment, would be to do this check inside a business rule. As business rules can be asynchronous, you can maintain an external collection with all shipping numbers and validate whether one is already present using Mongo's atomic operators (update, upsert), and then, in case it was already there, reject the command with a BusinessRuleError. You should be careful to check this only for the create/import command. This way you don't have to tie your shipping (aggregate) ID to the shipping number. One thing to take into consideration is that, in theory, there is no way to ensure that the first of two shipments with the same number will be accepted and the second refused; it could go the other way around.
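A rough sketch of that idea, assuming the official mongodb Node.js driver and a shippingNumbers collection handle that is set up elsewhere (all names, including the uniqueShippingNumber rule and the importShipping command, are illustrative):

```js
// Hypothetical asynchronous business rule: reserve the shipping number in a separate
// collection with an atomic upsert and reject the command if it was already taken.
module.exports = require('cqrs-domain').defineBusinessRule({
  name: 'uniqueShippingNumber',
  description: 'shipping number must not have been imported before'
}, function (changed, previous, events, command, callback) {
  // only guard the create/import command (assuming the default command structure,
  // where the command name is available as command.name)
  if (command.name !== 'importShipping') return callback(null);

  shippingNumbers.updateOne(
    { _id: changed.get('shippingNumber') },
    { $setOnInsert: { importedAt: new Date() } },
    { upsert: true }
  ).then(function (result) {
    if (result.upsertedCount === 0) {
      // someone already reserved this shipping number
      return callback(new Error('duplicate shipping number'));
    }
    callback(null);
  }).catch(callback);
});
```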

If there is a need to reject the whole import in case any of the shipments fails, then this should be handled with a saga. If that is not the case, an import could simply consist of a service that fires a create/import command for each shipment; load balancing can be handled either on the message-bus level or on the sending side.
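A minimal sketch of such a "one command per shipment" service, assuming a configured node-cqrs-domain instance (domain.handle) and an illustrative command structure and payload shape:

```js
var uuid = require('uuid');

// Fires one importShipping command per CSV row and collects any rejections.
function importCsv(domain, shippings, done) {
  var remaining = shippings.length;
  var errors = [];
  if (!remaining) return done(null);

  shippings.forEach(function (shipping) {
    domain.handle({
      id: uuid.v4(),
      name: 'importShipping',        // illustrative command name
      aggregate: { name: 'shipping' },
      payload: shipping
    }, function (err) {
      if (err) errors.push({ shipping: shipping, error: err });
      if (--remaining === 0) done(errors.length ? errors : null);
    });
  });
}
```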

blissi commented 5 years ago

@adrai / @nanov Thanks a lot for your thorough explanations! I will try the following approach now: add an additional collection with the shipping number as a unique key. There will be one import command for each shipping, and in the business rule I will add the shipping number to this collection. If it is already there, the business rule will fail because of the duplicate key.

That saves me from the problem that the list of shipping numbers could get too large, so that the aggregate document in the snapshots collection couldn't be saved because of the 16 MB limit. Plus, I don't need to fetch the whole list from the DB server to evaluate whether the shipping number is allowed.
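For reference, a small sketch of that setup with the official mongodb driver (3.x-style API; all names are illustrative): a unique index on the shipping number makes the database enforce uniqueness, and the duplicate-key error (code 11000) is what the business rule would turn into a rejection.

```js
var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017', function (err, client) {
  if (err) throw err;
  var shippingNumbers = client.db('app').collection('shippingNumbers');

  // the unique index makes MongoDB reject a second document with the same number
  shippingNumbers.createIndex({ shippingNumber: 1 }, { unique: true }, function (err) {
    if (err) throw err;

    // inside the business rule, a plain insert is enough:
    shippingNumbers.insertOne({ shippingNumber: 'ABC-123' }, function (err) {
      if (err && err.code === 11000) {
        // duplicate key -> reject the import command
        return console.log('shipping number already imported');
      }
      if (err) throw err;
      console.log('shipping number reserved');
    });
  });
});
```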

adrai commented 5 years ago

sounds good