Actor query when processing apply() returns intermediate state

bwehrle commented 5 years ago

I have a OrderEntity that when constructed represents an order without any items. It then has an apply(CreatedEvent) which loads the items. Afterwards, the client checks to see what's in the order. In some cases, the client query sees the entity with an empty order, which according to how stowage should work, should not be the case.

Repro:

Remove all local vlingo JARs
Clone https://github.com/bwehrle/vlingo-examples.git
git checkout feature/event-sourced-example
mvn clean test -pl vlingo-ecommerce
Expect: all pass
Observed: From time to time OrderResourceShould.createdOrderContainsProduct_inParallel fails.

This same test cases, when run from IntelliJ fails almost every time. It is checking that each created order has the products that were put in with the created event when run in multiple threads. I was able to make this fail in a single thread as well but it happened less frequently or only from IntelliJ.

I have:

Reviewed how the journal and EntryAdapters are configured
Compared my Bootstrap code with the examples
Compared how my Resource is doing the creation with how its implemented in the ForumEntity etc.

VaughnVernon commented 5 years ago

@bwehrle Thanks for reporting. A quick look at your test indicates that you are creating multiple Orders all with the same id. That would seem to have a big part in this. Also, I am not certain what Completes<Void> is meant to accomplish, because the protocol could be either void or Completes<OrderInfo>. If the latter you could use the apply(..., andThen) parameter as follows, and your client would see the correct value:

apply(OrderEvents.Created.with(state.orderId, userId, quantityByProduct), () -> state.createOrderInfo());

bwehrle commented 5 years ago

Hi @VaughnVernon , The Void was just a way to let the client wait on the application of the event or to know when it was done. It is (or should be) superfluous. The OrderResource class is using the following code to generate the Id of the order.

I printed out the Ids and they are all distinct. The UserId is the same; that addition generates a new State when the event OrderEvents.Created is applied. The new State takes the existing id, and includes the UserId and the order items. I don't see the issue you are describing.

I agree I could do as you say (returning the state as a Completable), but since you said that stowage of messages was applied, this would seem to be a work-around, leaving the issue itself.

  final Address  orderAddress = addressFactory.uniquePrefixedWith("order-");
        ...
  Order orderActor = stage.actorFor(
                Order.class,
                Definition.has(OrderEntity.class,
                        Definition.parameters(orderAddress.idString())),
                orderAddress);

VaughnVernon commented 5 years ago

@bwehrle Thanks for your patience. I do see a failure very reliably at every fifth test run (!?!), but I don't think the cause is what you expected me to see. It's really strange that it doesn't exhibit the failure on any other run. That's just crazy :)

Where I see the consistent failure is in the task count assertion:

Assert.assertEquals(TASK_COUNT, fList.stream().filter(OrderResourceShould::getOrHandleWithFalse).count());

This seems like a very unreliable assertion, asking the JDK's ExecutorService to acknowledge an absolute count of running tasks at the time your loop finishes. This is testing that your loop ran 10 times, but it is safe to assume that it did. The assertion failure you are seeing may be masking this actual cause.

One possibility is that your ExecutorService assertion on count() could literally be succeeding, but the ordering of execution that the CPU chooses could prevent your thread from seeing true because your thread's view of that data has not yet been refreshed (flushed) from CPU cache to main memory. I have seen this problem exhibited a lot in test code when there is not a fence around the state in question. Your thread should not be accessing the state until it has a lock on the data that you and the other threads agree on. This can be accomplished with a synchronized (lock) { access resource } around the data in questions, but everyone must use it. For example, make your getOrHandleWithFalse() increment a count that is surrounded by a mutual lock (as above) and your collection of data and then your assertion on that data must be ordered around that. Otherwise you are going to see inconsistent CPU-caused memory updates like this in many tests.

Note that this is not the case when using Completes<T> because the value from the one thread is being written into memory AND delivered directly to a registered Completes<T> function, such as andThen(value). In other words, the andThen(value) is now running on the thread that delivered the completed value. In vlingo-actors this is one of the CompletesEventually actors that are available in the pool.

As a followup I worked on your test code for a while and did made changes that will help you to see the potential for concurrent data access among multiple threads that won't see the same data in the sequence that appears to be reliable. It's mind boggling until your understand that modern CPUs execute instructions out of order from the actual code. It's part of CPU optimizations, which doesn't make multi-threading any easier.

So in your future tests try what I have said and demonstrated in your code. I am almost certain your problem will go away, as they did with my modifications.

bwehrle commented 5 years ago

Hi @VaughnVernon , What is weird for me is that I could reproduce this problem using a single thread when running my test from time to time. The single threaded case was very basic - so I made it run more frequently to get more executions and hopefully reproduce the issue more frequently.

My test seems to have a classic case of unsynchronized data access. Thanks for pointing out the issue!

VaughnVernon commented 5 years ago

Even with one test thread there is at least one other thread at play to deliver the current message to the Actor. The access of data touched by both threads must be synchronized or enough time must pass for the reading thread to see updates in main memory. Again, these two statements can happen out of order:

atomicValue.set(value);
until.happened();

Per the Java threading model, Java running on modern Intel cores the until.happened() can clear the CountDownLatch before the atomicValue is actually set. You may read the above sequence of code statements and say "there is no way that the until can clear before the atomicValue is set" and you'd be wrong :) That's why you need the following:

synchronized (lock) {
  atomicValue.set(value);
  until.happened();
}

and:

synchronized (lock) {
  return atomicValue.get();
}

Of course the atomicValue being set inside is already volatile but the problem is sequence of execution, not data on the bus not reaching main memory yet. The atomicValue will absolutely be set in main memory as soon as the set instruction is executed. It's just that you can't predict when it will be executed, either before or after until.happened(), and you need the synchronized (lock) around both threads' access to the atomicValue to force Java to ensure that the effects of out-of-order execution aren't seen by reading thread.

See the following. In fact the Wikipedia article is clearer on the subject of sequence of execution.

https://dzone.com/articles/memory-barriersfences https://en.wikipedia.org/wiki/Java_memory_model

"The major caveat of this is that as-if-serial semantics do not prevent different threads from having different views of the data."

HTH.

VaughnVernon commented 5 years ago

@bwehrle I want to mark this closed. There are two improvements: (1) Completes<T> has been fixed to reactive to errors with latent error handler registry inside BasicCompletes<T>. (2) The addition of AccessSafely supports thread-safe access of in-test state. It would help if you could confirm this.

VaughnVernon commented 5 years ago

I have confirmed fixes and since @bwehrle has not yet confirmed I am closing.

bwehrle commented 5 years ago

Hi @VaughnVernon, I re-ran these tests inside my example using 0.8.3 and could no longer repro the issue. Thanks for fixing this issue!

vlingo / xoom-actors

Actor query when processing apply() returns intermediate state #40