quickfix-j / quickfixj

QuickFIX/J is a full featured messaging engine for the FIX protocol. - This is the official project repository.
http://www.quickfixj.org
Other
961 stars 616 forks source link

Seems to be a bug in sequence number saving in JDBC #357

Open nolag opened 3 years ago

nolag commented 3 years ago

Describe the bug A clear and concise description of what the bug is.

A lock is taken by the session on either sending or receiving, but an update in the JDBC store updates both sequence numbers.

To Reproduce

I can't make the race condition happen, so it's possible I'm missing something. In theory, stop the store from saving in sending and then receive a message. Once the receive stores sequence numbers, the send should revert the receive's number

Note: you can reverse send/receive here.

Expected behavior

Neither rolls back the other.

chrjohn commented 3 years ago

Hi @nolag , thanks for the report. Am a little swamped in work at the moment. Are you able to provide a PR or specify the exact position in the code to ease things a little for me? Thanks, Chris.

nolag commented 3 years ago

Sorry for the delayed response. JDBCStore function storeSequenceNumbers line 278-282 (copied below [1]) update both sequence numbers, but the lock taken in Session are only for one of the sequence numbers (on line 1625, 2169, 2578). If the first lock is taken, and updating, then the second runs faster the first would overwrite the value of the second.

[1] update = connection.prepareStatement(SQL_UPDATE_SEQNUMS); update.setInt(1, cache.getNextTargetMsgSeqNum()); update.setInt(2, cache.getNextSenderMsgSeqNum()); setSessionIdParameters(update, 3); update.execute();

QwertGold commented 3 years ago

I must agree with @nolag, it looks suspect, and I think there is a bug.

The locks are there to prevent concurrent modification. This applies to the increment of sequence numbers, but also the memory state that is stored inside the MemoryStore which JdbcStore uses as a cache.
After either sequence number is incremented in memory, both are read, and written to the DB. In most cases, the sender sequence number is incremented by an application thread, and target sequence number is incremented by QFJ Message Processor. Unless there is some other memory synchronization mechanism, we have a potential multithreading bug. QFJ Message Processor is only guaranteed to read the correct target sequence number. If the sender sequence number was modified since it last read it, there is no guarantee it will see that change, and I believe it is theoretically possible that it could write an old value to the DB.

Assuming there is a bug, why don't we see failures more often? When you receive a message you check if it has the expected sequence number. In most cases this is done by QFJ Message Processor, and since it is the same thread that increment target sequence number it will always read the correct value. So even if another thread occasionally stored a wrong value in the DB, that would only have an effect in case of a crash, and even if that occurred, it would just mean some extra PossDup when you reconnect. If the QFJ Message Processor writes an old sender sequence number to the DB, this would be corrected automatically when the next message is sent. There would only be an issue if the very last sender sequence number was overwritten, and you reconnected to the same session with a sequence number which was too low.

Sadly there is no way to provoke this kind of race condition, as it depends on a combination of CPU, OS and JDK. I have seen incorrect concurrent code that worked flawlessly for 10+ years on 100+ installations, and suddenly one client with a new combination of hardware/os/jdk encountered an infinite loop in a un-synchronized HashMap.get().

QwertGold commented 3 years ago

I have change my mind, I don't think there is a bug.

There was a change to the memory model back in Java 5. Synchronization actions, like volatile read/write, lock/unlock of monitors will effectively cause a flush of all cached modification (memory barrier). This means that it is now possible to write correct multithreaded programs using a single volatile field to control flushing before the next thread operates on a data structure. I think this is what they do in the LMAX Disruptor, to ensure that the next handler always see modification made by the previous handler(s).

I have not done a full analysis of the QuickFixJ code, but I think the following is true. 1) Every time the sender sequence number is incremented senderMsgSeqNumLock is locked 2) Only QFJ Message Processor is incrementing the target sequence number 3) QFJ Message Processor consumes messages from a LinkBlockingQueue

If this analysis is correct, my conclusion would be A) It is SAFE for any sender thread to call into the JdbcStore and read both sequence number, because a synchronization action would always occur when senderMsgSeqNumLock.lock is invoked. B) It is SAFE for QFJ Message Processor to read the both sequence numbers, because LinkBlockingQueue.poll is called prior to reading the sender sequence number, and since this causes a synchronization action, it will read the correct value.

nolag commented 3 years ago

I have change my mind, I don't think there is a bug.

There was a change to the memory model back in Java 5. Synchronization actions, like volatile read/write, lock/unlock of monitors will effectively cause a flush of all cached modification (memory barrier). This means that it is now possible to write correct multithreaded programs using a single volatile field to control flushing before the next thread operates on a data structure. I think this is what they do in the LMAX Disruptor, to ensure that the next handler always see modification made by the previous handler(s).

I have not done a full analysis of the QuickFixJ code, but I think the following is true.

  1. Every time the sender sequence number is incremented senderMsgSeqNumLock is locked
  2. Only QFJ Message Processor is incrementing the target sequence number
  3. QFJ Message Processor consumes messages from a LinkBlockingQueue

If this analysis is correct, my conclusion would be A) It is SAFE for any sender thread to call into the JdbcStore and read both sequence number, because a synchronization action would always occur when senderMsgSeqNumLock.lock is invoked. B) It is SAFE for QFJ Message Processor to read the both sequence numbers, because LinkBlockingQueue.poll is called prior to reading the sender sequence number, and since this causes a synchronization action, it will read the correct value.

Right, the sender sequence number is locked, we're incrementing the sequence number (non-volitile or atomic) on a thread.

Say the target goes to get updated after we read a message. It makes it to line 281 before the update sending number comes in. That thread isn't blocked by the queue, only a lock on the sender seq number so it can run in parallel. Sender sequence number is incremented by thread 2 and the correct (if we're lucky because the target could be read wrong on that thread) gets written by the sender. The target then continues to run and writes the sender number as 1 less than it is. The same function updates both the sequence numbers in the DB. If we change it to only update the one that it thinks was modified it would work.

QwertGold commented 3 years ago

hehe, I have changed my mind again ;)

I believe you are correct. When the sender thread acquires the senderMsgSeqNumLock we have what the Java spec calls a Synchronization action, this means that we are guaranteed to see any change performed by other threads before this point in time. As you point out the same is true for the receiver thread. It is possible for the sender to update the sender sequence number after the message has be taken from the LinkedBlockingQueue, and in that case, we could write an old value.

The chance of a race condition here is pretty small, but still theoretically possible. Your suggestion about only updating the sequence number controlled by the calling thread seem like a good solution.

Concurrent programming is hard! I still remember by teachers recommendation, "Start with one lock, it will make things much easier".

nolag commented 3 years ago

hehe, I have changed my mind again ;)

I believe you are correct. When the sender thread acquires the senderMsgSeqNumLock we have what the Java spec calls a Synchronization action, this means that we are guaranteed to see any change performed by other threads before this point in time. As you point out the same is true for the receiver thread. It is possible for the sender to update the sender sequence number after the message has be taken from the LinkedBlockingQueue, and in that case, we could write an old value.

The chance of a race condition here is pretty small, but still theoretically possible. Your suggestion about only updating the sequence number controlled by the calling thread seem like a good solution.

Concurrent programming is hard! I still remember by teachers recommendation, "Start with one lock, it will make things much easier".

It's worse, the synchronization action alone wouldn't fix it (though would help and is necessary). If one thread is about to write the correct value then the other rushes in with newer data it can still finish after. It would be extremely rare.

Ideally, the function should only update the number that changed not both. Then you're safe. Otherwise, you can sync (or otherwise lock) on the function itself.

I might be able to take a quick attempt to fix it next week.