The following comment in KAFKA-7088 mentions the same IllegalStateException:
That's interesting, so you are properly committing or aborting but discarding the client after one commit. I will take a look, thanks.
Running this locally and looking at the logs, this seems to be why the transaction state transition fails:
# Old state
TransactionMetadata(transactionalId=transactional-id, producerId=1, producerEpoch=1804, txnTimeoutMs=60000, state=Empty, pendingState=Some(Ongoing), topicPartitions=Set(), txnStartTimestamp=1551689126534, txnLastUpdateTimestamp=1551689126521)
# New state
TxnTransitMetadata(producerId=1, producerEpoch=1804, txnTimeoutMs=60000, txnState=Ongoing, topicPartitions=Set(test-topic-1551689052613-3), txnStartTimestamp=1551689126527, txnLastUpdateTimestamp=1551689126527)
Not sure how it's possible, but it looks like the current state has a txnStartTimestamp that's higher than the txnStartTimestamp of the state that it's trying to transition to.
Reading the Kafka code, I'm confused as to how this works. If I'm not mistaken, the timestamps are all set on the broker side:
If the request is handled by a different broker, or if the system clock changes, I would expect this to fail, as the clocks won't be synchronized. They even removed one such check here: https://github.com/apache/kafka/pull/3286/files
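To make the comparison concrete, here is a minimal, purely illustrative Java sketch (this is not Kafka's actual broker code) of the invariant that appears to be violated, using the txnStartTimestamp values from the "Old state" and "New state" log lines above:

// Illustrative only: mimics the comparison described above, not the broker's real transition logic.
public class TxnTimestampCheck {
    public static void main(String[] args) {
        long currentTxnStartTimestamp = 1551689126534L; // from the "Old state" TransactionMetadata line
        long newTxnStartTimestamp = 1551689126527L;     // from the "New state" TxnTransitMetadata line

        // The pending Empty -> Ongoing transition seems to be rejected because the new
        // start timestamp is older than the one already recorded for the current state.
        if (newTxnStartTimestamp < currentTxnStartTimestamp) {
            throw new IllegalStateException("txnStartTimestamp went backwards: "
                    + newTxnStartTimestamp + " < " + currentTxnStartTimestamp);
        }
    }
}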
Looking at the logs from the reporter, there is something curious:
kafka1_1 | [2019-02-28 19:34:07,944] INFO [TransactionCoordinator id=0] Initialized transactionalId transactional-id with producerId 382001 and producer epoch 876 on partition __transaction_state-31 (kafka.coordinator.transaction.TransactionCoordinator)
kafka1_1 | [2019-02-28 19:34:07,936] INFO [TransactionCoordinator id=0] Initialized transactionalId transactional-id with producerId 382001 and producer epoch 877 on partition __transaction_state-31 (kafka.coordinator.transaction.TransactionCoordinator)
Note how epoch 877 is initialized at 2019-02-28 19:34:07,936 and 876 at 2019-02-28 19:34:07,944. This means that, at least according to the logs, epoch 876 was initialized 8ms after 877. If we then look at the difference in txnStartTimestamp between the previous state and the one that is being transitioned to, the previous state is 8ms newer than the one that's being transitioned to.
Now, maybe this is a coincidence or I'm misinterpreting the logs, but this looks fishy to me.
Just a note that I learned about recently, which may explain why we're seeing rather many CONCURRENT_TRANSACTIONS errors.

When we commit a transaction, the coordinator just writes a PrepareCommit message to the log, and then later it asynchronously writes the CompleteCommit message. This means that if you initialize another transaction before everything has been written, you'll get a CONCURRENT_TRANSACTIONS error.
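As a rough illustration of how a client could cope with that window, here is a minimal Java sketch (not from this thread, and not how the Java client handles this internally): it recreates the producer and retries initialization with a short, arbitrary backoff while the coordinator may still be finishing the previous commit.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.KafkaException;

import java.util.Properties;

public class TransactionInitBackoff {

    // Sketch: create a new transactional producer, retrying initialization with a short
    // backoff in case the coordinator is still writing the CompleteCommit marker for the
    // previous transaction. The attempt count and delay here are arbitrary.
    static Producer<String, String> createAndInit(Properties config) throws InterruptedException {
        int maxAttempts = 5;
        for (int attempt = 1; ; attempt++) {
            Producer<String, String> producer = new KafkaProducer<>(config);
            try {
                producer.initTransactions();
                return producer;
            } catch (KafkaException e) {
                producer.close();
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(100L * attempt); // back off before trying again
            }
        }
    }
}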
I created #304, as it's related.
I've also been trying to reproduce this in different scenarios.
This certainly indicates that it is an issue in KafkaJS, but my suspicion is that it's timing-related: running my Java reproducer is A LOT slower than KafkaJS. So either I'm doing something weird, or the Java client is waiting on something that KafkaJS isn't.
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.clients.producer.*;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

import java.util.Properties;
import java.util.concurrent.Future;

public class KafkaTest {

    @Test
    @DisplayName("Idempotent producer 0.11.0.2")
    public void producer() throws Exception {
        long noOfProducers = 5000;
        long noOfMessages = 128;
        String topic = "test-topic";

        Properties producerConfig = new Properties();
        producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.3.220.56:9092");
        producerConfig.put(ProducerConfig.CLIENT_ID_CONFIG, "transactional-producer");
        producerConfig.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        producerConfig.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transactional-id");
        producerConfig.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        producerConfig.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Create a fresh producer for every transaction, mirroring the KafkaJS reproducer.
        for (int producers = 0; producers < noOfProducers; producers++) {
            Producer<String, String> producer = new KafkaProducer<>(producerConfig);
            producer.initTransactions(); // initialize transactions
            try {
                producer.beginTransaction(); // begin the transaction
                for (int i = 0; i < noOfMessages; i++) {
                    Future<RecordMetadata> result = producer
                            .send(new ProducerRecord<String, String>(topic, "key-" + i, "message-" + i));
                }
                producer.commitTransaction();
            } catch (KafkaException e) {
                // On error, abort the transaction and move on to the next producer.
                producer.abortTransaction();
            }
            producer.close();
        }
    }
}
@plameniv we ran several tests using your reproducer, and the results are inconclusive. When running against a local Kafka (non-dockerized), it works all the time. We can only reproduce the problem when running on a dockerized Kafka, and like @Nevon said it is related to a timing issue. We can't reproduce on the Java client because it can't go through the same flow fast enough. I've added a sleep call to your reproducer (right after disconnect), and managed to get it to work all the time. Kafka is removing this check, and it will make commit a blocking call, which should fix this for good.

Do you have a use case where you have to trash the producer all the time? When using transactions you should keep the same producer, so I'm assuming you are just testing the edge cases. I don't think this is a blocker for the 1.5.0 release, but I'll wait for your answer.

Thanks again for the great issues.
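For reference, a minimal Java sketch of the "keep the same producer" recommendation above: one transactional producer is created and initialized once and reused for every transaction instead of being recreated per transaction. The broker address, topic, and loop counts are placeholders, and the error handling follows the usual Java-client pattern rather than anything specific from this thread.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;

import java.util.Properties;

public class ReusedTransactionalProducer {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        config.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transactional-id");
        config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Create and initialize the producer once, then reuse it for every transaction.
        try (Producer<String, String> producer = new KafkaProducer<>(config)) {
            producer.initTransactions();
            for (int txn = 0; txn < 5000; txn++) {
                try {
                    producer.beginTransaction();
                    for (int i = 0; i < 128; i++) {
                        producer.send(new ProducerRecord<>("test-topic", "key-" + i, "message-" + i));
                    }
                    producer.commitTransaction();
                } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
                    // Fatal for this producer instance: stop and let try-with-resources close it.
                    break;
                } catch (KafkaException e) {
                    // Transient error: abort this transaction; the producer stays usable.
                    producer.abortTransaction();
                }
            }
        }
    }
}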
agreed, this is an edge case and not a blocker for the 1.5.0 release
I will release 1.5.0 then. @sklose @plameniv, thanks for shaping up this release!
Hi @tulios, do you know in which version of Kafka this check was removed? I am facing this while trying to use a transactional producer. I just wanted to know whether upgrading the version would solve this.
We ran into a java.lang.IllegalStateException while testing the EoS implementation, and we are not sure what is causing it. The scenario is as follows: in a loop, we create a KafkaJS client and a producer, write a number of messages in a single transaction, and then discard the client.

What we observe is that after a variable number of iterations the following exception happens (further detail below).

The Kafka cluster remains up; however, a subsequent run of the reproducer results in KafkaJSNumberOfRetriesExceeded after a number of retries on the CONCURRENT_TRANSACTIONS error. This is regardless of whether we use the same topic.

Reproducer:
Kafka Log:
KafkaJs Log:
KafkaJs Log after second run of reproducer: