Closed: 0x0ece closed this issue 8 years ago
Hmm, it looks like those container exits are happening outside of the interface stack, so we might have to detect that we're running in YARN to prevent this.
I also remember seeing a config option to tell Spark to start trying ports from a different initial value. I don't know if this is still around, and it's not the most robust fix, but at least you'd be able to load data. Thoughts?
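(For reference, a minimal sketch of what that might look like, assuming the option in question is one of Spark's fixed-port settings such as spark.ui.port or spark.blockManager.port; the specific port values below are placeholders, not recommendations.)

```scala
// Sketch only: spark.ui.port defaults to 4040 and spark.blockManager.port defaults
// to 0 (random). Setting a non-zero value makes Spark start binding at that port
// and retry upward from it, up to spark.port.maxRetries attempts.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("port-range-workaround")
  .set("spark.ui.port", "4050")            // start the web UI probe at 4050 instead of 4040
  .set("spark.blockManager.port", "20000") // start block manager binds at 20000

val sc = new SparkContext(conf)
```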
Yes, this is kind of weird; it seems pretty hard to detect the failure at the app level.
For my case specifically, I can either increase spark.port.maxRetries or look into this randomness (although every default value related to port allocation seems to be random, so I guess this might be specific to the Kafka integration... I'll have to look into that).
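A minimal sketch of the maxRetries bump, assuming it is set on the SparkConf before the context is created; the value 64 is just a placeholder that should exceed the topic's partition count.

```scala
// Sketch only: raise spark.port.maxRetries above the number of Kafka partitions
// so each service bind gets more attempts before giving up.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("more-port-retries")
  .set("spark.port.maxRetries", "64") // default is 16

val sc = new SparkContext(conf)
```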
What I'm kind of scared of is how many similar cases Spark/YARN are hiding. A nice-to-have would be a sort of "transaction" where either all containers start or all are killed, but I have no idea if this is supported :/
For what it's worth, this doesn't seem like an issue with Kafka. I think YARN containers consume some ports, so starting new executors in an existing YARN cluster decreases the number of available ports for Spark and its various web servers to bind to. Maybe there is something that can be done in YARN to get around this?
I'll try to look into the details.
I said Kafka in the sense that maybe the way containers are requested for the DStream computation forces this incremental port try... but I have no idea; maybe it's a wider issue and I just noticed it in this case.
If what I said is correct, you should be able to repro this issue by creating a vanilla RDD with a large number of partitions (how many partitions are in your Kafka topic?); when it gets processed, the same number of executors should be created.
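Something like the following could serve as that repro, assuming dynamic allocation is enabled in the cluster; the partition and element counts are placeholders.

```scala
// Repro sketch: a plain RDD with more partitions than spark.port.maxRetries.
// With dynamic allocation on, Spark ramps up executor (YARN container) requests
// to cover the pending tasks, which is what appears to exhaust the port range.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("port-exhaustion-repro")
val sc = new SparkContext(conf)

val numPartitions = 128 // set this to your Kafka topic's partition count
val rdd = sc.parallelize(1 to 1000000, numPartitions)

// Force a full pass over every partition so executors actually get started.
println(rdd.map(_ * 2).count())
```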
Closing for now: this is also caused by dynamicAllocation, which I have finally turned off :)
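For anyone landing here later, a minimal sketch of that workaround, assuming the setting meant is spark.dynamicAllocation.enabled; the fixed executor count is a placeholder.

```scala
// Sketch of the workaround mentioned above: turn dynamic allocation off and pin
// a fixed number of executors instead.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-dynamic-allocation")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "2") // placeholder: size for your cluster

val sc = new SparkContext(conf)
```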
Hi, I think I've found an issue reading from Kafka when there are too many partitions, specifically when the number of partitions is greater than spark.port.maxRetries (default 16).
Note that I'm using a CDH Spark cluster in yarn-client mode, so I'm not sure this also repros with the MemSQL Spark distro.
In summary, when a container starts it tries to bind to a port, starting from an initial port and incrementing, and gives up after spark.port.maxRetries attempts. If the Kafka topic has n partitions, then at least n containers are created; if n > spark.port.maxRetries, the available ports are saturated and eventually some containers fail to start.
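To illustrate the partition-to-task mapping, here is a sketch with the Spark 1.x direct Kafka API (not the actual pipeline code; the broker list, topic name, and batch interval are made up):

```scala
// Illustration only: with the direct Kafka API, each batch RDD has one partition
// per Kafka topic partition, so a topic with n partitions produces n tasks per batch.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-partition-count")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))

// Each batch RDD has as many partitions as the topic; if that exceeds
// spark.port.maxRetries, executor startup can run out of bind attempts
// as described above.
stream.foreachRDD(rdd => println(s"partitions in this batch: ${rdd.partitions.size}"))

ssc.start()
ssc.awaitTermination()
```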
I think in these cases it would be better to shut down the pipeline (treat it as a fatal error), so that other pipelines can run without issues.
What happens now is that the exceptions are thrown but "ignored", and data is loaded normally. However, I'm not sure whether all data is loaded in all cases, as creating a test is kind of laborious... but I'd expect that if you have data in all partitions, some of it would fail to load.
Interface logs:
Container logs from: http://...:8042/node/containerlogs/container_1461362160503_0004_01_000010/stderr/stderr/?start=0