istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

`demo.incoming` topic keeps piling up #221

Closed: YanzhongSu closed this issue 5 years ago

YanzhongSu commented 5 years ago

The original question was posted on Stack Overflow: demo.incoming topic in Kafka keeps piling up.

I am using Scrapy Cluster. About 70 requests per second are submitted to Kafka via the Scrapy Cluster REST API (the producer). The spiders can finish the crawl pretty fast, because the queue in Redis remains at a very low number, less than 10 most of the time. But the number of messages in demo.incoming keeps piling up every second. This is the command I used to check the number of messages in the demo.incoming topic in Kafka:

kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list localhost:9092,kafka-statefulset-2:9092,kafka-statefulset-1:9092 \
--topic demo.incoming \
--time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'

I thought it was because Kafka Monitor (the consumer) cannot pick up messages from Kafka and push them to Redis fast enough, which causes the demo.incoming topic to pile up. But despite scaling Kafka Monitor up to 30 replicas, the topic still keeps piling up.

In theory, the number of messages in Kafka should remain very low, because the consumer, Kafka Monitor in this case, should consume each message as soon as it arrives, considering it has more than 30 replicas.
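
For reference, a minimal sketch of feeding a crawl request through the rest service (the /feed endpoint and payload fields come from the scrapy-cluster docs; the port and field values here are placeholder assumptions, not my exact setup):

curl -s -XPOST http://localhost:5343/feed \
  -H "Content-Type: application/json" \
  -d '{"url": "http://example.com", "appid": "testapp", "crawlid": "abc123"}'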

madisonb commented 5 years ago

It seems to me like you don't have enough partitions in your topic to keep up with your inbound messages. What you are saying is that you scale the Kafka Monitor up to 30 replicas, yet are still falling behind. To me, this indicates poor Kafka topic performance, and one of the remedies is to increase the partition count on your topic to something like 5-10.

In Kafka, only one consumer in a group can consume from a partition at any given time. Let's assume by default you have 1 partition in your topic; then at most you can have 1 consumer reading from it at any given time, no matter how much you scale your consumers. Setting the partition count to, say, 8-10 would allow you 8-10 simultaneous consumers, each reading from their own partition.

To me it seems like you have your cluster tuned very well to keep your redis count low, (great!) but you need to increase your partition count on your incoming topic so you can utilize your replica consumers.

1 partition = 1 simultaneous consumer
10 partitions = 10 simultaneous consumers

...and so on. You want to tune the Kafka Monitor count to be roughly equal to the number of partitions on your topic.
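
As a rough sketch, bumping the partition count looks something like this (the broker address is a placeholder; note that partitions can only be increased, never decreased, and on Kafka versions before 2.2 the flag is --zookeeper rather than --bootstrap-server):

kafka-topics.sh --alter \
  --bootstrap-server localhost:9092 \
  --topic demo.incoming \
  --partitions 10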

Check out the first couple of paragraphs and pictures at this link for more info.

YanzhongSu commented 5 years ago

Thank you so much @madisonb. Your answer solved my question. You saved my day. I re-partitioned the topic to 8 partitions, and it works as expected now.

One thing that confuses me is that the number of messages still keeps increasing. I was told on Stack Overflow that the command I used above to check the number of messages in Kafka is not correct, and it was suggested to use the kafka-consumer-groups tool instead. I have yet to find the correct command.

madisonb commented 5 years ago

The main thing you should care about is consumer lag, not consumer offset numbers. The numeric offset will continue to climb, since each message gets its own numeric offset. What you want instead is lag, which is how far behind the consumer group is from real time. Ideally this stays at 0, or perhaps under 1000 in production, but if it continues to grow you need to keep scaling to meet your throughput demands.
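
In other words, per-partition lag = log-end-offset - current-offset; with hypothetical numbers, a partition whose log ends at offset 10500 while the group has only committed offset 10336 is lagging by 164 messages.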

I personally use a tool like klag to view Kafka lag for the consumers in my cluster:

# klag_conf.json
# key is group, value is topic name
{
  "sc-kafka-monitor":["demo.incoming"]
}
$ klag --groups-file klag_conf.json
Group     Topic                                                        Remaining
================================================================================
sc-kafka-monitor                                                       [STABLE]
          demo.incoming                                                     164

YanzhongSu commented 5 years ago

As you can see from the screenshot below, in my case the group is called demo-group, but under it there is no topic listed for some reason.

[screenshot: klag output showing group demo-group with no topics listed under it]

madisonb commented 5 years ago

I would double-check that you have a list of topic names as the value for your key, and that everything is spelled correctly. I don't typically run the latest version of Kafka, so perhaps that library is outdated.
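
To rule out a spelling or setup issue, you can also describe the topic directly (broker address is a placeholder for your cluster):

kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic demo.incoming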

Regardless, you can try this command to get the current lag of your consumer group.

$ docker-compose exec kafka /opt/kafka/bin/kafka-consumer-groups.sh --describe --group demo-group --bootstrap-server kafka:9092

# when I run it on my local docker-compose cluster, I get something like the following
TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                             HOST            CLIENT-ID
demo.incoming   0          0               0               0               kafka-python-1.3.3-ab7452ff-107e-46ff-be4d-850ac318c342 /172.23.0.5     kafka-python-1.3.3
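
If you want a single number out of that, a quick sketch that sums the LAG column (the field position is taken from the header above):

kafka-consumer-groups.sh --describe --group demo-group \
  --bootstrap-server kafka:9092 \
  | awk '$5 ~ /^[0-9]+$/ {sum += $5} END {print "total lag: " sum}'
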
YanzhongSu commented 5 years ago

@madisonb Cheers, this worked.