ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Error sending messages to topic with Kafka #32

Closed ivandonofrio closed 5 years ago

ivandonofrio commented 5 years ago

Kafka returns this error when I send messages to the topic to run a crawl test, as described in the documentation:

cat testdata/seed.json | $KAFKA/kafka-console-producer.sh --broker-list localhost:9092 --topic uris.tocrawl.fc

[2019-03-21 10:08:26,427] ERROR Error when sending message to topic uris.tocrawl.fc with key: null, value: 361 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback) org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for uris.tocrawl.fc-10: 1542 ms has passed since batch creation plus linger time

Would it be possible to clarify how to set up a new crawl, and from which Docker container or environment these commands should be launched?

Thanks for your explanations.

anjackson commented 5 years ago

Not sure, but I think this depends on your Kafka setup. How are you running Kafka?

EDIT: e.g. having just run

docker-compose up

in one terminal, I did this in another (using my installed version of Kafka):

cat testdata/seed.json | /usr/local/bin/kafka-console-producer --broker-list localhost:9092 --topic uris.tocrawl.fc

and it worked.

ivandonofrio commented 5 years ago

My version of Kafka is 1.1.0 and I installed it following this guide: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-18-04

Despite this, I needed to run these commands to make the producer work once the Docker Compose stack was up:

docker cp /home/<user>/ukwa-heritrix/testdata/seed.json ukwa-heritrix_kafka_1:/tmp/
docker exec -it ukwa-heritrix_kafka_1 bash
cat /tmp/seed.json | kafka-console-producer.sh --broker-list 172.17.0.1:9092 --topic uris.tocrawl.fc
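
For reference, the copy-and-exec steps above can probably be collapsed into a single non-interactive command by streaming the seed file over stdin with docker exec -i. This is only a sketch: the helper name send_seeds is made up for illustration, and it assumes kafka-console-producer.sh is on the container's default PATH (it was found without a path in the interactive shell above, but that may not hold for a plain docker exec).

```shell
# Hypothetical helper: pipe a seed file from the host into the console
# producer running inside the Kafka container.
#   $1 = container name, $2 = broker address, $3 = topic, $4 = seed file on the host
send_seeds() {
  # -i keeps stdin open so the file contents reach the producer;
  # no -t, because no terminal is needed for piped input.
  docker exec -i "$1" kafka-console-producer.sh \
    --broker-list "$2" --topic "$3" < "$4"
}

# Example call, using the names from this thread:
# send_seeds ukwa-heritrix_kafka_1 172.17.0.1:9092 uris.tocrawl.fc testdata/seed.json
```

This avoids the docker cp step, since the file never has to exist inside the container.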

What Kafka version are you using? How did you install it?

anjackson commented 5 years ago

I'm using Kafka inside Docker:

https://github.com/ukwa/ukwa-heritrix/blob/491991a1079aba4cc0f24bb03a9cad74ce94d041/docker-compose.yml#L94-L114

This works because all the services are on the same internal Docker network. As you are running your own Kafka on the host, you'll need to set the KAFKA_BOOTSTRAP_SERVERS environment variable for Heritrix so it points to your Kafka.

https://github.com/ukwa/ukwa-heritrix/blob/491991a1079aba4cc0f24bb03a9cad74ce94d041/docker-compose.yml#L23
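
To work out which address to use, a quick check like the following may help. It is a rough sketch, not part of this repository: the helper names are made up, and it assumes bash and the coreutils timeout command are available. It only tests whether the bootstrap address accepts a TCP connection from wherever you run it; Kafka clients then reconnect to whatever address the broker advertises, so that address has to be reachable from the same place too.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: split a host:port bootstrap address at the last colon.
addr_host() { echo "${1%:*}"; }
addr_port() { echo "${1##*:}"; }

# Succeeds if a TCP connection to host:port opens within 3 seconds.
# /dev/tcp is a bash feature, so the attempt runs in an explicit bash subprocess.
broker_reachable() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$(addr_host "$1")/$(addr_port "$1")" 2>/dev/null
}

# Check every address passed on the command line.
for addr in "$@"; do
  if broker_reachable "$addr"; then
    echo "$addr: reachable"
  else
    echo "$addr: NOT reachable"
  fi
done
```

Saved as, say, check_brokers.sh, it could be run on the host as `bash check_brokers.sh localhost:9092 172.17.0.1:9092`, and again from inside a container to compare the results.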

anjackson commented 5 years ago

In case it helps, I've found I have to use this form to connect to Kafka successfully from separate containers when running on the same host:

docker run --network="host" ukwa/ukwa-manage submit -k 192.168.X.X:9094 -L now fc.tocrawl.bypm http://acid.matkelly.com/

HTH. I'll close this but feel free to re-open if necessary.