internetarchive / umbra

A queue-controlled browser automation tool for improving web crawl quality
Apache License 2.0

Include URL in BrowserThread name, so it will be part of every log li… #72

Open blekinge opened 5 years ago

blekinge commented 5 years ago

Include the URL in the BrowserThread name so that it is logged explicitly. Since each BrowserThread appears to be created to handle exactly one URL, having both the URL and the Chrome port as part of the thread name makes sense. For debugging, it makes the logs a LOT easier to read.
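As a rough illustration of the idea (not the actual change in this pull request), a thread name that carries both the port and the URL could be built like this. The function name `browsing_thread_name`, the `max_url_len` parameter, and the truncation step are all assumptions made for this sketch; the `BrowsingThread:<port>` prefix is taken from the log format umbra already emits.

```python
import threading


def browsing_thread_name(port, url, max_url_len=80):
    """Build a thread name holding the Chrome port and the URL.

    Hypothetical helper: truncating the URL is an assumption of this
    sketch, meant to keep each log line at a bounded length.
    """
    if len(url) > max_url_len:
        url = url[:max_url_len - 3] + "..."
    return "BrowsingThread:{}:{}".format(port, url)


def start_named_thread(target, port, url):
    """Start a thread whose name shows up wherever %(threadName)s is logged."""
    thread = threading.Thread(target=target, name=browsing_thread_name(port, url))
    thread.start()
    return thread
```

With a logging format that includes `%(threadName)s`, every line written from that thread would then show the port and the (possibly shortened) URL.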

nlevitt commented 5 years ago

URLs can be very, very long. It's too much, I think. It's pretty easy to track a URL in the logs by grepping for the browser port, as you point out. The URL is logged on a line like this:

2019-01-29 22:34:50,264 21163 INFO BrowsingThread:35490 umbra.controller.AmqpBrowserController.browse_page_sync(controller.py:284) browser=... client_id=... url=... behavior_parameters=...

Then all the subsequent lines with :35490 are about that URL.
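The grep approach described above can be sketched in Python as follows. This is a minimal sketch that assumes the thread name appears in each log line as `BrowsingThread:<port>`, as in the sample line; the function name `lines_for_port` is made up for the example.

```python
import re


def lines_for_port(log_lines, port):
    """Return the log lines written by the browsing thread on the given port.

    Assumes the thread name is logged as "BrowsingThread:<port>". The
    trailing word boundary keeps port 35490 from matching 354901.
    """
    token = re.compile(r"\bBrowsingThread:{}\b".format(int(port)))
    return [line for line in log_lines if token.search(line)]
```

The first matching line that contains `url=` identifies the page, and the remaining matches are the activity for that page.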