scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

splash should truncate long strings when they are printed in logs #446

Open pawelmhm opened 8 years ago

pawelmhm commented 8 years ago

I just downloaded the logs for one Splash application and it is a really huge file. This is because we do splash:set_content(response.body) in one spider, which means the whole huge response body is dumped into the logs.

Maybe we should truncate long arguments, say, show only 1k characters per parameter logged in the JSON.
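
A minimal sketch of the idea (truncate_args and the 1k cap are made up for illustration, this is not Splash's actual logging code):

def truncate_args(args, max_len=1024):
    # Hypothetical helper: cap each string argument at max_len characters
    # before the arguments dict is serialized into the log.
    truncated = {}
    for key, value in args.items():
        if isinstance(value, str) and len(value) > max_len:
            value = value[:max_len] + "...[%d chars truncated]" % (len(value) - max_len)
        truncated[key] = value
    return truncated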

kmike commented 8 years ago

For AsyncBrowserCommand, __repr__ should already be truncated; what is in the log?
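
(For context, repr-side truncation looks roughly like this; a hedged sketch, not AsyncBrowserCommand's actual code — the class name and 64-char cap are made up:)

class TruncatedReprCommand:
    def __init__(self, name, kwargs):
        self.name = name
        self.kwargs = kwargs

    def __repr__(self):
        # Show only a prefix of long string arguments in the repr.
        shown = {
            k: (v[:64] + "..." if isinstance(v, str) and len(v) > 64 else v)
            for k, v in self.kwargs.items()
        }
        return "%s(%r)" % (self.name, shown)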

pawelmhm commented 8 years ago

This can be reproduced easily by running:

import requests
import json

with open("test.lua") as f:
    lua_source = f.read()

base_url = "http://localhost:8050/execute"
params = {
    "lua_source": lua_source,
    # full text of Moby Dick -- the whole body is sent to Splash as an argument
    "body_source": requests.get("https://www.gutenberg.org/files/2701/2701-h/2701-h.htm").text
}
s = requests.post(base_url, data=json.dumps(params), headers={"content-type": "application/json"})
s.raise_for_status()
-- test.lua
function main(splash)
    splash:set_content(splash.args.body_source)
    return splash:png()
end

The full text of Moby Dick will end up in the logs.

kmike commented 8 years ago

Aha, so this is about logging of Splash arguments, not about logging function arguments. I agree that it makes sense to truncate them. The new save_args/load_args options made it a bit better (you can now avoid logging duplicate parameters), but if you're sending different values they won't help. Can't you use splash:http_get, by the way? It'd allow you to save network traffic, as you won't be sending the body over the network twice.
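
A hedged Lua sketch of that suggestion (banner_url is a made-up argument name; the spider would pass the URL instead of the full body):

function main(splash)
    -- fetch the body inside Splash so the spider doesn't POST it at all
    local response = splash:http_get(splash.args.banner_url)
    -- depending on the Splash version, response.body may need to be
    -- converted from a binary object before being passed to set_content
    assert(splash:set_content(response.body))
    return splash:png()
end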

Maybe it is also good to be able to avoid logging some of the arguments (e.g. passwords), but that's another issue.

pawelmhm commented 8 years ago

but if you're sending different values they won't help. Can't you use splash:http_get, by the way?

This is a somewhat unfortunate and probably unusual case: a spider that needs to parse rotating banners. Those banners change on every request; if you refresh, you will see different banners. We want to get a screenshot matching exactly what is seen by the spider, so setting the content seems like the best (and only?) choice.

kmike commented 8 years ago

Can't you download a banner and take a screenshot in a single Splash script? Also, if the body is banner binary data (GIF, PNG), then why is Splash needed?

pawelmhm commented 8 years ago

Also, if the body is banner binary data (GIF, PNG), then why is Splash needed?

Good point, but the banner is binary data (an image) plus a link that needs to be followed by the spider (e.g. <a href="ff"><img src="ff"></a>). So the spider extracts the links, follows them, and sends a request to Splash to take a screenshot of the page where it found the links.

Can't you download a banner and take a screenshot in a single Splash script?

Screenshot taking is part of a larger crawl, so we would have to reimplement or move some of the request-making logic into Splash, e.g. follow links from Splash Lua and not from the spider. It could be doable, but the Lua script would be complicated.

kmike commented 8 years ago

The intended workflow is the following:

1) Scrapy spider sends a SplashRequest to download a page;
2) Lua script finds banners in the page (using some rules?);
3) Lua script figures out banner dimensions and takes their screenshots;
4) Lua script returns the HTML contents of the page and the screenshots of all banners to the Scrapy spider;
5) Scrapy spider can now store screenshots, extract links from the HTML, and send new Splash requests.

Currently, (2) can be easier to implement in Scrapy than in Splash, but is there anything else that makes this hard?
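
A rough Lua sketch of steps (2)-(4), assuming a Splash version with splash:select_all and element:png; the ".banner" CSS rule is made up:

function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))
    -- (2) find banners in the page using some CSS rule
    local banners = splash:select_all(".banner")
    local screenshots = {}
    for i, banner in ipairs(banners) do
        -- (3) element:png crops the screenshot to the element's dimensions
        screenshots[i] = banner:png()
    end
    -- (4) return page HTML and banner screenshots to the Scrapy spider
    return {html = splash:html(), screenshots = screenshots}
end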

pawelmhm commented 8 years ago

Well, yeah, (2) is the main blocker; defining those rules in Lua is certainly possible (using JS XPath or CSS functions), but it's much easier in Scrapy. Also, (1) would require us to send all requests to Splash, and banners are not on all pages. Rendering with Splash is expensive; even if you use splash:http_get, you still need to send a request to Splash and get a response, instead of doing a GET directly from the spider to the website.

Plus, it's often easier to debug a Scrapy spider than a Splash Lua script.