Open pawelmhm opened 8 years ago
For AsyncBrowserCommand, __repr__ should already be truncated; what is in the log?
This can be reproduced easily by doing:

```python
import requests
import json

with open("test.lua") as f:
    lua_source = f.read()

base_url = "http://localhost:8050/execute"
params = {
    "lua_source": lua_source,
    # full text of moby dick
    "body_source": requests.get("https://www.gutenberg.org/files/2701/2701-h/2701-h.htm").text
}
s = requests.post(base_url, data=json.dumps(params), headers={"content-type": "application/json"})
s.raise_for_status()
```
```lua
-- test.lua
function main(splash)
    splash:set_content(splash.args.body_source)
    return splash:png()
end
```
The full text of Moby Dick will end up in the logs.
Aha, so this is about logging of Splash arguments, not about logging function arguments. I agree that it makes sense to truncate them. The new save_args/load_args made it a bit better (you can now avoid logging duplicate parameters), but if you're sending different values they won't help. Can't you use splash:http_get, by the way? It'd let you save network traffic, as you won't be sending the body over the network twice.
Maybe it is also good to be able to avoid logging some of the arguments (e.g. passwords), but that's another issue.
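For reference, the save_args/load_args flow mentioned above can be driven from the client side roughly like this. This is a minimal sketch assuming a local Splash instance on port 8050; the helpers `build_params`, `parse_saved_arguments`, and `render` are made up for illustration and are not part of Splash, and the 498 "expired arguments" handling follows my reading of the Splash HTTP API docs:

```python
import json
import requests

SPLASH_URL = "http://localhost:8050/execute"  # assumed local Splash instance


def build_params(lua_source, saved=None):
    """Build /execute parameters, reusing a cached lua_source hash if we
    already have one (hypothetical helper, not part of Splash)."""
    if saved and "lua_source" in saved:
        # Splash already has the script cached: send only its hash.
        return {"load_args": {"lua_source": saved["lua_source"]}}
    # First request: send the full script and ask Splash to cache it.
    return {"lua_source": lua_source, "save_args": ["lua_source"]}


def parse_saved_arguments(header):
    """Parse the X-Splash-Saved-Arguments response header, which maps
    argument names to hashes as "name1=hash1;name2=hash2"."""
    return dict(item.split("=", 1) for item in header.split(";"))


def render(lua_source, saved_cache={}):
    """POST to /execute, remembering saved-argument hashes between calls."""
    params = build_params(lua_source, saved_cache)
    resp = requests.post(SPLASH_URL, data=json.dumps(params),
                         headers={"content-type": "application/json"})
    if resp.status_code == 498:
        # Splash expired the cached argument: resend the full value.
        saved_cache.clear()
        return render(lua_source, saved_cache)
    resp.raise_for_status()
    header = resp.headers.get("X-Splash-Saved-Arguments")
    if header:
        saved_cache.update(parse_saved_arguments(header))
    return resp
```

This avoids re-logging (and re-sending) the same lua_source on every request, but, as noted, it doesn't help when the large argument (like body_source here) differs each time.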
but if you're sending different values they won't help. Can't you use splash:http_get, by the way?
This is a somewhat unfortunate and probably unusual case: a spider that needs to parse rotating banners. Those banners change on every request; if you refresh, you will see different banners. We want the screenshot to match exactly what the spider sees, so setting content seems like the best (and only?) choice.
Can't you download a banner and take a screenshot in a single Splash script? Also, if body is a banner binary data (gif, png) then why is Splash needed?
Also, if body is a banner binary data (gif, png) then why is Splash needed?
Good point, but the banner is binary data (an image) plus a link that needs to be followed by the spider (e.g. <a href="ff"><img src="ff"></a>). So the spider extracts links, follows them, and sends a request to Splash to take a screenshot of the page where it found the links.
Can't you download a banner and take a screenshot in a single Splash script?
Screenshot taking is part of a larger crawl, so we would have to reimplement or move some of the request-making logic into Splash, e.g. follow links from Splash Lua rather than from the spider. It could be doable, but the Lua script would be complicated.
The intended workflow is the following:
1) The Scrapy spider sends a SplashRequest to download a page;
2) the Lua script finds banners in the page (using some rules?);
3) the Lua script figures out banner dimensions and takes their screenshots;
4) the Lua script returns the HTML contents of the page and screenshots of all banners to the Scrapy spider;
5) the Scrapy spider can now store the screenshots, extract links from the HTML, and send new Splash requests.
Currently (2) can be easier to implement in Scrapy than in Splash, but is there anything else that makes this hard?
Well, yeah, (2) is the main blocker; defining those rules in Lua is certainly possible (using JS xpath or css functions), but it's much easier in Scrapy. Also, (1) would require us to send all requests to Splash, and banners are not on all pages. Rendering with Splash is expensive; even if you use splash:http_get, you still need to send the request to Splash and get the response back, instead of doing the GET directly from the spider to the website.
Plus it's often easier to debug a Scrapy spider than a Splash Lua script.
I just downloaded the logs for one Splash application and it is a really huge file. This is because we do splash:set_content(response.body) in one spider, which means the whole huge response body is dumped into the logs.
Maybe we should truncate long arguments, say, show only the first 1k characters per parameter in the logged JSON.
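A minimal sketch of what such truncation could look like on the logging side; `truncate_for_log` is a hypothetical helper, not existing Splash code:

```python
def truncate_for_log(args, limit=1024):
    """Return a copy of an argument dict with long string values cut to
    `limit` characters, so huge bodies (like a full page passed via
    body_source) don't end up in the logs verbatim.
    Hypothetical helper, not part of Splash."""
    out = {}
    for name, value in args.items():
        if isinstance(value, str) and len(value) > limit:
            # Keep a note of how much was dropped, for debuggability.
            out[name] = value[:limit] + "... [%d more characters truncated]" % (len(value) - limit)
        else:
            out[name] = value
    return out
```

The truncated dict would then be what gets serialized into the JSON log entry, while the original arguments are passed to the renderer untouched.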