Open vladholubiev opened 7 years ago
Hi @vladgolubev. Hm.. I suspect you've run into the 128 KB AWS IoT message broker limit. Not sure about the best solution, but we'll need to figure something out as I can imagine 128 KB won't be enough in many situations..
Is it possible to gzip content?
@adieuadieu maybe similar solution as for pdfs/screenshots?
Implement a .html() method (https://github.com/graphcool/chromeless/issues/74) which will upload a ${cuid()}.html file to an S3 bucket?
I'm thinking something along the lines of breaking up the payload into multiple message chunks that get passed around by the MQTT broker, perhaps gzipping them on top of that. We would like to support Azure and GCP in the future, too, so we also need to take their equivalent messaging products and their limits into consideration.
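A minimal sketch of the chunk-and-gzip idea, not Chromeless's actual implementation. The names `toChunks`, `fromChunks`, the `{ id, index, total, data }` envelope, and the 90 KiB budget are all assumptions made for illustration; only Node's built-in `zlib` and `Buffer` are used.

```javascript
const zlib = require("zlib");

// Hypothetical per-chunk budget for the raw (pre-base64) bytes. Base64 grows
// data by about 4/3, so 90 KiB of raw bytes stays under a 128 KB message limit
// with room left for the JSON envelope.
const RAW_CHUNK_BYTES = 90 * 1024;

// Gzip a string payload and split it into numbered chunk messages.
function toChunks(id, payload, rawChunkBytes = RAW_CHUNK_BYTES) {
  const compressed = zlib.gzipSync(Buffer.from(payload, "utf8"));
  const total = Math.max(1, Math.ceil(compressed.length / rawChunkBytes));
  const chunks = [];
  for (let index = 0; index < total; index++) {
    const slice = compressed.subarray(index * rawChunkBytes, (index + 1) * rawChunkBytes);
    chunks.push({ id, index, total, data: slice.toString("base64") });
  }
  return chunks;
}

// Reassemble chunk messages (in any arrival order) into the original string.
function fromChunks(chunks) {
  const total = chunks[0].total;
  if (chunks.length !== total) {
    throw new Error(`expected ${total} chunks, got ${chunks.length}`);
  }
  const ordered = [...chunks].sort((a, b) => a.index - b.index);
  const compressed = Buffer.concat(ordered.map((c) => Buffer.from(c.data, "base64")));
  return zlib.gunzipSync(compressed).toString("utf8");
}
```

Each chunk would be published as its own MQTT message and the receiver would call `fromChunks` once all `total` pieces for an `id` have arrived. Because nothing here depends on AWS IoT specifically, the same envelope could in principle be reused with other brokers by changing the byte budget.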
@vladgolubev we don't have to worry about the response payload limit (or any APIG limits) since we never respond with anything Chrome-related from the Lambda function's callback(). Currently, everything is communicated between Chromeless and the Proxy (running on Lambda) over MQTT (AWS IoT).
@adieuadieu Could the 6 MB response payload limit for Lambda or the 10 MB limit for API Gateway become an issue later, even after splitting? Or does Chromeless not interact with Lambda directly?
Thanks, now I got it!
Wanted to leave here as a reference how AWS encapsulated a solution for a similar problem - https://aws.amazon.com/about-aws/whats-new/2015/10/now-send-payloads-up-to-2gb-with-amazon-sqs/
But now I see splitting messages is a more generic solution.
Because it may work for html now, but then the same problem will pop up when someone wants to return a large array of URLs or whatever from .evaluate()
@vladgolubev Hi, I am having issues using .html() with size limits as mentioned above.
You mentioned that having .html save to S3 was implemented (${cuid()}.html), however I'm not seeing the files in the S3 bucket, though I do see the .png.
@labithiotis sorry if it was misleading. I only suggested that solution. This size issue is still being resolved by @adieuadieu
@vladgolubev Great to know, but is there anything I could do now to resolve this? Either increase limits or save html?
I think saving the html file is the best solution for the time being. @adieuadieu and @schickling what do you think? .html can return a large payload depending on the page.
Another option would be to implement message chunking for the websocket connection.
Alternatively, we should make it easier to work with S3 while at the same time decoupling it from APIs like .screenshot etc. WDYT?
I think there's a longer-term task to make chunking happen, but it seems like it is still a ways off. I can also see the case where folks want to persist more than just html to disk (e.g. dumps of local storage or other serializable values) in S3.
Maybe the solution is in doing both to a degree:
saveScreenshot and saveHtml)
I adjusted the code to filter through/search over the page DOM in evaluate and avoid passing back huge payloads.
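As an illustration of that workaround, here is a small sketch of a function that pulls out only the data of interest (link URLs in this case) instead of returning the whole document. The name `extractLinks` and the `doc` parameter are hypothetical; the parameter exists only so the function can be exercised outside a browser, and it is an assumption that a function shaped like this can be handed to .evaluate() unchanged.

```javascript
// Intended to run inside the page: return only the link URLs rather than the
// full HTML, which keeps the result well under the message size limit on most
// pages. `doc` defaults to the page's global `document`.
function extractLinks(doc = document) {
  return Array.from(doc.querySelectorAll("a[href]"), (a) => a.href);
}
```

The same pattern applies to any other extraction: do the filtering in the page and send back only the small result.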
@joelgriffith @labithiotis I have added an htmlUrl() endpoint on this fork: https://github.com/YazzyYaz/chromeless and it works locally on my computer, returning a file on my desktop with the html. However, when I try to test it on AWS Lambda, it doesn't recognize the endpoint after I deploy it. I even configured the package.json to point to the locally modified chromeless and it didn't help. Any ideas on what I'm doing wrong?
EDIT: I was doing something stupid, it works on AWS Lambda now :)
@adieuadieu @joelgriffith PR for this issue: https://github.com/graphcool/chromeless/pull/274
I'm using Chromeless to scrape html from websites, and discovered that the function never returns a value if the text is too big. I deployed my own serverless project as provided in the repo. By trial and error I found it times out if the returned html string is larger than 131060 bytes.
Looking up this 'magical' number, it seems to make some sense:
Is there any internal CDP limitation for 128KiB?
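For reference, 128 KiB is 131072 bytes, so the observed threshold of 131060 bytes sits 12 bytes below it. A rough sketch of a pre-flight size check follows; `fitsInOneMessage` and both constants are illustrative names, and the 131060 figure is just the trial-and-error value reported above, not a documented limit.

```javascript
// 128 KiB expressed in bytes.
const LIMIT_BYTES = 128 * 1024; // 131072

// Largest payload that was observed to come back (trial and error, see above).
const OBSERVED_MAX_BYTES = 131060;

// Check the UTF-8 byte length, not the character count: non-ASCII text takes
// more than one byte per character.
function fitsInOneMessage(text, maxBytes = OBSERVED_MAX_BYTES) {
  return Buffer.byteLength(text, "utf8") <= maxBytes;
}
```

A check like this could be used to decide up front whether a result has to be filtered, chunked, or written to S3 instead of being returned directly.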