schickling / chromeless

🖥 Chrome automation made simple. Runs locally or headless on AWS Lambda.
https://chromeless.netlify.com
MIT License

128 KB AWS IoT message broker limit with .evaluate() result #114

Open vladholubiev opened 7 years ago

vladholubiev commented 7 years ago

I'm using Chromeless to scrape HTML from websites, and discovered that the function never returns a value if the text is too big. I deployed my own serverless project as provided in the repo. By trial and error I found that it times out if the returned HTML string is larger than 131060 bytes.

const chromeless = new Chromeless({ remote: true })

const text = await chromeless
  .goto('https://www.graph.cool')
  .evaluate(() => 'a'.repeat(131061)) // times out, but 131060 works

console.log(text)

await chromeless.end()

Looking up this 'magic' number, it seems to make some sense:

image

Is there an internal CDP limitation of 128 KiB?
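For reference, the arithmetic behind that threshold can be checked directly. This is only a sketch: attributing the leftover bytes to message framing or a JSON envelope is an assumption, not something confirmed in this thread.

```javascript
// 128 KiB expressed in bytes.
const LIMIT_BYTES = 128 * 1024

// Largest string length reported to work in the snippet above
// ('a' is one byte in UTF-8, so length equals byte count here).
const LARGEST_WORKING = 131060

// Bytes left over for whatever wraps the result (assumed: envelope/framing).
const overhead = LIMIT_BYTES - LARGEST_WORKING

console.log({ LIMIT_BYTES, LARGEST_WORKING, overhead })
```

The reported cutoff sits just under 128 KiB, which is consistent with a per-message size cap somewhere between the proxy and the client.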

adieuadieu commented 7 years ago

Hi @vladgolubev. Hm.. I suspect you've run into the 128 KB AWS IoT message broker limit. Not sure about the best solution, but we'll need to figure something out as I can imagine 128 KB won't be enough in many situations..

joelgriffith commented 7 years ago

Is it possible to gzip content?
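A minimal sketch of what gzipping a result could look like, using Node's built-in `zlib`. The helper names `packResult` and `unpackResult` and the `BROKER_LIMIT` constant are hypothetical and not part of Chromeless; base64 is used here only on the assumption that the result has to travel inside a text message.

```javascript
const zlib = require('zlib')

// Assumed per-message limit of 128 KiB, as discussed in this thread.
const BROKER_LIMIT = 128 * 1024

// Hypothetical helper: gzip a string result and base64-encode it so it can
// be embedded in a text (for example JSON) message.
function packResult(text) {
  return zlib.gzipSync(Buffer.from(text, 'utf8')).toString('base64')
}

// Hypothetical helper: reverse of packResult.
function unpackResult(packed) {
  return zlib.gunzipSync(Buffer.from(packed, 'base64')).toString('utf8')
}

const original = 'a'.repeat(131061)
const packed = packResult(original)

console.log('packed bytes:', Buffer.byteLength(packed))
console.log('fits under limit:', Buffer.byteLength(packed) < BROKER_LIMIT)
console.log('round trip ok:', unpackResult(packed) === original)
```

Compression only raises the ceiling: highly repetitive text shrinks a lot, but a large page with little redundancy can still exceed the limit after gzip, and base64 adds roughly a third on top of the compressed size.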

vladholubiev commented 7 years ago

@adieuadieu maybe similar solution as for pdfs/screenshots?

Implement an .html() method (https://github.com/graphcool/chromeless/issues/74) which would upload a ${cuid()}.html file to the S3 bucket?
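The idea could be sketched roughly as follows: return small results inline and offload large ones, handing back only a URL. Everything here is an assumption for illustration. `inlineOrOffload` is a hypothetical helper, and `upload` is an injected function standing in for whatever would actually write the object to S3; no real AWS call is shown.

```javascript
// Hypothetical helper: return small results inline; otherwise pass the body
// to an injected `upload` function and return only a reference to it.
async function inlineOrOffload(body, limitBytes, upload) {
  if (Buffer.byteLength(body, 'utf8') <= limitBytes) {
    return { kind: 'inline', body }
  }
  const url = await upload(body)
  return { kind: 'url', url }
}

// Stand-in uploader for demonstration only; a real one would store the body
// in a bucket and resolve with the object's URL.
const fakeUpload = async (body) => `https://example.invalid/${body.length}.html`

inlineOrOffload('<html>small</html>', 128 * 1024, fakeUpload)
  .then((result) => console.log(result.kind))

inlineOrOffload('a'.repeat(200000), 128 * 1024, fakeUpload)
  .then((result) => console.log(result.kind, result.url))
```

Injecting the uploader keeps the size check independent of any one storage provider, which matters if other clouds are to be supported later.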

adieuadieu commented 7 years ago

I'm thinking something along the lines of breaking up the payload into multiple message chunks that get passed around by the MQTT broker, perhaps gzipping them on top of that. We would like to support Azure and GCP in the future, too, so we also need to take their equivalent messaging products and their limits into consideration.
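A minimal sketch of such chunking, under stated assumptions: `toChunks` and `fromChunks` are hypothetical helpers, the message shape `{ id, index, total, data }` is invented for illustration, and the publish/subscribe step over MQTT is left out entirely.

```javascript
// Hypothetical helper: split a string into messages that each carry at most
// `chunkBytes` bytes of payload, plus the metadata needed to reassemble them.
function toChunks(id, text, chunkBytes) {
  const data = Buffer.from(text, 'utf8')
  const total = Math.max(1, Math.ceil(data.length / chunkBytes))
  const chunks = []
  for (let index = 0; index < total; index++) {
    const slice = data.subarray(index * chunkBytes, (index + 1) * chunkBytes)
    chunks.push({ id, index, total, data: slice.toString('base64') })
  }
  return chunks
}

// Hypothetical helper: reassemble chunks that may arrive out of order.
// Buffers are concatenated before decoding, so a multi-byte character split
// across two chunks is restored correctly.
function fromChunks(chunks) {
  const ordered = [...chunks].sort((a, b) => a.index - b.index)
  if (ordered.length !== ordered[0].total) {
    throw new Error('missing chunks')
  }
  return Buffer.concat(ordered.map((c) => Buffer.from(c.data, 'base64'))).toString('utf8')
}

// 90 KiB of raw payload becomes 120 KiB of base64, leaving headroom under a
// 128 KiB limit for the rest of the message.
const chunks = toChunks('result-1', 'é'.repeat(100000), 90 * 1024)
console.log(chunks.length, fromChunks(chunks.reverse()).length)
```

Because the chunk size is a parameter, the same scheme could be tuned to the limits of other message brokers.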

adieuadieu commented 7 years ago

@vladgolubev we don't have to worry about the response payload limit (or any APIG limits) since we never respond with anything Chrome-related from the Lambda function's callback(). Currently, everything is communicated between Chromeless and the Proxy (running on Lambda) over MQTT (AWS IoT).

vladholubiev commented 7 years ago

@adieuadieu Could the 6 MB response payload limit for Lambda or the 10 MB limit for API Gateway become an issue later, even after splitting? Or does Chromeless not interact with Lambda directly?

vladholubiev commented 7 years ago

Thanks, now I got it!

I wanted to leave this here as a reference for how AWS packaged a solution to a similar problem: https://aws.amazon.com/about-aws/whats-new/2015/10/now-send-payloads-up-to-2gb-with-amazon-sqs/

But now I see splitting messages is a more generic solution.

Because it may work for html now, but then the same problem will pop up when someone wants to return a large array of URLs or whatever from .evaluate()

labithiotis commented 7 years ago

@vladgolubev Hi, I am having issues using .html() with the size limits mentioned above. You mentioned that saving .html output to S3 was implemented (${cuid()}.html), but I'm not seeing those files in the S3 bucket. I do see the .png files, though.

vladholubiev commented 7 years ago

@labithiotis sorry if it was misleading. I only suggested that solution. This size issue is still being resolved by @adieuadieu

labithiotis commented 7 years ago

@vladgolubev Great to know, but is there anything I could do now to resolve this? Either increase limits or save html?

joelgriffith commented 7 years ago

I think saving the html file is the best solution for the time being. @adieuadieu and @schickling what do you think? .html can return a large payload depending on the page

schickling commented 7 years ago

Another option would be to implement message chunking for the websocket connection.

Alternatively, we should make it easier to work with S3 while at the same time decoupling it from APIs like .screenshot etc. WDYT?

joelgriffith commented 7 years ago

I think there's a longer-term task to make chunking happen, but it seems like it is still a ways off. I can also see the case where folks want to persist more than just HTML in S3 (e.g. dumps of local storage or other serializable values).

Maybe the solution is in doing both to a degree:

labithiotis commented 7 years ago

I adjusted my code to filter and search the page DOM inside evaluate, so it avoids passing back huge payloads.
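A sketch of that kind of workaround: do the filtering inside the function passed to .evaluate() so that only a small result crosses the broker. The function name `collectLinks`, the selector and the cap are assumptions for illustration. The Chromeless call is shown in a comment because it needs a deployed proxy to run.

```javascript
// Intended to run in the page context when passed to .evaluate(), where the
// global `document` is available. Returns at most `maxItems` link URLs
// instead of the whole page HTML.
function collectLinks(selector, maxItems) {
  return Array.from(document.querySelectorAll(selector))
    .slice(0, maxItems)
    .map((el) => el.href)
}

// Assumed usage with Chromeless, which forwards extra arguments to the function:
// const links = await chromeless
//   .goto('https://www.graph.cool')
//   .evaluate(collectLinks, 'a[href]', 200)
```

This does not remove the limit; it only keeps the returned value small enough that the limit is unlikely to be hit.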

YazzyYaz commented 7 years ago

@joelgriffith @labithiotis I have added an htmlUrl() endpoint on this fork: https://github.com/YazzyYaz/chromeless. It works locally on my computer, returning a file on my desktop with the HTML. However, when I try to test it on AWS Lambda, it doesn't recognize the endpoint after I deploy it. I even configured package.json to point to the locally modified chromeless and it didn't help. Any ideas on what I'm doing wrong?

EDIT: I was doing something stupid, it works on AWS Lambda now :)

YazzyYaz commented 7 years ago

@adieuadieu @joelgriffith PR for this issue: https://github.com/graphcool/chromeless/pull/274