ulixee / hero

The web browser built for scraping

executeJs can "hang" due to a crashed chrome renderer #255

Open andynuss opened 3 months ago

andynuss commented 3 months ago

I have written my own waitForFrameLoad (which I also use for the main page). It works in an unusual way that I came up with because some of my test pages, which embed iframes in a variety of ways other than through urls, were failing in the normal frame.waitForLoad().

This is a function I inject into the dom with executeJs, on the assumption that if it succeeds, the frame is nearly ready, and so is the main page.

// My RawDom result type (inferred from usage; included for completeness):
interface RawDom {
  dom?: { html: string; location: string };
  error?: string;
}

// Runs inside the page: serialize the full document if it is available.
function rawDomCapture(): RawDom {
  const result: RawDom = {};
  try {
    const win: any = window;
    const doc = win.document;
    const html: string = doc.documentElement?.outerHTML;
    if (typeof html === 'string') {
      let location: string = win.location.href;
      if (location == null) location = '';
      result.dom = { html, location };
    }
  } catch (e: any) {
    result.error = '' + e.stack;
  }
  return result;
}

Then, in a loop I inject this javascript snippet with the executeJs plugin:
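The snippet itself didn't make it into this issue; the loop is roughly the following sketch (my reconstruction, with an illustrative pollRawDom name; it assumes @ulixee/execute-js-plugin is registered on both client and core, which is what puts executeJs on FrameEnvironment, and it elides the stability comparison I do between successive captures):

import { FrameEnvironment } from '@ulixee/hero';

// Poll one frame with rawDomCapture until it yields a dom or an in-page error.
async function pollRawDom(frame: FrameEnvironment): Promise<RawDom> {
  for (;;) {
    const result = (await frame.executeJs(rawDomCapture)) as RawDom;
    if (result.dom || result.error) return result;
    await new Promise((resolve) => setTimeout(resolve, 250)); // brief pause between polls
  }
}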


Here's the bug: under load, when running several concurrent Hero scrapers on an ec2 instance with plenty of memory and cpu, my very aggressive use of the executeJs plugin seems to trigger the ulixee chrome renderer to crash/disconnect far more often than normal. A contributing factor is that I go after ALL iframes up to a depth of 2 (relative to the main frame), so when this behavior manifests there are sometimes 50-100 frameIds that I have been aggressively hitting with this executeJs trick. I wrote some nodejs code around the shell command ps -p {pid_list} to infer that the chrome "renderer" process was indeed crashing.
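For what it's worth, the liveness check was along these lines (a rough sketch; it assumes Linux procps ps and that I already collected the renderer pids I care about):

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Return the subset of the given pids that are still running.
async function alivePids(pids: number[]): Promise<Set<number>> {
  if (pids.length === 0) return new Set();
  try {
    const { stdout } = await execFileAsync('ps', ['-o', 'pid=', '-p', pids.join(',')]);
    return new Set(
      stdout
        .split('\n')
        .map((line) => Number.parseInt(line.trim(), 10))
        .filter(Number.isFinite),
    );
  } catch {
    return new Set(); // ps exits non-zero when none of the pids exist
  }
}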

But the reason I am reporting this as a bug that might be fixable is that frame.executeJs() does not offer a timeout argument, and clearly my injection does very little (though perhaps sooner than Hero is used to!), so there should be no reason for this executeJs to "hang".

Again, the reason it hangs is that either the chrome "renderer" has already crashed, or the act of doing this so many times for the same Hero instance causes the crash.

My workaround was to use a "race" promise to throw an error right away. Rather than restarting my server when that happens, I close the Hero instance and hope your code detects that chrome has crashed/disconnected so that things recover; then I immediately retry the scrape without going after iframes so aggressively.
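The race wrapper is essentially this (a sketch; executeJsWithTimeout is my name for it, and note that the losing executeJs promise is merely abandoned, not cancelled, since the underlying devtools call can't be aborted from here):

import { FrameEnvironment } from '@ulixee/hero';

async function executeJsWithTimeout<T>(
  frame: FrameEnvironment,
  fn: () => T,
  timeoutMs: number,
): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`executeJs did not return within ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever promise settles first wins; on timeout we throw immediately.
    return await Promise.race([frame.executeJs(fn) as Promise<T>, timeout]);
  } finally {
    clearTimeout(timer);
  }
}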

So these might be the possible outcomes:

  1. you tell me that devtools just can't handle the crazy thing I am doing, and is much more likely to crash within or because of executeJs
  2. maybe Hero is better equipped to detect the agent's crash than I am

NOTE: before I ramped up the load and discovered this problem, I did (rarely) see an exception thrown at exactly 60 secs: a ulixee TimeoutError whose message is 'DevtoolsApiMessage did not respond after 60 seconds.'

Maybe that's what I am seeing as a hang, but 60 seconds seems to be quite a long time to wait for this information.

By the way, at one point I was hoping that, for a given Hero instance with many many iframes, I would be allowed to run my executeJs trick on all of them in parallel/independently.


I guess there may be nothing at all to do in your codebase, but I wanted to report this anyways and learn more about what is going on under the covers. For example, is there a way for me to reliably get the "pid" of the ulixee chrome renderer? Also, what is this chrome renderer? Is there a 1:1 relationship between it and a Hero instance? If this chrome renderer is NOT chrome but another ulixee process, then I would think it should NEVER crash/disconnect, especially in a way that I can easily force to happen within about 15 scrapes, particularly when going after media sites that are ad-heavy (and therefore full of those unnecessary iframes).

At a minimum, it would be nice if Hero would provide a way for me to get the "pid" of this crashable renderer (assuming I am reasoning correctly about this issue).

(BTW: I have ensured that the latest chrome itself is installed on my ec2 instances. Is it being used at all? I'm not seeing it listed as a direct or indirect child process of my server, which embeds the hero code. Does the ulixee chrome renderer app replace chrome in some way?)

The reason I am treating this as a Hero bug for now is that the failure rate is right at the threshold where I cannot justify a larger ec2 instance size, for fear that the retry rate (due to the sporadically crashed chrome renderer) would mean I was wasting the extra processing power.

blakebyrnes commented 3 months ago

Hi Andy, I think there's a single chrome renderer per incognito window that gets created. Hero by default will share 10 hero instances across a single chrome instance, but that's configurable with the maxHeroes... variables. You could experiment to see if those help at all. I don't know how to get those PIDs.

I agree we should add a timeout publicly to executeJs. Should be easy enough to fork and create your own version that does so. ExecuteJs is in many ways meant to just be a template to use as your starting point.

I suspect you're crashing because outerHTML at the top level requires chrome to re-render the entire page; to do so, it has to lock the event loop and redraw. It's called a reflow. Fwiw, you're measuring something that is already done in hero: Hero is already recording all the individual dom changes and tracking them in the DomStateChanges table so that it can rebuild your page. You might find that creating a fork of Hero that makes those statistics available to your front-end solves your problem more efficiently. Would be open to a PR that does that too.

Hope that helps

andynuss commented 3 months ago

When you speak of maxHeroes, I need to tell you that I am doing this:

// Imports per the full-stack deployment docs (paths may vary by hero version):
import HeroCore from '@ulixee/hero-core';
import { TransportBridge } from '@ulixee/net';
import { ConnectionToHeroCore } from '@ulixee/hero';

// Cached so that every Hero instance shares one in-process core.
let connection: ConnectionToHeroCore | undefined;
let core: HeroCore | undefined;
const maxAgents = 12; // my configured concurrency on this machine

export async function getCreateConnection(): Promise<ConnectionToHeroCore> {
  if (connection !== undefined) return connection;
  const bridge = new TransportBridge();
  const maxConcurrency = maxAgents;
  const connectionToCore = new ConnectionToHeroCore(bridge.transportToCore, { maxConcurrency });
  const heroCore = new HeroCore();
  heroCore.addConnection(bridge.transportToClient);
  connection = connectionToCore;
  core = heroCore;
  return connectionToCore;
}

... which is an exported function of my own that creates a ConnectionToHeroCore the first time, caches it in a variable, and returns it each time I want to create a Hero instance. My maxAgents is configured to 12 on this machine, and I strive to keep 10 concurrent scrapes going. Each Hero is what I call an "agent" (in the old terminology), and there does appear to be one chrome "renderer" PID per Hero created this way.
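Each scrape session then creates its Hero roughly like this (my sketch; newAgent is just an illustrative name, and hero.use(ExecuteJsPlugin) is the standard client-side registration for the executeJs plugin):

import Hero from '@ulixee/hero';
import ExecuteJsPlugin from '@ulixee/execute-js-plugin';

// One Hero ("agent") per scrape session, all sharing the cached connection.
async function newAgent(): Promise<Hero> {
  const connectionToCore = await getCreateConnection();
  const hero = new Hero({ connectionToCore });
  hero.use(ExecuteJsPlugin);
  return hero;
}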

I think this is what you call "running on one server (full stack deployment)".

  1. So maybe I shouldn't be using a separate Hero instance for each concurrent scrape session?
  2. Would you like me to give you the full path of the 10+ chrome "renderers"? Are they your code or Chrome's? I ask because their path is inside ulixee's node_modules.
  3. It also appears that these "renderers" are supposed to be terminated when a Hero instance is closed, but sometimes they stay around "forever". More research to do on this.
  4. As for the "hang", it seems to be related not to a hang inside the renderer during executeJs, but to the fact that the "renderer" for the scrape has disappeared (I assume crashed). So the question becomes: whose code is it, and what was the trigger?

That's why I was wondering: do I even need to have native chrome installed on my ec2 instance?

This is what I am doing when I create the AMI:

cat <<EOF | sudo tee /etc/yum.repos.d/google-chrome.repo > /dev/null
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/\$basearch
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
EOF

sudo yum install -y google-chrome-stable xorg-x11-server-Xvfb

Maybe this "real" chrome is not needed/used?

blakebyrnes commented 3 months ago

This is the feature I mean: https://ulixee.org/docs/hero/overview/configuration#core-start

The chrome you're installing won't be used by Hero, so it's kind of pointless unless you're pointing at it with the env vars; in that case, it will occasionally not correctly mask the browser variables.

andynuss commented 3 months ago

When I looked at the link you gave me, it shows that the HeroCore constructor takes options for both maxConcurrentClientCount and maxConcurrentClientsPerBrowser.
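If I understand that page correctly, using them would look something like this (untested sketch; the option names come from the docs, the values are only examples):

const heroCore = new HeroCore({
  maxConcurrentClientCount: 10, // cap on total concurrent Hero clients
  maxConcurrentClientsPerBrowser: 1, // isolate each Hero in its own chrome
});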

Meanwhile, you can see that what I tried to do was modeled after the full stack deployment: https://ulixee.org/docs/hero/advanced-concepts/deployment

and there I set neither of those options, but did set maxConcurrency for new ConnectionToHeroCore().

The strange thing is that, peeking under the covers, whenever I created a new Hero(), I always got a new chrome renderer.

Not sure if that is more or less advisable than sharing a single chrome across multiple concurrent urls from an emulator standpoint, but from a crash standpoint, given that my ec2 instance has 16 GB of heap, it made sense not to let heap and crash-inducing issues in one scraping session infect the others.


Speaking of the crashpad-handler: I verified that it is NOT hooked up on my mac, but IS hooked up on the ec2 instance (linux), so I decided to grab the crashpad-handler pid from the launch command of the first chrome renderer I see and kill it immediately.

Obviously, it would be nicer if you could make sure that crashpad is not enabled, but regardless, this is not related to the crashed chrome renderers I am seeing.
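The kill itself is crude. Here is a sketch of it, assuming Linux procps ps; it scans for the handler process directly rather than parsing the renderer's launch command:

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Find crashpad handler processes by scanning ps output, then kill them.
async function killCrashpadHandlers(): Promise<void> {
  const { stdout } = await execFileAsync('ps', ['-eo', 'pid=,args=']);
  for (const line of stdout.split('\n')) {
    const match = line.trim().match(/^(\d+)\s+(.+)$/);
    if (match && match[2].includes('crashpad')) {
      try {
        process.kill(Number(match[1]), 'SIGKILL');
      } catch {
        // process already exited between the scan and the kill
      }
    }
  }
}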


On the crashed chromes, here are some observations, but first let me give you the Wait functions that I tried:

// FrameNode, VarAgent, ScraperArg, ScraperTrace, WaitDomStable, and sleep are
// my own types/helpers (definitions not shown); TimeoutError is the timeout
// error class that Hero's waitFor* calls reject with.
function LogIfNotTimeout(frameNode: FrameNode, err: any): void {
  if (err instanceof TimeoutError) return;
  const { id: frameId, isMain } = frameNode;
  const whichFrame = isMain ? 'main frame' : `frameId ${frameId}`;
  console.log(`unexpected error in waitForLoad for ${whichFrame}`, err);
}

async function WaitForDomLoad(
  frameNode: FrameNode,
  timeoutMs: number,
): Promise<boolean> {
  try {
    await frameNode.frame.waitForLoad('DomContentLoaded', { timeoutMs });
    return true;
  } catch (e) {
    LogIfNotTimeout(frameNode, e);
    return false;
  }
}

async function WaitForJavascriptReady(
  frameNode: FrameNode,
  timeoutMs: number,
): Promise<boolean> {
  // FIXME: this only works for url-based iframes.  Need to file an
  // issue for Blake for srcdoc, javascript protocol, etc
  try {
    await frameNode.frame.waitForLoad('JavascriptReady', { timeoutMs });
    return true;
  } catch (e) {
    LogIfNotTimeout(frameNode, e);
    return false;
  }
}

async function WaitForDomContentLoaded(
  frameNode: FrameNode,
  timeoutMs: number,
): Promise<void> {
  const { frame } = frameNode;
  let timeout: NodeJS.Timeout | undefined;
  const cancelPromise: Promise<void> = new Promise((resolve) => {
    timeout = setTimeout(() => {
      timeout = undefined;
      resolve();
    }, timeoutMs);
  });
  await Promise.race([cancelPromise, frame.isDomContentLoaded]);
  if (timeout) clearTimeout(timeout);
}

async function WaitForPause(
  frameNode: FrameNode,
  initialPauseMillis: number,
): Promise<void> {
  const { discoveredAt } = frameNode;
  const elapsed = Date.now() - discoveredAt;
  const sleepMs = initialPauseMillis - elapsed;
  if (sleepMs > 0) await sleep(sleepMs);
}

export async function waitForFrame(
  varAgent: VarAgent,
  scraperArg: ScraperArg,
  scraperState: ScraperTrace,
  frameNode: FrameNode,
  waitLoadMillis?: number,
): Promise<boolean> {
  const { isMain } = frameNode;
  if (isMain) {
    // the main frame we wait for DomContentLoaded for a bit
    await WaitForDomLoad(frameNode, 15000);
  } else {
    await WaitForPause(frameNode, 7500);
  }
  if (waitLoadMillis == null) waitLoadMillis = isMain ? 60000 : 30000;
  // NOTE: WaitDomStable() is where I immediately start calling executeJs().
  const isStable = await WaitDomStable(varAgent, scraperArg, scraperState, frameNode, waitLoadMillis);
  return isStable;
}

Here are the observations:

  1. if my frameNode.isMain is true (derived from your FrameEnvironment object), I never get a crash, so long as I call WaitForDomLoad() before I call WaitDomStable().

  2. but for frames (not main), the analog, which I believe is WaitForJavascriptReady(), does significantly worse than WaitForPause(), and even worse than not doing any "pre-waiting" at all before calling the risky WaitDomStable().

  3. this seems to imply that WaitDomStable() (which, as I described before, immediately calls executeJs() to grab the outerHtml and reach "stability"), when called "too soon", actually triggers the crash.

  4. this also relates to why I abandoned WaitForJavascriptReady(): in unit testing all sorts of iframes that were not url-based, I found that WaitForJavascriptReady() ALWAYS crashed chrome when the page's iframes were non-url based (see below).

  5. likewise, for my non-url-based iframe test suite (on my localhost test server), WaitDomStable() also always crashed the chrome renderer (executeJs() called "too soon"). This is why I settled on WaitForPause() as the best I could do.

  6. here's a url that ALWAYS fails for me in WaitDomStable() for one of the iframes (it seems to be the same one) even after WaitForPause(): https://www.haproxy.com/blog/haproxy-is-not-affected-by-the-http-2-rapid-reset-attack-cve-2023-44487

With real pages, especially media sites like bbc.com and theguardian.com, my guess is that they inject both ad-frames and "noise" frames that typically do not use direct urls, because that technique prevents adblock from doing anything about them. These are the types of iframes I am speaking of, and based on my localhost testing I am sure they can trigger the crash (either in executeJs() or in frame.waitForLoad('JavascriptReady', { timeoutMs })).

My overall error rate is about 1 in 7: with the waitForFrame() function above, and given that I usually try to scrape child-frame doms (except on obvious domains like github.com), about 1 out of 7 pages crashes inside waitForFrame(). Many of these will work if attempted a 2nd or 3rd time, even while still going after the iframes, but since speed is important, I chose not to fetch any iframe doms via executeJs() on the 2nd attempt.

andynuss commented 3 months ago

Hi Blake,

I updated this issue with some more discoveries about the trigger for the hang: a chrome crash caused by executeJs() being called (too soon) for an iframe. I hope it is a bug that you can actually fix or work around.

Let me know if you would like my localhost examples of non-url based child frames.

Andy



andynuss commented 3 months ago

Attached test pages: 1.html.txt, about.html.txt, emptysrc.html.txt, emptysrc2.html.txt, javascript.html.txt, nosrc.html.txt, srcdoc.html.txt

Just put these files into some folder of a localhost web server, strip the .txt extension, and then try to scrape the outerHtml of all available frames (for all but the first) via executeJs().