ulixee / hero

The web browser built for scraping
MIT License

hero.xpathSelectorAll does not work #257

Closed eNcacz closed 2 months ago

eNcacz commented 3 months ago

Here is the source of the web page I use to demonstrate this bug:

<!DOCTYPE html>
<html>
  <head>
    <title>Testcase</title>
  </head>
  <body>
     <a href="#" id="noAction">Do nothing</a><br />
  </body>
</html>

This is the TypeScript application that demonstrates the bug:

import Hero from '@ulixee/hero-playground';

(async () => {
  console.log('Running Hero');
  const hero = new Hero({ showChrome: true});
  await hero.goto('http://localhost/ulixee/index.html');  // <---- REPLACE THIS URL ACCORDING TO YOUR ENVIRONMENT
  await hero.waitForPaintingStable()
  console.log('Page loaded');

  const elements = await hero.xpathSelectorAll('//a')
  console.log('Done')

  await hero.close();
})();

And this is the output of the application:

/usr/bin/node /home/vaclav/sandbox/ulixee/app.js
Running Hero
Connecting to Ulixee Cloud at localhost:1818
Page loaded

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^
Error [InjectedScriptError]: InvalidStateError: Failed to execute 'iterateNext' on 'XPathResult': The document has mutated since the result was returned.
    at JsPath.runJsPath (/home/vaclav/cr/payment-gateway/lib/be/python/ulixee-hero-browser/agent/main/lib/JsPath.ts:165:13)
    at async FrameEnvironment.execJsPath (/home/vaclav/cr/payment-gateway/lib/be/python/ulixee-hero-browser/node_modules/core/lib/FrameEnvironment.ts:246:12)
    at async CommandRecorder.runCommandFn (/home/vaclav/cr/payment-gateway/lib/be/python/ulixee-hero-browser/node_modules/core/lib/CommandRecorder.ts:90:16)
    at async CommandRunner.runFn (/home/vaclav/cr/payment-gateway/lib/be/python/ulixee-hero-browser/node_modules/core/lib/CommandRunner.ts:36:14)
    at async ConnectionToHeroClient.executeCommand (/home/vaclav/cr/payment-gateway/lib/be/python/ulixee-hero-browser/node_modules/core/connections/ConnectionToHeroClient.ts:258:12)
    at async ConnectionToHeroClient.handleRequest (/home/vaclav/cr/payment-gateway/lib/be/python/ulixee-hero-browser/node_modules/core/connections/ConnectionToHeroClient.ts:66:14)
------REMOTE CORE---------------------------------
  at Function.reviver (/home/vaclav/sandbox/ulixee/node_modules/commons/lib/TypeSerializer.ts:249:26)
    at JSON.parse (<anonymous>)
    at Function.parse (/home/vaclav/sandbox/ulixee/node_modules/commons/lib/TypeSerializer.ts:31:17)
    at WsTransportToCore.onMessage (/home/vaclav/sandbox/ulixee/node_modules/net/lib/WsTransportToCore.ts:105:36)
    at WebSocket.emit (node:events:517:28)
    at Receiver.receiverOnMessage (/home/vaclav/sandbox/ulixee/node_modules/ws/lib/websocket.js:1068:20)
    at Receiver.emit (node:events:517:28)
    at Receiver.dataMessage (/home/vaclav/sandbox/ulixee/node_modules/ws/lib/receiver.js:517:14)
    at /home/vaclav/sandbox/ulixee/node_modules/ws/lib/receiver.js:468:23
    at /home/vaclav/sandbox/ulixee/node_modules/ws/lib/permessage-deflate.js:308:9
------CONNECTION----------------------------------
  at new Resolvable (/home/vaclav/sandbox/ulixee/node_modules/commons/lib/Resolvable.ts:19:18)
    at createPromise (/home/vaclav/sandbox/ulixee/node_modules/commons/lib/utils.ts:140:10)
    at PendingMessages.create (/home/vaclav/sandbox/ulixee/node_modules/net/lib/PendingMessages.ts:47:44)
    at ConnectionToHeroCore.sendRequest (/home/vaclav/sandbox/ulixee/node_modules/net/lib/ConnectionToCore.ts:158:50)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async CoreCommandQueue.sendRequest (/home/vaclav/sandbox/ulixee/node_modules/client/lib/CoreCommandQueue.ts:317:12)
    at async Object.cb (/home/vaclav/sandbox/ulixee/node_modules/client/lib/CoreCommandQueue.ts:231:16)
    at async Queue.next (/home/vaclav/sandbox/ulixee/node_modules/commons/lib/Queue.ts:188:19)
------CORE COMMANDS-------------------------------
    at Queue.run (/home/vaclav/sandbox/ulixee/node_modules/commons/lib/Queue.ts:63:19)
    at CoreCommandQueue.run (/home/vaclav/sandbox/ulixee/node_modules/client/lib/CoreCommandQueue.ts:220:8)
    at CoreFrameEnvironment.execJsPath (/home/vaclav/sandbox/ulixee/node_modules/client/lib/CoreFrameEnvironment.ts:80:36)
    at execJsPath (/home/vaclav/sandbox/ulixee/node_modules/client/lib/SetupAwaitedHandler.ts:160:26)
    at Object.createNodePointer (/home/vaclav/sandbox/ulixee/node_modules/client/lib/SetupAwaitedHandler.ts:77:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async AwaitedHandler.createNodePointer (/home/vaclav/sandbox/ulixee/node_modules/files/2-finalized/awaited-dom/base/AwaitedHandler.ts:33:12)
    at async NodeFactory.createInstanceWithNodePointer (/home/vaclav/sandbox/ulixee/node_modules/files/2-finalized/awaited-dom/base/NodeFactory.ts:23:19)
    at async FrameEnvironment.xpathSelectorAll (/home/vaclav/sandbox/ulixee/node_modules/client/lib/FrameEnvironment.ts:239:20)

--------------------------------------------------
--------------------------------------------------
------6XPKSxbjXlMLmxjHXVHvz-----------------------
-------------------------------------------------- {
  pathState: { step: { '0': 'iterateNext' }, index: 1 }
}

Node.js v18.19.0

The result is always the same, regardless of whether the XPath selector matches zero, one, or many elements.

I also tried hero.document.evaluate() instead, but it crashed with a very similar exception.

Expected behavior

The hero.xpathSelectorAll method should return a collection (possibly empty) of the elements found.

blakebyrnes commented 3 months ago

It appears that if anything changes the document (I'm not sure what is triggering that here), iterating over the XPath results breaks (https://stackoverflow.com/a/27664220). Assuming you're an XPath user, do you usually use the snapshot approach?

eNcacz commented 3 months ago

I do not understand how the document is being changed. I use only a simple static document without any JavaScript in it. The document remains the same from the beginning to the end of the test.

I do not use the snapshot approach. This is simplified code to demonstrate that XPath does not work even on a simple static page. In my real code I need to work with a dynamic web page, and I use XPath selectors in the rare cases when the page structure makes it impossible to use CSS selectors. I'm not sure whether it is OK from a performance point of view to create a snapshot of the whole page before each element search ...

blakebyrnes commented 3 months ago

I'm guessing that in your example it's because you're using showChrome, which adds a mouse tracker by default. You can turn that off, but the underlying problem of the document changing will be present on a normal website, so the root cause here doesn't really matter.

The link I sent uses a different "result type" for the nodes, which is snapshot. It just means the nodes themselves are snapshotted, not the whole document. I'm wondering if we should change the underlying XPath code to use one of the snapshot options and iterators.
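
To illustrate the difference between the two result types described above, here is a toy model in plain TypeScript. It is only a sketch: ToyDocument and its methods are made up for this example and are not how Chromium or Hero implement XPath. It assumes that an iterator-style result remembers the document state it was created from and refuses to continue after any mutation, while a snapshot-style result copies the matching nodes once and is unaffected by later changes.

```typescript
// Toy model only: ToyDocument is hypothetical, not part of Hero or the DOM.
class ToyDocument {
  private version = 0;
  private nodes: string[] = [];

  // Any mutation bumps the version counter.
  append(node: string): void {
    this.nodes.push(node);
    this.version += 1;
  }

  // Iterator-style result: walks lazily and checks the version on every step.
  iterate(): { iterateNext(): string | null } {
    const startVersion = this.version;
    let index = 0;
    return {
      iterateNext: (): string | null => {
        if (this.version !== startVersion) {
          throw new Error(
            'InvalidStateError: The document has mutated since the result was returned.',
          );
        }
        return index < this.nodes.length ? this.nodes[index++] : null;
      },
    };
  }

  // Snapshot-style result: copies the matching nodes once, at evaluation time.
  snapshot(): { snapshotLength: number; snapshotItem(i: number): string | null } {
    const captured = [...this.nodes];
    return {
      snapshotLength: captured.length,
      snapshotItem: (i: number): string | null =>
        i >= 0 && i < captured.length ? captured[i] : null,
    };
  }
}
```

In this model, appending a node after calling iterate() makes the next iterateNext() throw, while a snapshot taken at the same moment still returns the nodes it captured. Only the matched nodes are copied, not the whole document.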

eNcacz commented 3 months ago

I tried running without showChrome, but it still crashes in the same way.

Then I tried to implement it using snapshot result and it works:

import Hero, {XPathResult} from '@ulixee/hero-playground';

(async () => {
  console.log('Running Hero');
  const hero = new Hero();
  await hero.goto('http://localhost/ulixee/index.html');  // <---- REPLACE THIS URL ACCORDING TO YOUR ENVIRONMENT
  await hero.waitForPaintingStable()
  console.log('Page loaded');

  const document = await hero.document
  const xpResult = document.evaluate('//a', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)

  for (let i = 0; i < await xpResult.snapshotLength; i++) {
    const elem = xpResult.snapshotItem(i)
    console.log(await elem.outerHTML)
  }
  console.log('Done')

  await hero.close();
})();

So I can use it this way. But I still wonder whether hero.xpathSelectorAll can be used somehow, or whether it is definitely broken and document.evaluate is the only way.

Anyway, thanks a lot for your help. This is really appreciated.
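
For anyone reusing the snapshot workaround above, the loop can be wrapped in a small helper. This is a minimal sketch, not part of Hero's API: collectSnapshotItems and SnapshotResultLike are hypothetical names, and the interface only assumes the two members the workaround already uses (snapshotLength, which may need to be awaited, and snapshotItem). It reads the length once instead of awaiting it on every loop iteration, which likely saves round trips to the core.

```typescript
// Hypothetical structural type: covers only the two members the loop needs.
// It is an assumption that Hero's awaited XPathResult fits this shape.
interface SnapshotResultLike<T> {
  snapshotLength: number | PromiseLike<number>;
  snapshotItem(index: number): T;
}

// Hypothetical helper: reads the length once, then collects every item in order.
async function collectSnapshotItems<T>(
  result: SnapshotResultLike<T>,
): Promise<T[]> {
  const length = await result.snapshotLength;
  const items: T[] = [];
  for (let i = 0; i < length; i++) {
    items.push(result.snapshotItem(i));
  }
  return items;
}
```

With the code above, usage would look roughly like: const elems = await collectSnapshotItems(xpResult), followed by a loop that logs await elem.outerHTML for each element.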