ulixee / hero

The web browser built for scraping
MIT License

Descriptive errors for connection issues (503) #202

Open soundofspace opened 1 year ago

soundofspace commented 1 year ago

Some errors are very vague, and this should be improved as much as possible in general. Clearer errors make it easier for users to diagnose issues, and for developers to track down the origin of bug and issue reports.

However, this issue covers only connection errors, to keep the scope focused. Last week and this week ipify was having issues; Hero uses it under the hood to find the publicIp and proxyIp (enabled by default). This behavior should also be documented with exclamation marks, as it is easy to miss, results in two extra requests for every script, and might even leak data some people don't want leaked. The outage resulted in lots of errors, which were hard to diagnose. Some of the errors seen were:

Parse Error: Expected HTTP/
connection refused (503)\n<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">\n<html><head>\n<meta type=\"copyright\" content=\"Copyright (C) 1996-2020 The Squid Software Foundation and contributors\">\n<meta http-equiv=\"Content-Type\" CONTENT=\"text/html; charset=utf-8\">\n<title>ERROR: The requested URL could not be retrieved</title>\n<style type=\"text/css\"><!-- \n /*\n * Copyright (C) 1996-2019 The Squid Software Foundation and contributors\n *\n * Squid software is distributed under GPLv2+ lice

These errors are very vague and don't pinpoint the problem. Even if this is the original error, it should at least be wrapped in a more descriptive error, for example publicIpLookupFailed: actual error.

The second error message can also be thrown for proxy issues (or maybe only for proxy issues, and I was just unlucky to have both at the same time?). This should also be wrapped, for example in proxyConnectError: error.
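A minimal sketch of the wrapping suggested above, in TypeScript. The class, the helper functions, and the two kind names are hypothetical; they follow the names proposed in this comment and are not identifiers from Hero's codebase. The sketch also trims the HTML body that the proxy appended, so the message stays readable while the original error remains attached.

```typescript
// Hypothetical error kinds, named after the suggestions in this thread.
type ConnectionErrorKind = 'publicIpLookupFailed' | 'proxyConnectError';

class DescriptiveConnectionError extends Error {
  readonly kind: ConnectionErrorKind;
  readonly original: Error;

  constructor(kind: ConnectionErrorKind, original: Error) {
    super(`${kind}: ${summarize(original.message)}`);
    this.name = 'DescriptiveConnectionError';
    this.kind = kind;
    this.original = original; // keep the full original error for debugging
  }
}

// Keep only the text before any HTML body, capped at maxLength characters.
function summarize(message: string, maxLength = 120): string {
  const htmlStart = message.search(/<!DOCTYPE|<html/i);
  const head = (htmlStart >= 0 ? message.slice(0, htmlStart) : message).trim();
  return head.length > maxLength ? `${head.slice(0, maxLength)}...` : head;
}

// Accepts anything that was thrown and returns a descriptive wrapper.
function wrapConnectionError(kind: ConnectionErrorKind, error: unknown): DescriptiveConnectionError {
  const original = error instanceof Error ? error : new Error(String(error));
  return new DescriptiveConnectionError(kind, original);
}
```

With this sketch, wrapping the Squid response above as a `proxyConnectError` yields the message `proxyConnectError: connection refused (503)`, and the untrimmed text is still available on `original.message`.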

blakebyrnes commented 1 year ago

> This should also be documented with exclamation marks, as it is easy to miss and results in two extra requests for every script,

Any suggestions how to highlight this?

> might even leak data some people don't want.

What data do you think is being leaked here? The public IP should not be added to your requests, and the proxyIP is already the IP being used on the remote machine. Did you see something unexpected here?

> This should also be wrapped in for example proxyConnectError: error.

Can you share the full error stack for that? I thought it was actually wrapped in a Proxy error. The code is doing so. Maybe it's not properly recreating it on the client.

soundofspace commented 1 year ago

> Any suggestions how to highlight this?

Probably a red note or something similar here: https://ulixee.org/docs/hero/basic-client/hero#constructor. It should explain that Hero does this by default and why (to fix this leak), warn that it makes extra requests and can crash your script if the lookup fails, and list the two alternatives: configure the IPs yourself or disable this plugin.
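As a rough illustration of the "configure IPs" option, the sketch below builds constructor options with both addresses supplied up front. It assumes the option names `upstreamProxyUrl` and `upstreamProxyIpMask` (with `publicIp`, `proxyIp`, and `ipLookupService`) as I understand them from the constructor docs linked above; verify them against the docs for your Hero version. The helper `buildProxyOptions`, the proxy URL, and the addresses are made up for the example.

```typescript
// Assumed shape of the relevant Hero constructor options. Check the
// constructor docs before relying on these names.
interface IpMaskOptions {
  publicIp?: string;        // IP of the host machine, if known ahead of time
  proxyIp?: string;         // IP of the proxy, if known ahead of time
  ipLookupService?: string; // alternative lookup service URL
}

interface ProxyOptions {
  upstreamProxyUrl: string;
  upstreamProxyIpMask?: IpMaskOptions;
}

// Hypothetical helper: attach the IP mask only when both IPs are known.
function buildProxyOptions(
  proxyUrl: string,
  knownIps?: { publicIp: string; proxyIp: string },
): ProxyOptions {
  const options: ProxyOptions = { upstreamProxyUrl: proxyUrl };
  if (knownIps) {
    options.upstreamProxyIpMask = {
      publicIp: knownIps.publicIp,
      proxyIp: knownIps.proxyIp,
    };
  }
  return options;
}

// Example values only (documentation address ranges).
const options = buildProxyOptions('http://proxy.example:3128', {
  publicIp: '198.51.100.10',
  proxyIp: '203.0.113.5',
});
// The result would then be passed on, e.g. `new Hero(options)`.
```

If both addresses are supplied this way, the lookup requests should in principle be unnecessary, but whether Hero actually skips them is worth confirming by watching the outgoing requests.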

> What data do you think is being leaked here? The public IP should not be added to your requests, and the proxyIP is already the IP being used on the remote machine. Did you see something unexpected here?

Leaking data might be an exaggeration, as it only leaks proxy IPs. On a large scale this could give away all of your proxy IPs; not really a big deal, but still. A bigger issue is that it crashes Hero scripts: if this API is unavailable, it will stop pretty much your entire codebase.
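One way to keep an unavailable lookup API from stopping every script is to treat the lookup as optional. The sketch below is a hypothetical design, not how Hero behaves today: it tries a list of lookup functions in order, reports each failure through a callback, and returns `undefined` instead of throwing when all of them fail. The lookups are injected so that no particular service is assumed.

```typescript
// A lookup is any async function that resolves to an IP address string.
type IpLookup = () => Promise<string>;

// Try each lookup in order. Failures are reported but never thrown, so the
// caller can decide to continue without IP masking rather than crash.
async function lookupIpOrSkip(
  lookups: IpLookup[],
  onFailure: (error: Error) => void = () => {},
): Promise<string | undefined> {
  for (const lookup of lookups) {
    try {
      return await lookup();
    } catch (error) {
      onFailure(error instanceof Error ? error : new Error(String(error)));
    }
  }
  return undefined; // every lookup failed
}
```

A caller could pass a primary and a backup service and log each failure. Whether continuing without a known IP is acceptable depends on how much the script relies on the masking.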

soundofspace commented 1 year ago

Error: Failed to execute script: connection refused (503)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta type="copyright" content="Copyright (C) 1996-2020 The Squid Software Foundation and contr...
  File "/app/.yarn/unplugged/@ulixee-unblocked-agent-mitm-socket-npm-2.0.0-alpha.15-2b0ff76d74/agent/mitm-socket/index.ts", line 277, col 13, in buildConnectError
  File "/app/.yarn/unplugged/@ulixee-unblocked-agent-mitm-socket-npm-2.0.0-alpha.15-2b0ff76d74/agent/mitm-socket/index.ts", line 226, col 7, in MitmSocket.triggerConnectErrorIfNeeded
  File "/app/.yarn/unplugged/@ulixee-unblocked-agent-mitm-socket-npm-2.0.0-alpha.15-2b0ff76d74/agent/mitm-socket/index.ts", line 249, col 12, in MitmSocket.onError
  File "/app/.yarn/unplugged/@ulixee-unblocked-agent-mitm-socket-npm-2.0.0-alpha.15-2b0ff76d74/agent/mitm-socket/index.ts", line 202, col 12, in MitmSocket.onMessage
  File "/app/.yarn/unplugged/@ulixee-unblocked-agent-mitm-socket-npm-2.0.0-alpha.15-2b0ff76d74/agent/mitm-socket/lib/MitmSocketSession.ts", line 48, col 41, in MitmSocketSession.onMessage
  File "/app/.yarn/unplugged/@ulixee-unblocked-agent-mitm-socket-npm-2.0.0-alpha.15-2b0ff76d74/agent/mitm-socket/lib/BaseIpcHandler.ts", line 165, col 10, in MitmSocketSession.onIpcData
  File "node:events", line 513, col 28, in Socket.emit
  File "node:domain", line 489, col 12, in Socket.emit
  File "node:internal/streams/readable", line 324, col 12, in addChunk
  File "node:internal/streams/readable", line 297, col 9, in readableAddChunk
  File "/app/.yarn/unplugged/@ulixee-unblocked-agent-mitm-socket-npm-2.0.0-alpha.15-2b0ff76d74/agent/mitm-socket/index.ts", line 70, col 22, in new MitmSocket
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/lib/MitmRequestAgent.ts", line 153, col 24, in MitmRequestAgent.createSocketConnection
  File "node:internal/process/task_queues", line 95, col 5, in process.processTicksAndRejections
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/lib/SocketPool.ts", line 78, col 26, in Object.cb
  File "/app/.yarn/cache/@ulixee-commons-npm-2.0.0-alpha.16-3eae0665d4-e774257404.zip/node_modules/commons/lib/Queue.ts", line 95, col 19, in Queue.next
  File "/app/.yarn/cache/@ulixee-commons-npm-2.0.0-alpha.16-3eae0665d4-e774257404.zip/node_modules/commons/lib/Queue.ts", line 40, col 19, in Queue.run
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/lib/SocketPool.ts", line 60, col 23, in SocketPool.getSocket
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/lib/MitmRequestAgent.ts", line 195, col 35, in MitmRequestAgent.assignSocket
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/lib/MitmRequestAgent.ts", line 72, col 16, in MitmRequestAgent.request
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/handlers/BaseHttpHandler.ts", line 61, col 50, in HttpRequestHandler.createProxyToServerRequest
  File "node:internal/process/task_queues", line 95, col 5, in process.processTicksAndRejections
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/handlers/HttpRequestHandler.ts", line 37, col 36, in HttpRequestHandler.onRequest
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/handlers/HttpRequestHandler.ts", line 316, col 5, in Function.onRequest
  File "/app/.yarn/cache/@ulixee-unblocked-agent-mitm-npm-2.0.0-alpha.15-db2a939a0a-0c5096daab.zip/agent/mitm/lib/MitmProxy.ts", line 220, col 7, in MitmProxy.onHttpRequest
  File "/app/.yarn/cache/@ulixee-commons-npm-2.0.0-alpha.16-3eae0665d4-e774257404.zip/node_modules/commons/lib/Resolvable.ts", line 16, col 18, in new Resolvable
  File "/app/.yarn/cache/@ulixee-commons-npm-2.0.0-alpha.16-3eae0665d4-e774257404.zip/node_modules/commons/lib/utils.ts", line 168, col 10, in createPromise
  File "/app/.yarn/cache/@ulixee-net-npm-2.0.0-alpha.16-36ed53b680-1771ceffea.zip/node_modules/net/lib/PendingMessages.ts", line 47, col 44, in PendingMessages.create
  File "/app/.yarn/cache/@ulixee-net-npm-2.0.0-alpha.16-36ed53b680-1771ceffea.zip/node_modules/net/lib/ConnectionToCore.ts", line 153, col 50, in ConnectionToHeroCore.sendRequest
  File "node:internal/process/task_queues", line 95, col 5, in process.processTicksAndRejections
  File "/app/.yarn/cache/@ulixee-hero-npm-2.0.0-alpha.16-dcd2a28d00-4cf8e995f3.zip/node_modules/client/lib/CoreCommandQueue.ts", line 287, col 12, in CoreCommandQueue.sendRequest
  File "/app/.yarn/cache/@ulixee-hero-npm-2.0.0-alpha.16-dcd2a28d00-4cf8e995f3.zip/node_modules/client/lib/CoreCommandQueue.ts", line 229, col 16, in Object.cb
  File "/app/.yarn/cache/@ulixee-commons-npm-2.0.0-alpha.16-3eae0665d4-e774257404.zip/node_modules/commons/lib/Queue.ts", line 95, col 19, in Queue.next
  File "/app/.yarn/cache/@ulixee-commons-npm-2.0.0-alpha.16-3eae0665d4-e774257404.zip/node_modules/commons/lib/Queue.ts", line 40, col 19, in Queue.run
  File "/app/.yarn/cache/@ulixee-hero-npm-2.0.0-alpha.16-dcd2a28d00-4cf8e995f3.zip/node_modules/client/lib/CoreCommandQueue.ts", line 218, col 8, in CoreCommandQueue.run
  File "/app/.yarn/cache/@ulixee-hero-npm-2.0.0-alpha.16-dcd2a28d00-4cf8e995f3.zip/node_modules/client/lib/CoreTab.ts", line 243, col 36, in CoreTab.goto
  File "/app/.yarn/cache/@ulixee-hero-npm-2.0.0-alpha.16-dcd2a28d00-4cf8e995f3.zip/node_modules/client/lib/Tab.ts", line 190, col 36, in Tab.goto
  File "node:internal/process/task_queues", line 95, col 5, in process.processTicksAndRejections
  File "eval at deserializeClientFnWithArgs (/app/brorun/server/dist/lib/api/resolvers.js:124:12), <anonymous>", line 20, col 1, in eval
  File "/app/brorun/core/src/service.ts", line 536, col 20, in <anonymous>
    return await fn(hero, args);
  File "--------------------------------------------------"
  File "--------------------------------------------------"

blakebyrnes commented 1 year ago

> Leaking data might be an exaggeration, as it only leaks proxy IPs. On a large scale this could give away all of your proxy IPs; not really a big deal, but still. A bigger issue is that it crashes Hero scripts: if this API is unavailable, it will stop pretty much your entire codebase.

Totally agree on the crash. WebRTC and your actual requests are "leaking" your proxy IP anyway, though. I don't know of any way to make a request from a proxy IP without also telling the end site what your proxy IP is...

Part of this ticket should be to re-test the Chrome argument that's supposed to be doing this by default: --force-webrtc-ip-handling-policy=default_public_interface_only. WebRTC is going to share your IP; it's just a matter of what else gets dumped in there. I tested a number of variations of this flag, and none of them properly masked the machine IP in all cases.
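When re-testing the flag, it helps to see exactly which addresses WebRTC hands out. The sketch below parses ICE candidate strings and lists the distinct addresses they contain. It assumes the standard candidate attribute layout (foundation, component, transport, priority, address, port, `typ`, type); the function names and the sample candidates are made up for the example.

```typescript
interface IceCandidateInfo {
  address: string; // IP address (or mDNS name) advertised by the candidate
  type: string;    // e.g. "host", "srflx", "relay"
}

// Parse one candidate line; returns null if it does not look like one.
function parseIceCandidate(candidate: string): IceCandidateInfo | null {
  const fields = candidate.trim().replace(/^a=/, '').split(/\s+/);
  if (fields.length < 8) return null;
  if (fields[0].indexOf('candidate:') !== 0 || fields[6] !== 'typ') return null;
  return { address: fields[4], type: fields[7] };
}

// Collect the distinct addresses exposed by a set of candidate lines.
function exposedAddresses(candidates: string[]): string[] {
  const addresses: string[] = [];
  for (const line of candidates) {
    const info = parseIceCandidate(line);
    if (info && addresses.indexOf(info.address) === -1) addresses.push(info.address);
  }
  return addresses;
}
```

In a browser, the candidate strings would come from the `candidate` property of the candidates delivered to an `RTCPeerConnection`'s `icecandidate` event. Comparing the resulting list against the machine IP and the proxy IP shows what a given flag value actually masks.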

soundofspace commented 1 year ago

Leak was definitely not the right word. But when you use Hero in lots of places for a lot of different websites, it's very hard for those websites to figure out all of your proxy IPs: each sees only a small fraction of the proxies, whereas the API sees them all. Either way, 'leaking' the IP is not a problem, and I shouldn't even have mentioned it.

soundofspace commented 1 year ago

I'm also curious how Chrome masks this (if it even works): a random IP, an external API to get the actual public IP, or something else entirely?

blakebyrnes commented 1 year ago

> I'm also curious how Chrome masks this (if it even works): a random IP, an external API to get the actual public IP, or something else entirely?

I think it's just part of the WebRTC protocol. The whole process is meant to be P2P, so you have to tell other nodes how to find you.