ulixee / secret-agent

The web scraper that's nearly impossible to block - now called @ulixee/hero
https://secretagent.dev
MIT License

agent.goto gets stuck, and missing dependencies #146

Closed A-Posthuman closed 3 years ago

A-Posthuman commented 3 years ago

Hi, just getting started so this is probably something basic I don't understand. I had some test code I was working on under the 1.2.0 alpha build, that was working, but under 1.3.0 alpha it hangs at the first use of agent.goto().

Before I get to that example code, also wanted to report that when trying to first use secret-agent on a fresh server after npm installing it, I ran into some missing dependencies on Ubuntu 20.04 server:

error while loading shared libraries: libX11-xcb.so.1

solved by installing all these helper libraries from the Playwright docker documentation:

https://raw.githubusercontent.com/microsoft/playwright/master/utils/docker/Dockerfile.focal

Perhaps add some similar documentation somewhere in your getting started docs?
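A quick way to see which shared libraries are still missing is to run ldd against the browser binary and keep the lines that report "not found". The sketch below assumes a Linux box with ldd on the PATH; the helper names are made up for illustration, and the binary path is something you have to supply yourself, since this thread does not say where secret-agent stores its Chromium build.

```javascript
const { execFileSync } = require('child_process');

// Extract library names from ldd output lines such as
// "\tlibX11-xcb.so.1 => not found".
function parseMissingLibs(lddOutput) {
  return lddOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.endsWith('=> not found'))
    .map((line) => line.split(' => ')[0]);
}

// Hypothetical helper: run ldd on a binary and return its unresolved libraries.
// binaryPath is an assumption; pass the path of the Chromium executable on your machine.
function missingLibsFor(binaryPath) {
  const output = execFileSync('ldd', [binaryPath], { encoding: 'utf8' });
  return parseMissingLibs(output);
}

module.exports = { parseMissingLibs, missingLibsFor };
```

Calling missingLibsFor with the browser path prints nothing useful on a healthy machine (an empty array) and a list such as the libX11-xcb.so.1 entry above on a machine that still needs packages installed.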

Now back to the code I'm having problems with, I changed it slightly to use different require and constructor style in 1.3.0:

const { Agent } = require('secret-agent');

async function run() {
  const agent = new Agent();
  let url = 'https://example.org';
  console.log('here 1');
  let resource = await agent.goto(url, 5000);
  console.log('here 2');
  let statusCode = await resource.response.statusCode;
  console.log('here 3');
}

run();

When I use node to run this, it prints "here 1" and then hangs seemingly indefinitely. Tried at first with no timeout, then tried adding timeout with no effect.

Lastly I was trying to mess around with some of the other examples using import instead of require, I guess these are mostly for typescript which so far I don't have much experience with, but I noticed they try to import from @secret-agent/full-client, but when I look in my node_modules/@secret-agent dir on my server, I only see a client dir, not full-client. Not sure if that's normal or not.

Thanks for any help!

blakebyrnes commented 3 years ago

Hi @A-Posthuman, sorry you're having issues on 1.3. Great feedback on the playwright install guide. I've looked through some of their code they run right when you launch Chromium the first time that does similar inspection of all the dependencies. Very cool stuff I'd love to incorporate at some point.

Our examples are written to work in the monorepo, which is why they reference "@secret-agent/full-client". That full-client package is what gets deployed as "secret-agent", so they're the same thing. It's definitely an oddity: we haven't yet solved going from monorepo to production seamlessly. You should be able to just switch out the import to run the .js or .mjs files in there (both are JavaScript examples; .mjs files are ECMAScript modules, which Node has supported without a flag since version 13).

For your hanging problem, can you either a) send your session database (located at Path.join(Os.tmpdir(), '.secret-agent')) from the latest hang, or b) add process.env.DEBUG=true to your test file above the import and re-run it?
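For anyone unsure where that directory ends up on their machine, here is a minimal sketch that resolves the same path and lists the database files in it, newest first. The function names are illustrative only and are not part of secret-agent.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Same location as described above: <system temp dir>/.secret-agent
function sessionsDir() {
  return path.join(os.tmpdir(), '.secret-agent');
}

// List the .db files in a directory, most recently modified first.
// Returns an empty array when the directory does not exist yet.
function listSessionFiles(dir = sessionsDir()) {
  if (!fs.existsSync(dir)) return [];
  return fs
    .readdirSync(dir)
    .filter((name) => name.endsWith('.db'))
    .map((name) => ({ name, modified: fs.statSync(path.join(dir, name)).mtime }))
    .sort((a, b) => b.modified - a.modified);
}

console.log('Session directory:', sessionsDir());
console.log(listSessionFiles());
```

On a machine where no session has run yet, the second line simply prints an empty array.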

A-Posthuman commented 3 years ago

Thanks for the careful explanation. I have some more reading to do to fully grok the monorepo stuff. I ran proxy-example.mjs from the examples dir on my 1.3.0 alpha 0 install; the only change was commenting out the proxy config since I don't have one handy to use, but I get an import error:

Error [ERR_MODULE_NOT_FOUND]: Cannot find package '@secret-agent/full-client'

If I change the import line to just be 'secret-agent' then I get a TypeError:

import agent from 'secret-agent';

(async () => {
//   await agent.configure({
//     upstreamProxyUrl: `socks5://${process.env.PROXY_PASS}@proxy-nl.privateinternetaccess.com:1080`,
//   });
  await agent.goto('https://whatsmyip.com/');
  await agent.waitForPaintingStable();
  await agent.close();
})().catch(err => console.log('Caught error in script', err));

Caught error in script TypeError: agent.goto is not a function

I guess I'm still not quite understanding how to convert the examples using the import style to work outside of your monorepo setup?

Regarding the hang problem, I'm attaching the debug log you requested. Let me know if you need anything else, thx!

SecretAgent-debug-log.txt

blakebyrnes commented 3 years ago

@A-Posthuman changing it to import agent from 'secret-agent' is the correct syntax... I'm not sure what's going on quite yet. What version of nodejs are you using? Any chance you could send me the database export too? I'd love to see what Devtools is doing (the messages are logged into the db). It kind of looks like it's hanging before it even has a chance to run a "goto".

Also, are you running these in a project, or from a global install?

FYI - we're actually going to change the monorepo so the @secret-agent/full-client package will be called secret-agent during dev and our examples will be paste-able from the monorepo to regular code.

A-Posthuman commented 3 years ago

I'm pretty new at developing with node, so perhaps I have something misconfigured. The sequence of events is:

  1. Spin up a free tier AWS Ubuntu 20.04 server
  2. Do some basic apt-get'ing, including sudo apt-get install -y nodejs
  3. This seems to come with npm 6.14.10, the current LTS version
  4. Install those other dependencies from the Playwright docker list
  5. In /home/ubuntu, run "npm init -y"
  6. Then run npm install secret-agent. secret-agent is then set up in the package.json file in /home/ubuntu
  7. Upload my test program, try running it with node from /home/ubuntu

Should I try updating npm to the latest version instead of LTS?

I'm attaching the most recent sessions.db you requested

sessions.zip

blakebyrnes commented 3 years ago

Ok, that sounds pretty reasonable. I don't think you have a node version issue. Could you grab the db with the uuid as the name? Sessions.db (the one you sent) only tracks all the individual sessions that occur. It looks like your session that crashed will be called 28ffae40-6758-11eb-aa1c-5f1418d99c25.db in that same directory.
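To tell the per-session databases apart from the Sessions.db index, it is enough to match the file name against a UUID pattern. This is only a sketch based on the file name shown above; the helper name is hypothetical, and it assumes every per-session file follows the same uuid-plus-.db naming.

```javascript
// Matches names like 28ffae40-6758-11eb-aa1c-5f1418d99c25.db
const UUID_DB_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.db$/i;

// True for a per-session database file, false for the index or anything else.
function isSessionDb(fileName) {
  return UUID_DB_PATTERN.test(fileName);
}

// Keep only the per-session databases from a directory listing.
function pickSessionDbs(fileNames) {
  return fileNames.filter(isSessionDb);
}

module.exports = { isSessionDb, pickSessionDbs };
```

Feeding it the result of a directory listing for the .secret-agent folder leaves just the files worth zipping up for a bug report.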

A-Posthuman commented 3 years ago

Ok, here you go:

28ffae40-6758-11eb-aa1c-5f1418d99c25.zip

blakebyrnes commented 3 years ago

Ok, I had to create an ec2 box to figure this out. It's hanging trying to launch Replay since you are on a remote box. You can turn off replay with agent.configure, or with environment variables (you can set env variables in your script as per below, or via command line/bash).

I have a fix for the node modules not quite working - it's a glitch in our npm package publishing. I'll look for a way to avoid trying to launch Replay on a headless machine. Thanks for reporting!!

process.env.SA_SHOW_REPLAY="false";
const agent = require('secret-agent').default;

(async () => {
  await agent.goto('https://whatsmyip.com/');
  await agent.waitForPaintingStable();
  await agent.close();
})().catch(err => console.log('Caught error in script', err));

A-Posthuman commented 3 years ago

Just a follow-up on the import issue: after upgrading to 1.3.0-alpha.1, instead of a TypeError I'm now getting a different error. Sample code from the proxy-example.mjs example:

process.env.SA_SHOW_REPLAY="false";

import agent from 'secret-agent';

(async () => {
//   await agent.configure({
//     upstreamProxyUrl: `socks5://${process.env.PROXY_PASS}@proxy-nl.privateinternetaccess.com:1080`,
//   });
  await agent.goto('https://whatsmyip.com/');
  await agent.waitForPaintingStable();
  await agent.close();
})().catch(err => console.log('Caught error in script', err));

results in:

$ node SecretAgentTest3.mjs
internal/process/esm_loader.js:74
    internalBinding('errors').triggerUncaughtException(
                              ^

Error [ERR_MODULE_NOT_FOUND]: Cannot find module '/home/ubuntu/node_modules/secret-agent/index.mjs' imported from /home/ubuntu/SecretAgentTest3.mjs
Did you mean to import secret-agent/index.js?
    at finalizeResolution (internal/modules/esm/resolve.js:276:11)
    at moduleResolve (internal/modules/esm/resolve.js:699:10)
    at Loader.defaultResolve [as _resolve] (internal/modules/esm/resolve.js:810:11)
    at Loader.resolve (internal/modules/esm/loader.js:86:40)
    at Loader.getModuleJob (internal/modules/esm/loader.js:230:28)
    at ModuleWrap.<anonymous> (internal/modules/esm/module_job.js:56:40)
    at link (internal/modules/esm/module_job.js:55:36) {
  code: 'ERR_MODULE_NOT_FOUND'
}

blakebyrnes commented 3 years ago

We'll publish ECMAScript modules correctly one of these days... 🤦

We're not copying the .mjs files into the build, so I'm 🤞 that's the only other thing I'm missing. It works well locally! :)
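Until a build with the .mjs files is published, one way to confirm what an installed copy actually contains is to check which entry files exist in the package folder. This is a rough diagnostic sketch, not part of secret-agent: the function name is made up, and the install location (node_modules/secret-agent under the current working directory) is an assumption taken from the error message earlier in this thread.

```javascript
const fs = require('fs');
const path = require('path');

// Report which of the two entry files are present in a package directory.
function entryFiles(packageDir) {
  const result = {};
  for (const name of ['index.js', 'index.mjs']) {
    result[name] = fs.existsSync(path.join(packageDir, name));
  }
  return result;
}

// Assumed install location; adjust if your project lives elsewhere.
const packageDir = path.join(process.cwd(), 'node_modules', 'secret-agent');
console.log(entryFiles(packageDir));
```

If index.mjs shows up as false while index.js is true, the import form will keep failing with ERR_MODULE_NOT_FOUND, and the require form from the earlier reply is the one to use in the meantime.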