ulixee / secret-agent

The web scraper that's nearly impossible to block - now called @ulixee/hero
https://secretagent.dev
MIT License
667 stars 44 forks

TakeScreenshot cannot take a full-page screenshot (beyond viewport) #322

Open andynuss opened 3 years ago

andynuss commented 3 years ago

The first issue is that after using this snippet to create a use-once agent:

const myagent = new Agent({
  userAgent: ua,
  viewport: {
    screenHeight: 1024,
    screenWidth: 768,
    height: 1024,
    width: 768,
  }
});

and then when done scraping calling:

await myagent.close();

works once after starting my node service that runs this function, but on subsequent calls to the same function I get this error in the node console:

2021-08-04T18:32:42.557Z ERROR [/Users/andy/repos/test-repo/app/node_modules/@secret-agent/client/connections/ConnectionFactory] Error connecting to core {
  error: 'Error: connect ECONNREFUSED 127.0.0.1:63738',
  context: {},
  sessionId: null,
  sessionName: undefined
} Error: connect ECONNREFUSED 127.0.0.1:63738
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1146:16) {
  errno: -61,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 63738
}

The second surprise is the screenshot itself, taken with:

const scrollHeight: number = await myagent.executeJs(() => {
  return document.scrollingElement.scrollHeight;
});

let buffer: Buffer;
buffer = await myagent.takeScreenshot({
  format: 'png',
  rectangle: {
    scale: 1,
    height: scrollHeight,
    width: 1024,
    x: 0,
    y: 0,
  }
});

I used this url: https://www.whatsmyua.info

The visible text in the screenshot is not centered, as one would expect for this page, but is more or less left-justified. A large portion of the page is also clipped even though I passed the scrollHeight, which I confirmed had not grown after the screenshot was taken.

The third problem is that calling takeScreenshot this way fails with an error, even though typescript tells me rectangle is optional:

buffer = await myagent.takeScreenshot({
  format: 'png',
});

Hope I didn't do something stupid!

blakebyrnes commented 3 years ago

Thanks for reporting.

Can you share any more code or logs from 1? Secret Agent tracks everything in session databases for each "agent session" (https://secretagent.dev/docs/advanced/session)

For 2, can you include your screenshot that got generated?

For 3, I think I broke that while trying to fix a different issue. Thanks for catching it.

andynuss commented 3 years ago

On 1, I found that this happens, for some reason, when an http server calls my test scraping function, and not when I call the function several times in a row from the same nodejs "thread".

so here is my test function's typescript file:

/* eslint-disable no-console */
import { Agent } from 'secret-agent';
import ExecuteJsPlugin from '@secret-agent/execute-js-plugin';
import * as fs from 'fs';

const ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.165 Safari/537.36';

export async function testUrl(requestUrl: string, imageName: string): Promise<void> {
  const myagent = new Agent({
    userAgent: ua,
    viewport: {
      screenHeight: 1024,
      screenWidth: 768,
      height: 1024,
      width: 768,
    }
  });

  try {
    myagent.use(ExecuteJsPlugin);
    await myagent.goto(requestUrl);
    await myagent.waitForPaintingStable();

    const getScrollHeight = async (): Promise<number> => {
      // eslint-disable-next-line @typescript-eslint/ban-ts-comment
      // @ts-ignore
      const scrollHeight: number = await myagent.executeJs(() => {
        // eslint-disable-next-line @typescript-eslint/ban-ts-comment
        // @ts-ignore
        return document.scrollingElement.scrollHeight;
      });
      console.log('scrollHeight', scrollHeight, 'for', requestUrl);
      return scrollHeight;
    };

    const takeScreenshot = async (scrollHeight: number): Promise<void> => {
      const buffer: Buffer = await myagent.takeScreenshot({
        format: 'png',
        rectangle: {
          scale: 1,
          height: Math.max(scrollHeight, 768),
          width: 1024,
          x: 0,
          y: 0,
        }
      });
      fs.writeFileSync('../screenshots/' + imageName + '.png', buffer, 'binary');
    };

    const height1 = await getScrollHeight();
    await takeScreenshot(height1);

    const height2 = await getScrollHeight();
    if (height2 > height1) {
      console.log('oops: scrollHeight increased after taking screenshot.  Taken too soon?');
      await takeScreenshot(height2);
    }

    console.log(requestUrl + ' ok');
  } finally {
    try {
      await myagent.close();
    } catch (e) {
      console.log('unexpected error closing new Agent', e);
    }
  }
}

// (async() => {
//   await testUrl('https://example.org', 'example');
//   await testUrl('https://www.whatsmyua.info', 'myua');
// })();

andynuss commented 3 years ago

and here is my http server, written in javascript, that calls the compiled typescript function above:

/* eslint-disable prefer-template */
/* eslint-disable no-console */
const _http = require('http');
const { testUrl } = require('./test');

function ProcessExit(num) {
  process.exit(num);
}

function TextResponse(res, txt) {
  res.writeHead(200, {
    'Content-Type': 'text/plain'
  });
  res.write(txt);
  res.end();
}

function StartServer() {
  console.log('listening for scrape requests');

  const server = _http.createServer((req, res) => {
    let data = '';
    req.on('data', (chunk) => {
      data += chunk;
    });
    req.on('end', () => {
      let json;
      let err;
      try {
        json = JSON.parse(data);
      } catch (e) {
        err = e;
      }
      if (err) {
        console.log('could not parse json request:', err);
        TextResponse(res, 'could not parse json request: ' + err);
      } else if (!json) {
        TextResponse(res, 'falsy json request: ' + json);
      } else if (typeof json.requestUrl !== 'string') {
        TextResponse(res, 'json.requestUrl invalid not a string');
      } else if (typeof json.imageName !== 'string') {
        TextResponse(res, 'json.imageName invalid not a string');
      } else {
        (async() => {
          let err2;
          try {
            await testUrl(json.requestUrl, json.imageName);
          } catch (e) {
            err2 = e;
          }
          if (err2) {
            console.log('testUrl failed:', err2);
            TextResponse(res, 'testUrl failed: ' + err2);
          } else {
            TextResponse(res, 'created image in server: ' + json.imageName);
          }
        })();
      }
    });
  });

  server.setTimeout(0);
  server.listen(8888);
}

(async() => {
  try {
    StartServer();
  } catch (e) {
    console.log(e);
    if (e.stack)
      console.log('' + e.stack);
    ProcessExit(1);
  }
})();

andynuss commented 3 years ago

and here's how I invoked it from java (by running this standalone java file a second time while the node service is running):

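import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;

// LiteString, Common, StrictJsonEncoder, ExtStream, LengthInputStream, InputStreams
// and HttpStatusCodeException are not JDK classes; they are helper classes not shown here.
public class SecretAgentProxy {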

  private static void Test () throws IOException
  {
    String serverUrl = "http://localhost:8888";
    String requestUrl = "https://www.whatsmyua.info";
    String imageName = "my-image-" + (Common.randomInt(100) + 1);
    HashMap<LiteString, LiteString> json = new HashMap<>(2);
    json.put(LiteString.cons("requestUrl"), LiteString.cons(requestUrl));
    json.put(LiteString.cons("imageName"), LiteString.cons(imageName));
    LiteString sjson = StrictJsonEncoder.encode(json);

    // REFACTOR AUG: need to abstract most all of this everywhere I am using it
    // which is a TON of places, call it PostProxy
    //
    byte[] ba;
    ba = ExtStream.toArray(sjson);

    LengthInputStream body = null;
    try {
      HttpURLConnection conn = null;
      try {
        conn = (HttpURLConnection)(new URL(serverUrl).openConnection());
        conn.setConnectTimeout(15*1000);
        conn.setReadTimeout(45*1000);
        conn.setDoOutput(true);
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Accept-Charset", "UTF-8");
        conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8");
        conn.setRequestProperty("Content-Length", Integer.toString(ba.length));

        OutputStream os = conn.getOutputStream();
        try {
          os.write(ba);
          os.flush();
        } finally {
          os.close();
        }

        int status = conn.getResponseCode();
        if (status != 200)
          throw new HttpStatusCodeException(serverUrl, status);
        body = InputStreams.capture(conn.getInputStream(), true);

        try (InputStreamReader reader = new InputStreamReader(body, "UTF-8")) {
          System.out.println(ExtStream.readFully(reader));
        }
      } finally {
        if (conn != null)
          conn.disconnect();
      }
    } finally {
      if (body != null)
        body.close();
    }
  }

  public static void main (String[] args)
  {
    try {
      Test();
    } catch (Throwable t) {
      t.printStackTrace();
    }
  }
}

andynuss commented 3 years ago

I'll have to figure out how to take a session trace in a little while if you still need that.

andynuss commented 3 years ago

my-image-67

andynuss commented 3 years ago

NOTE: in my testUrl function above, waitForPaintingStable() doesn't seem to work as well as it should, even for this pretty simple webpage. On my machine, the scrollHeight obtained right after waitForPaintingStable was 1413, but when I asked for it again after taking the screenshot and writing it to a file, it was 1971. That is why the code "retakes" the screenshot: to make sure a too-early capture wasn't part of why the screenshot is clipped.

andynuss commented 3 years ago

Here's the session db:

sessions.db.zip

blakebyrnes commented 3 years ago

I think what's happening is that you are using the default "full-client" SecretAgent "connection", which is built for single-use scrapes, and the first call to close triggers its auto-shutdown (think of booting up a script and wanting the whole thing to tear down when you close). I think you'll get more reliable behavior by spinning up a CoreServer and pointing your agents at that persistent server (SecretAgent already comes with a client/server setup - https://secretagent.dev/docs/advanced/remote). You can run the server in the same process as your existing server if you want; it doesn't have to be a separate process.

Regarding paintingStable - that event is specifically geared around the page being visible above the fold, not "all content loaded". You can add a "domContentLoaded" trigger to wait for the page to be fully "loaded" as well.
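
As a rough illustration of that gap (this is not a Secret Agent API), a caller could poll the page height until it stops changing before taking the screenshot. In the sketch below, waitForStableHeight and its parameters are hypothetical names, and readHeight stands in for something like the getScrollHeight function earlier in the thread.

```typescript
// Hypothetical helper (not part of Secret Agent): poll a height reader until it
// returns the same value `stableReads` times in a row, or give up after `timeoutMs`.
async function waitForStableHeight(
  readHeight: () => Promise<number>,
  stableReads = 3,
  intervalMs = 250,
  timeoutMs = 10000,
): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  let last = await readHeight();
  let unchanged = 1; // consecutive reads that returned `last`

  while (unchanged < stableReads) {
    if (Date.now() > deadline) {
      throw new Error(`height still changing after ${timeoutMs}ms (last=${last})`);
    }
    // Pause between reads so the page has time to lay out more content.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    const next = await readHeight();
    unchanged = next === last ? unchanged + 1 : 1;
    last = next;
  }
  return last;
}
```

The defaults (3 matching reads, 250 ms apart) are arbitrary. A page that keeps loading content lazily would need a different stopping rule.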

With your screenshot, it seems like your viewport width & height are mismatched in your screenshot rectangle. Could that be why it's showing up with a strange shape? I guess it doesn't explain the x/y..
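
One way to avoid that kind of mismatch, sketched here with hypothetical names (fullPageRectangle is not part of Secret Agent), is to derive the rectangle from the same viewport object that is passed to the Agent instead of typing the numbers twice.

```typescript
// Hypothetical types and helper for illustration only.
interface Viewport {
  width: number;
  height: number;
}

interface ScreenshotRectangle {
  x: number;
  y: number;
  width: number;
  height: number;
  scale: number;
}

function fullPageRectangle(viewport: Viewport, scrollHeight: number): ScreenshotRectangle {
  if (!isFinite(scrollHeight) || scrollHeight <= 0) {
    throw new RangeError(`invalid scrollHeight: ${scrollHeight}`);
  }
  return {
    x: 0,
    y: 0,
    // Reuse the viewport width so the two values cannot drift apart.
    width: viewport.width,
    // Never request less than one viewport of height.
    height: Math.max(Math.ceil(scrollHeight), viewport.height),
    scale: 1,
  };
}
```

The result could then be passed as the rectangle option of takeScreenshot. This only keeps the two sets of numbers consistent.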

andynuss commented 3 years ago

Thanks for the help. You were right about the viewport having switched the width and height. However, after fixing my code with everything you mentioned above, the screenshot still is clipped even though the height specified in takeScreenshot is always the full scrollHeight of 1971 for this url.

I don't see anything else that could explain the clipping. In fact, it now appears that although the scrollHeight is 1971, and the screenshot image height is indeed 1971 and includes the proper background for the full 1971, the text content inside the dom somehow looks like it is being clipped to the viewport's height of 768. Is this possible?

(Here's the fixed code)

/* eslint-disable no-console */
import { Agent, ConnectionFactory, ConnectionToCore, LocationStatus } from 'secret-agent';
import ExecuteJsPlugin from '@secret-agent/execute-js-plugin';
import * as fs from 'fs';

const ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.165 Safari/537.36';

let sharedConnection: ConnectionToCore = null;

function getConnection (): ConnectionToCore {
  if (sharedConnection !== null) return sharedConnection;
  sharedConnection = ConnectionFactory.createConnection({
    maxConcurrency: 4,
  });
  return sharedConnection;
}

export async function testUrl(requestUrl: string, imageName: string): Promise<void> {
  const myagent = new Agent({
    userAgent: ua,
    viewport: {
      screenHeight: 768,
      screenWidth: 1024,
      height: 768,
      width: 1024,
    },
    connectionToCore: getConnection(),
  });

  try {
    myagent.use(ExecuteJsPlugin);
    await myagent.goto(requestUrl);
    await myagent.waitForPaintingStable();
    await myagent.activeTab.waitForLoad(LocationStatus.DomContentLoaded);

    const getScrollHeight = async (): Promise<number> => {
      // @ts-ignore
      const scrollHeight: number = await myagent.executeJs(() => {
        // @ts-ignore
        return document.scrollingElement.scrollHeight;
      });
      console.log('scrollHeight', scrollHeight, 'for', requestUrl);
      return scrollHeight;
    };

    const takeScreenshot = async (scrollHeight: number): Promise<void> => {
      const buffer: Buffer = await myagent.takeScreenshot({
        format: 'png',
        rectangle: {
          scale: 1,
          height: Math.max(scrollHeight, 768),
          width: 1024,
          x: 0,
          y: 0,
        }
      });
      fs.writeFileSync('../screenshots/' + imageName + '.png', buffer, 'binary');
    };

    const height1 = await getScrollHeight();
    await takeScreenshot(height1);
    console.log(requestUrl + ' ok');
  } finally {
    try {
      await myagent.close();
    } catch (e) {
      console.log('unexpected error closing new Agent', e);
    }
  }
}

blakebyrnes commented 3 years ago

Can you see if the latest version helps your screenshot issue if you provide no rectangle?

blakebyrnes commented 3 years ago

Scratch that. I see it happening. No need to try it

blakebyrnes commented 3 years ago

NOTE for implementation: it looks like in Chromium you have to change the visualViewport to take a full-page screenshot and then restore it. We need to think through what that means from a detection perspective.

andynuss commented 2 years ago

Hi, I was wondering if this has turned out to be difficult to fix from the standpoint of bot detection, since I noticed that the behavior is still the same as of the latest version. What exactly would be the detection exposure if a quick-and-dirty fix were to be done? Is it possible that you could point us to an easy approach and we could take the risk of detection ourselves in some kind of plugin?

blakebyrnes commented 2 years ago

@andynuss - I just haven't gotten to this. There's a lot of stuff on the plate to do, and this one just hasn't made it to the top of the priorities yet. You could give a plugin a try or a PR - I think for a plugin, you'd just want to be able to set the page to the full length of the page (here's how puppeteer does that: https://github.com/puppeteer/puppeteer/blob/327282e0475b1b680471cce6b9e74ecc14fd6536/src/common/Page.ts#L2664)
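
A minimal sketch of that sequence, assuming a plugin can get hold of a raw DevTools protocol session. The CdpSession interface and captureFullPage are hypothetical names, not Secret Agent APIs; the protocol methods (Page.getLayoutMetrics, Emulation.setDeviceMetricsOverride, Page.captureScreenshot) are, as far as I can tell, the ones the linked Puppeteer code relies on. The device scale factor of 1 is also an assumption.

```typescript
// Hypothetical minimal shape of a DevTools protocol session.
interface CdpSession {
  send(method: string, params?: Record<string, unknown>): Promise<any>;
}

interface Size {
  width: number;
  height: number;
}

// Returns the screenshot as a base64-encoded PNG string.
async function captureFullPage(cdp: CdpSession, viewport: Size): Promise<string> {
  // 1. Ask the browser how large the laid-out document is.
  const metrics = await cdp.send('Page.getLayoutMetrics');
  const content = metrics.cssContentSize ?? metrics.contentSize;
  const width = Math.ceil(content.width);
  const height = Math.ceil(content.height);

  // 2. Temporarily grow the emulated viewport to the full document size.
  await cdp.send('Emulation.setDeviceMetricsOverride', {
    mobile: false,
    width,
    height,
    deviceScaleFactor: 1,
  });

  try {
    // 3. Capture the whole document area.
    const result = await cdp.send('Page.captureScreenshot', {
      format: 'png',
      clip: { x: 0, y: 0, width, height, scale: 1 },
    });
    return result.data;
  } finally {
    // 4. Put the original viewport back, even if the capture failed.
    await cdp.send('Emulation.setDeviceMetricsOverride', {
      mobile: false,
      width: viewport.width,
      height: viewport.height,
      deviceScaleFactor: 1,
    });
  }
}
```

The resize in step 2 may be observable from inside the page (for example through resize events), which is presumably the detection exposure to weigh before using something like this.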