unblocked-web / double-agent

A test suite of common scraper detection techniques. See how detectable your scraper stack is.

unclear how well non-interactive crawler stacks are supported #60

Open GlenDC opened 2 years ago

GlenDC commented 2 years ago

When looking at https://stateofscraping.org one can see that Scrapy and Curl are also tested. It is, however, unclear how I can support such a non-interactive stack myself with this framework.

An assignment contains pages for multiple plugins, and each page has the following interface:

https://github.com/ulixee/double-agent/blob/89c194335b6c0382ac4d1dce235898d242db0a02/collect/interfaces/ISessionPage.ts#L1-L7
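
For reference, the rough shape of that interface, as inferred from how the runner code below uses each page (see the linked file for the authoritative definition):

// Rough shape only, inferred from usage in the runner code further down;
// the linked ISessionPage.ts is the authoritative definition.
interface ISessionPage {
  url: string;
  isRedirect?: boolean;
  waitForElementSelector?: string;
  clickElementSelector?: string;
  clickDestinationUrl?: string;
}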

In a way this question is related to #58, but this one really focuses on how we would implement such a non-interactive stack runner ourselves. It could help if the curl and Scrapy stacks were also implemented in this repo, given that they are already shown in those test results.

GlenDC commented 2 years ago

By the way, I tried some days ago to simply ignore these click/wait interaction tasks and just go to the URLs, so perhaps that's just the way to go? It's not clear either way. It could be a nice addition to the documentation to make clear why these steps are required and which tests cannot be supported if one does not complete these parts of the assignments.

Perhaps, in my proposal to make the runners usable as a library, it could also make sense to be able to configure the capabilities of the stack (e.g. no interaction, no screen, no JS, etc.), because for stacks that lack certain capabilities it is a bit pointless to receive pages they have no ability to complete.
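
To make that concrete, here is a purely hypothetical sketch of what such a capability declaration could look like; none of these names exist in double-agent today:

// Purely illustrative: a runner could declare what its stack can do, so the
// assignment generator/analyzer can skip or zero-score checks that require
// capabilities the stack does not have.
interface IStackCapabilities {
  javascript: boolean;   // can execute JS (false for curl/Scrapy-style stacks)
  interaction: boolean;  // can click elements / wait for selectors
  screen: boolean;       // has a rendered viewport
}

const curlCapabilities: IStackCapabilities = {
  javascript: false,
  interaction: false,
  screen: false,
};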

GlenDC commented 2 years ago

I made an example in my own repo with a curl implementation, as an easy-to-reproduce example. It makes use of some custom code and is an evolution of the interface part of PR https://github.com/ulixee/double-agent/pull/56.

If I run that runner, however, I get errors, and most (if not all) tests are reported as missing, so it seems that stacks without the possibility of interaction (time/keyboard/mouse) aren't supported out of the box?

mac-os-10-11--chrome-89-0 IS MISSING browser-codecs
mac-os-10-11--chrome-89-0 IS MISSING browser-dom-environment
mac-os-10-11--chrome-89-0 IS MISSING browser-fingerprints
mac-os-10-11--chrome-89-0 IS MISSING http-assets
/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/lib/CheckGenerator.js:25
                throw new Error(`no cookies created for ${key}`);
                      ^

Error: no cookies created for http-SubDomainRedirect
    at CheckGenerator.extractChecks (/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/lib/CheckGenerator.js:25:23)
    at new CheckGenerator (/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/lib/CheckGenerator.js:13:14)
    at HttpCookies.runIndividual (/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/index.js:27:32)
    at Analyze.addIndividual (/my-fork-based-repo/double-agent/analyze/index.js:66:38)
    at analyzeAssignmentResults (/my-fork-based-repo/double-agent/runner/lib/analyzeAssignmentResults.js:40:31)
    at async configureTestAndAnalyzeStack (/my-fork-based-repo/stack-common/lib/stack.js:65:5)

Which makes me wonder how you got to the results for curl as mentioned in https://github.com/ulixee/double-agent/issues/58?

Runner Code:

import { IRunner, IRunnerFactory } from '@double-agent/runner/interfaces/runner';
import IAssignment from '@double-agent/collect-controller/interfaces/IAssignment';
import ISessionPage from '@double-agent/collect/interfaces/ISessionPage';

import util from 'util';
import { exec as execNonPromise } from 'child_process';
const exec = util.promisify(execNonPromise);

class CurlRunnerFactory implements IRunnerFactory {
  public runnerId(): string {
      return 'curl';
  }

  public async startFactory() {
    return;  // nothing to manage, we'll spawn on the fly
  }

  public async spawnRunner(assignment: IAssignment): Promise<IRunner> {
    return new CurlRunner(assignment.userAgentString);
  }

  public async stopFactory() {
    return;
  }
}

class CurlRunner implements IRunner {
  userAgentString: string;
  lastPage?: ISessionPage;

  constructor(userAgentString: string) {
    this.userAgentString = userAgentString;
  }

  public async run(assignment: IAssignment) {
    console.log('--------------------------------------');
    console.log('STARTING ', assignment.id, assignment.userAgentString);
    let counter = 0;
    try {
      for (const pages of Object.values(assignment.pagesByPlugin)) {
        counter = await this.runPluginPages(assignment, pages, counter);
      }
      console.log(`[%s.✔] FINISHED ${assignment.id}`, assignment.num);
    } catch (err) {
      console.log('[%s.x] Error on %s', assignment.num, this.lastPage?.url, err);
      process.exit();
    }
  }

  async runPluginPages(
    assignment: IAssignment,
    pages: ISessionPage[],
    counter: number,
  ) {
    let isFirst = true;
    let currentPageUrl;
    for (const page of pages) {
      this.lastPage = page;
      const step = `[${assignment.num}.${counter}]`;
      if (page.isRedirect) continue;
      if (isFirst || page.url !== currentPageUrl) {
        console.log('%s GOTO -- %s', step, page.url);
        const statusCode = await fetchResource(page.url, this.userAgentString);
        // Track the last fetched URL so the same page is not requested twice in a row.
        currentPageUrl = page.url;
        if (statusCode >= 400) {
          console.error(`${statusCode}, url: ${page.url}`);
          continue;
        }
      }
      isFirst = false;

      if (page.waitForElementSelector) {
        console.log('%s waitForElementSelector -- %s: Ignore no support by curl', step, page.waitForElementSelector);
      }

      if (page.clickElementSelector) {
        console.log('%s Wait for clickElementSelector -- %s: Ignore no support by curl', step, page.clickElementSelector);
      }
      counter += 1;
    }

    return counter;
  }

  async stop() {
    return;
  }
}

async function fetchResource(url: string, userAgentString: string): Promise<number> {
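  // Shells out to the curl CLI; assumes curl is on PATH and that the user-agent
  // string contains no single quotes (the command below does no shell escaping).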
  const { stdout } = await exec(`curl -k -s -o /dev/null -w "%{http_code}" -H 'user-agent: ${userAgentString}' -XGET '${url}'`);
  return parseInt(stdout.trim());
}

export { CurlRunnerFactory };

GlenDC commented 2 years ago

Related to this, I would also suggest that the analyze code perhaps not fail hard on such errors, but instead treat them as the test having failed, giving a score of 0 on that test, as that is essentially what it boils down to.
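
Purely as an illustration of what I mean (hypothetical names, not the actual analyze API), the analyzer could catch a plugin error and record it as a zero score instead of aborting:

// Hypothetical sketch only: the plugin and result shapes below are illustrative
// and do not match double-agent's real analyze API.
interface IIndividualPlugin {
  id: string;
  runIndividual(profile: unknown): number;
}

function runPluginSafely(plugin: IIndividualPlugin, profile: unknown) {
  try {
    return { pluginId: plugin.id, score: plugin.runIndividual(profile) };
  } catch (error) {
    // Missing data (e.g. "no cookies created for ...") becomes a failed test
    // with score 0 instead of crashing the whole analysis run.
    return { pluginId: plugin.id, score: 0, error: String(error) };
  }
}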

blakebyrnes commented 2 years ago

@GlenDC I'm not sure the best way to give some overall info to your PRs here.

In version one of DoubleAgent, we had one big combined repo that had:

This proved to just be WAY too confusing to come into (as evidenced by @calebjclark trying to add some stuff into it).

We also started thinking about how to create results that a normal human being could reason through. With the old scraper report it was hard to understand what was actually wrong when you failed a test. A lot of @calebjclark's work was translating our results into something that looked like pseudo-code on the new website design.

After the re-organization, we ended up with:

This brings us back to your original question - below is the CURL implementation in that repo. NOTE: this hasn't been updated/run in a while. I'm not sure how well it currently runs.

// Curl comes from the node-libcurl package; forEachAssignment was a helper
// in the old combined repo (import path omitted here).
import { Curl } from 'node-libcurl';

forEachAssignment({ scraperFrameworkId }, async assignment => {
  const curl = new Curl();
  curl.setOpt('USERAGENT', assignment.useragent);
  curl.setOpt('SSL_VERIFYPEER', 0);
  curl.setOpt('COOKIEJAR', __dirname + '/cookiejar.txt');
  curl.setOpt('COOKIESESSION', 1);
  curl.setOpt('FOLLOWLOCATION', 1);
  curl.setOpt('AUTOREFERER', 1);

  for (const pages of Object.values(assignment.pagesByPlugin)) {
    for (const page of pages) {
      console.log(page);
      if (curl.getInfo(Curl.info.EFFECTIVE_URL) !== page.url) {
        try {
          console.log('GET ', page.url);
          await httpGet(curl, page.url);
        } catch (error) {
          console.log(`ERROR getting page.url: ${page.url}`, error);
          throw error;
        }
      }
      if (page.clickDestinationUrl) {
        try {
          console.log('GET click dest', page.clickDestinationUrl);
          await httpGet(curl, page.clickDestinationUrl);
        } catch (error) {
          console.log(`ERROR getting page.clickDestinationUrl: ${page.clickDestinationUrl}`, error);
          throw error;
        }
      }
    }
  }
  curl.close();
}).catch(console.log);

async function httpGet(curl: Curl, url: string) {
  curl.setOpt('URL', url);
  const finished = new Promise((resolve, reject) => {
    curl.on('end', resolve);
    curl.on('error', reject);
  });
  curl.perform();
  await finished;
}
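
For context on how this works: node-libcurl's perform() runs the transfer asynchronously and emits 'end' or 'error', which is why httpGet wraps those events in a promise before awaiting. Reusing one Curl handle with COOKIEJAR set keeps cookies across requests, while FOLLOWLOCATION and AUTOREFERER handle redirects and their Referer headers automatically, roughly approximating a browser session across the pages of an assignment.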

Moving to other repos:


Hopefully this helps give some background. Back to your questions:

GlenDC commented 2 years ago

If you'd like, I do not mind helping with the design and development of that.

In my opinion you're already pretty close to allowing people to run against their own stacks. Either way, that's a bit out of scope here, as I handled that part in issue #59. There I have also already shown how I did it, with minimal work; your repo was pretty much there. I do not expect, nor do I think it is realistic given the time budget constraints, that it all has to be super shiny and fancy. The goal was simply to avoid having to modify the double-agent code in order to make it testable against one's own stack, and to ensure one doesn't have to pull in dependencies from the stacks implemented by double-agent as examples.

As far as I'm concerned that part is done, except for the part where we would have to find some alignment, if that is possible at all. Then it would just be about documenting some bits and getting to work on the results.

Furthermore, I am certainly not looking for fancy error reporting; if something looks like pseudo-code or just very verbose output files, I honestly do not mind. And again, I also do not mind contributing to that part, I just need to find a way to work together, if you fancy that idea.

At this stage of my fork of double-agent (and honestly the code changes are pretty minimal, I would think), all that would still be required is the ability for one to generate their own assignments and analyse them. Once that is done, the repo is as flexible as one can hope for, while still being usable out of the box as it is today for the example stacks :)

GlenDC commented 2 years ago

By the way, what I'm really looking for is something similar to stateofscraping but automated, in whatever format (I do not mind that part): a way to figure out which checks fail and which succeed for each of the individual layers and categories. That, and the ability to also plug in some custom checks where desired.
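
For illustration, the kind of machine-readable output I have in mind could look something like this (a hypothetical shape, not something double-agent emits today):

// Hypothetical result shape; purely illustrative of the desired output,
// not an existing double-agent format.
interface ICheckResult {
  layer: string;     // e.g. 'http', 'tcp', 'browser'
  category: string;  // e.g. 'http-basic-cookies'
  checkId: string;
  passed: boolean;
}

type IStackReport = ICheckResult[];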

I can contribute dev time to this as well as ideas. My hope was that I could achieve that with double-agent, but the reports out of the analyzer do not tell me much, if anything at all.