matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Improve communication with driver #197

Open bisubus opened 8 years ago

bisubus commented 8 years ago

The suggestion is to improve communication with the driver.

The concept of a 'driver' suits most simple cases, but interaction with the driver is possible only at configuration time, not at run time.

The template for an X-ray driver is:

function driverFactory(opts) {
  return function driver(ctx, callback) { ... }
}

There may be no driver factory function at all, or it may be a class; the result is the same: a driver (driver function) can get configuration-time data from the parent scope or from some context (if supplied to X-ray as driver.bind(context)).

Configuration data may be changed from the outside at some point, but there is no guarantee that the change will affect only one request and not the others. It is not possible to pass data for a specific request.

It appears that run-time data is supplied to the driver as ctx with no extra arguments, and ctx.url is the only thing that can be directly affected through the X-ray API.
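
To illustrate the shared-configuration problem, here is a minimal sketch that follows the driver template above. It uses no X-ray code at all: the driver is called by hand, the callback receives a stub object instead of a fetched page, and the names opts.token, seen and record are made up for the example.

```javascript
// Hypothetical driver factory in the shape of the template above.
// The "response" is a stub object; a real driver would fetch ctx.url.
function driverFactory(opts) {
  return function driver(ctx, callback) {
    // opts is read when the request runs, so the driver sees
    // whatever value the shared object holds at that moment.
    callback(null, { url: ctx.url, token: opts.token });
  };
}

const opts = { token: 'A' };
const driver = driverFactory(opts);

const seen = [];
const record = (err, res) => seen.push(res);

driver({ url: 'http://example.com/1' }, record);
opts.token = 'B'; // meant for the second request only...
driver({ url: 'http://example.com/2' }, record);
driver({ url: 'http://example.com/3' }, record); // ...but the third gets it too
```

The only per-request input in this sketch is ctx.url; everything else is shared state, so a change made for one request leaks into every request that runs after it.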

bisubus commented 8 years ago

To my knowledge, the only hooks that can be used to intercept and transform data are filters. Suppose we need to pass some data from the current field to the following request. The data is run-time state: a session ID or any other value that the driver should take into account on the next request.

This works as long as the data is supposed to be passed in the URI via GET:

// Declared first: a const cannot be referenced before its declaration runs.
const tokenInterceptor = (token) => `${URL}?token=${token}`;

const x = Xray({ filters: { tokenInterceptor } }).driver(driver);

x('...', [{
  nested: x('[data-token]@data-token | tokenInterceptor', '.nested'),
  ...
}])

This may even work for complex scenarios where several fields are supposed to be used together, because the filter function has access to this.$, which can be used to make custom selections.

The limitations appear when the passed data is supposed to be used by the driver as anything but a URL: a cookie, a header, POST data, you name it. The current design makes this impossible, not just complicated. The only way is to break the chain, iterate through the results and adjust the driver to use the relevant cookie/header/whatever... looks like we've just efficiently defeated the purpose of a crawler and returned to the low-level http.request+cheerio tandem.

The ctx context is formed by x-ray-crawler, and the only thing it accepts from the outside (namely, from a filter function) is a URL. It is not possible to just pass an object instead of a URL, because URLs are validated before they are passed to the following request. To be passed to the driver, data must be serialized and appended to the URL. The filter becomes:

const tokenInterceptor = (token) => {
  return URL + SEPARATOR + JSON.stringify({
    method: 'POST',
    header: { 'Token': token },
    data: { token }
  });
}

Then the driver splits the payload from the URL and deserializes it. Finally, there is a way to switch to POST and pass all the data that belongs to the request. That was easy.
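
For completeness, the driver side of the hack might look roughly like the sketch below. Everything in it is an assumption made for illustration: the SEPARATOR value, the helper names attachPayload, splitPayload and makeDriver, and the injected send function that stands in for whatever HTTP client the driver really uses. The payload is also passed through encodeURIComponent, which the filter above does not do; whether the resulting string survives X-ray's URL validation is not something I have verified.

```javascript
// Assumed separator; any string that cannot occur in a real URL would do.
const SEPARATOR = '#xray-payload=';

// Filter side: append the serialized request options to the URL.
function attachPayload(url, payload) {
  return url + SEPARATOR + encodeURIComponent(JSON.stringify(payload));
}

// Driver side: recover the plain URL and the request options.
function splitPayload(rawUrl) {
  const index = rawUrl.indexOf(SEPARATOR);
  if (index === -1) return { url: rawUrl, options: {} };
  const encoded = rawUrl.slice(index + SEPARATOR.length);
  return {
    url: rawUrl.slice(0, index),
    options: JSON.parse(decodeURIComponent(encoded)),
  };
}

// Hypothetical driver: `send(url, options, callback)` is injected so that
// the sketch stays independent of any particular HTTP library.
function makeDriver(send) {
  return function driver(ctx, callback) {
    const { url, options } = splitPayload(ctx.url);
    send(url, options, callback);
  };
}
```

A URL without the separator passes through untouched, so ordinary GET requests keep working with the same driver.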

I see no way to overcome the design flaws other than by an ugly serialization hack. How can the situation be improved?