matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.87k stars 349 forks

Sanitize/Modify data before returning (middleware of sorts)? #140

Closed ishan-marikar closed 8 years ago

ishan-marikar commented 8 years ago

I'm writing a simple crawler that extracts information from a yellow pages site in our country, and I'm having trouble sanitizing or modifying the data before it is sent back to the crawler.

Currently what I have is

x(URL, '.jd-table.jd-item', [{
  name: 'div.jd-itemTtile',
  telephone: 'span.jd-fields-li-value.val-horizontal',
  address: 'div.jd-itemAddress' // Field that needs to be sanitized
}])(function(error, response) {
  if (error) throw error;
  console.log(response);
});

The address field returns something like:

\r\n\r\n\n                            Address: 123, Mock Street, Earth                                         

Is there any way to run the following on the string before it is passed to the response? address.replace(/(\r\n|\n|\r)/gm,'').replace(/Address/gi,'').trim();

I hope this makes sense. Thank you.

0xgeert commented 8 years ago

AFAIK there isn't. However, nothing is stopping you from writing a couple of middleware-like functions yourself to post-process the output.


ishan-marikar commented 8 years ago

Thank you, @gebrits. I'm having a bit of difficulty finding the documentation for x-ray. Would you be kind enough to let me know how I could write middleware-like functions in x-ray?

0xgeert commented 8 years ago

Didn't mean to confuse you. X-ray doesn't provide you anything to that end. I just meant that you could write some helper functions for sanitization and include them in your flow when needed.

Something like:

var transformers = {
  cleanupWhitespace: function(response) {
    // your cleanup code here
    return response; // return the cleaned-up response
  }
};

x(URL, '.jd-table.jd-item', [{
  name: 'div.jd-itemTtile',
  telephone: 'span.jd-fields-li-value.val-horizontal',
  address: 'div.jd-itemAddress' // Field that needs to be sanitized
}])(function(error, response) {
  if (error) throw error;

  response = transformers.cleanupWhitespace(response);

  console.log(response);
});
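To make the helper idea concrete, here is a minimal sketch of what such a cleanup function could look like. It assumes the response is an array of objects whose `address` field is a string, as in the original post. The names `cleanAddress` and `cleanupResponse` are hypothetical and not part of x-ray, and the regex also drops the colon after "Address", which the original snippet left in place.

```javascript
// Hypothetical helper: strips line breaks and the "Address:" label from one string.
function cleanAddress(raw) {
  return raw
    .replace(/(\r\n|\n|\r)/gm, '')
    .replace(/Address:?/gi, '')
    .trim();
}

// Hypothetical helper: returns a new array with each item's address cleaned,
// leaving the other fields and the original objects untouched.
function cleanupResponse(items) {
  return items.map(function (item) {
    return Object.assign({}, item, { address: cleanAddress(item.address) });
  });
}

// Sample data shaped like the scraped result in the original post.
var sample = [{
  name: 'Mock Co',
  address: '\r\n\r\n\n      Address: 123, Mock Street, Earth      '
}];

console.log(cleanupResponse(sample));
```

Inside the x-ray callback this would be called as `response = cleanupResponse(response);` before the data is used.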


klmdb commented 8 years ago

Try something like this:

      address: function($, cb) {
        // .text() extracts the element's text so the string methods can run on it
        var address = $.find('div.jd-itemAddress').text().replace(/(\r\n|\n|\r)/gm,'').replace(/Address/gi,'').trim();
        cb(null, address);
      },

Kikobeats commented 8 years ago

This makes more sense with filters; see this PR: https://github.com/lapwinglabs/x-ray/pull/145

I'm waiting for the author of the PR to update it and fix the tests.
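As a rough sketch of how a filter-based version might look: the x-ray README describes a `filters` option on the constructor and a `|` separator in selector strings. Whether the installed version supports this is an assumption here, so the x-ray wiring is left in comments. The filter name `cleanAddress` is hypothetical.

```javascript
// A filter is a plain function: it receives the extracted value
// and returns the transformed value.
var filters = {
  // Hypothetical filter name; not something x-ray ships with.
  cleanAddress: function (value) {
    if (typeof value !== 'string') return value; // pass non-strings through unchanged
    return value
      .replace(/(\r\n|\n|\r)/gm, '')
      .replace(/Address:?/gi, '')
      .trim();
  }
};

// Assumed wiring, only if the installed x-ray version supports filters:
// var x = require('x-ray')({ filters: filters });
// x(URL, '.jd-table.jd-item', [{
//   name: 'div.jd-itemTtile',
//   address: 'div.jd-itemAddress | cleanAddress'
// }])(function (error, response) {
//   if (error) throw error;
//   console.log(response);
// });

// The filter can be checked on its own, without x-ray:
console.log(filters.cleanAddress('\r\n  Address: 123, Mock Street, Earth  '));
```

Keeping the filter as a standalone function means it can be tested without hitting the site, and reused in the callback approach if filters turn out to be unavailable.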