Closed ishan-marikar closed 8 years ago
afaik there isn't. However, nothing is stopping you from writing a couple of middleware-like functions yourself to parse the output
On Sun, Feb 28, 2016 at 2:46 PM, ishan-marikar notifications@github.com wrote:
I'm writing a simple crawler that extracts information from a yellow pages site in our country, and I'm having trouble sanitizing or modifying the data before it is sent back to the crawler.
Currently what I have is
x(URL, '.jd-table.jd-item', [{
  name: 'div.jd-itemTtile',
  telephone: 'span.jd-fields-li-value.val-horizontal',
  address: 'div.jd-itemAddress' // Field that needs to be sanitized
}])(function(error, response) {
  if (error) throw error;
  console.log(response);
});
The address field returns something like: "\r\n\r\n\n Address: 123, Mock Street, Earth "
Is there a way to run this on the string before it ends up in the response: address.replace(/(\r\n|\n|\r)/gm,'').replace(/Address/gi,'').trim();
I hope this makes sense. Thank you.
— Reply to this email directly or view it on GitHub https://github.com/lapwinglabs/x-ray/issues/140.
Thank you, @gebrits. I'm having a bit of difficulty finding the documentation for x-ray. Would you be kind enough to let me know how I could write middleware-like functions in x-ray?
Didn't mean to confuse you. X-ray doesn't provide you anything to that end. I just meant that you could write some helper functions for sanitization and include them in your flow when needed.
Something like:
var transformers = {
  cleanupWhitespace: function(response) {
    // your cleanup code here, e.g. run the replace/trim chain on each item's address
    return response;
  }
};
x(URL, '.jd-table.jd-item', [{
  name: 'div.jd-itemTtile',
  telephone: 'span.jd-fields-li-value.val-horizontal',
  address: 'div.jd-itemAddress' // Field that needs to be sanitized
}])(function(error, response) {
  if (error) throw error;
  response = transformers.cleanupWhitespace(response);
  console.log(response);
});
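A minimal, self-contained sketch of the helper approach above, with the scraped result replaced by a hard-coded sample so it runs without x-ray. The names `cleanAddress` and `cleanResponse` and the sample item are made up for illustration. Note that removing only the word "Address" would leave the colon behind, so this version strips the whole "Address:" label:

```javascript
// Hypothetical helper: strip line breaks and the leading "Address:" label.
function cleanAddress(raw) {
  if (typeof raw !== 'string') return raw; // pass non-string values through unchanged
  return raw
    .replace(/(\r\n|\n|\r)/g, '')   // drop line breaks
    .replace(/Address:?/i, '')      // drop the label, with or without the colon
    .trim();
}

// Hypothetical helper: return a cleaned copy of every scraped item.
function cleanResponse(items) {
  return items.map(function (item) {
    return Object.assign({}, item, { address: cleanAddress(item.address) });
  });
}

// Sample data standing in for what x-ray passes to the callback.
var sample = [
  { name: 'Mock Company', address: '\r\n\r\n\n Address: 123, Mock Street, Earth ' }
];

console.log(cleanResponse(sample));
```

Inside the x-ray callback this would be used as `response = cleanResponse(response);`.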
Try something like this:
address: function($, cb){
  // .find() returns a selection, so read its text before calling string methods
  var address = $.find('div.jd-itemAddress').text().replace(/(\r\n|\n|\r)/gm,'').replace(/Address/gi,'').trim();
  cb(null, address);
},
This would make more sense with filters; see this PR: https://github.com/lapwinglabs/x-ray/pull/145
I'm waiting for the author of the PR to update it and fix the tests.
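For reference, a sketch of what the filter approach could look like, assuming an x-ray version that accepts a `filters` option and the `selector | filter` syntax proposed in that PR. The filter names `stripNewlines` and `stripLabel` are hypothetical, and the x-ray call is left as a comment so the snippet runs on its own:

```javascript
// Hypothetical filters; each takes the extracted value and returns the cleaned one.
var filters = {
  stripNewlines: function (value) {
    return typeof value === 'string' ? value.replace(/(\r\n|\n|\r)/g, '') : value;
  },
  stripLabel: function (value) {
    return typeof value === 'string' ? value.replace(/Address:?/i, '').trim() : value;
  }
};

// Assumed usage, only if the installed x-ray supports filters:
//   var x = require('x-ray')({ filters: filters });
//   x(URL, '.jd-table.jd-item', [{
//     address: 'div.jd-itemAddress | stripNewlines | stripLabel'
//   }])(callback);

// The filters are plain functions, so they can be checked without scraping.
var raw = '\r\n\r\n\n Address: 123, Mock Street, Earth ';
console.log(filters.stripLabel(filters.stripNewlines(raw)));
```

Keeping each filter small and tolerant of non-string input means a missing field does not throw in the middle of a crawl.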