medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.
http://medialab.github.io/sandcrawler/
GNU Lesser General Public License v3.0

body.match is undefined #185

Closed. kevinrademan closed this issue 3 years ago

kevinrademan commented 8 years ago

I got the body.match is undefined error while using the crawler. One of the links on the site I was crawling redirected to https://www.facebook.com/unsupportedbrowser

The static engine then throws an error on this line: https://github.com/medialab/sandcrawler/blob/master/src/engines/static.js#L119

Changing that line to the following fixes the problem:

var m = (new Buffer(body)).toString("utf8").match(/<meta.*?charset=([^"']+)/);

Would you recommend that for a fix?

Yomguithereal commented 8 years ago

What does the body variable contain in your case? Is it undefined or null?

jmtoball commented 8 years ago

Hey, pitching in here as I was facing the same issue. The body variable contains a Buffer object, while a String is probably expected. Looking at the code, I guess the problem is that no encoding is set for the request. Once I set an encoding using spider.config({encoding: "utf8"}), it worked fine.
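
To illustrate the cause outside of sandcrawler, here is a minimal sketch in plain Node.js. It assumes only what is described above: the body arrives as a Buffer, and the charset lookup uses the regular expression quoted earlier in this thread. The helper name sniffCharset and the sample HTML are hypothetical and not part of sandcrawler.

```javascript
// Hypothetical helper, NOT part of sandcrawler: normalise a response body
// to a string before searching it for a <meta ... charset=...> declaration.
function sniffCharset(body) {
  // A Buffer has no .match method, so decode it to a string first.
  // Anything that is not a Buffer is coerced with String().
  const text = Buffer.isBuffer(body) ? body.toString('utf8') : String(body);

  // Same regular expression as in the fix proposed in this thread.
  const m = text.match(/<meta.*?charset=([^"']+)/);
  return m ? m[1] : null;
}

// Sample document (an assumption made for this illustration only).
const html =
  '<html><head>' +
  '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">' +
  '</head><body></body></html>';

// What the engine appears to receive when no encoding is configured.
const asBuffer = Buffer.from(html, 'utf8');

console.log(typeof asBuffer.match);  // "undefined": calling it would throw
console.log(sniffCharset(asBuffer)); // works on a Buffer
console.log(sniffCharset(html));     // works on a string too
```

Checking Buffer.isBuffer before decoding keeps the string case untouched, so the same code path should work whether or not an encoding is configured.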

So this is probably related to https://github.com/medialab/sandcrawler/issues/177 to some degree ;-)