rchipka / node-osmosis

Web scraper for NodeJS
4.12k stars, 247 forks

Scraping doesn't work on malformed HTML pages #151

Open eashish93 opened 7 years ago

eashish93 commented 7 years ago

I tried the following code to scrape IMDb, but it doesn't work because IMDb returns malformed HTML. I know this can be handled with process_response, which accepts a callback function fn(data), but in this case that means cleaning the markup with an external dependency, which is not ideal. So please relax the strict XML mode so that malformed HTML is processed automatically.

var osmosis = require('osmosis');

osmosis
    .get('http://www.imdb.com/title/tt0848228/')
    .find('body')
    .set('body')
    .data(function(data) {
        console.log(data);   // logs an empty result
    });

Other frameworks such as x-ray do handle this page:

    xray('http://www.imdb.com/title/tt0848228/', 'body')(console.log)

alimgafar commented 7 years ago

I'd like to know if this issue is going to be addressed. Thank you!

oliv23 commented 6 years ago

+1. I'm willing to give it a shot and try to patch this. @rchipka, could you point me in the right direction as to where to look and what to change? Thanks!

rchipka commented 6 years ago

@oliv23 I believe Osmosis already sets libxmljs to use its non-strict error recovery mode. This mode cannot recover from certain errors. If there's another libxml setting that we're missing, that would be the way to fix this.
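
Until such a setting is found, one possible stopgap is a dependency-free cleanup step of the kind process_response is described as accepting earlier in this thread. The sketch below is only an illustration. It assumes the parse failure is caused by stray control characters in the response, which nobody here has confirmed for the IMDb page, and that the hook receives the raw body as a string or Buffer. `stripInvalidControlChars` is a hypothetical helper name, not part of osmosis or libxmljs.

```javascript
// Hypothetical helper: NOT part of osmosis or libxmljs.
// Removes control characters that XML 1.0 does not allow, i.e. everything
// below U+0020 except tab (U+0009), line feed (U+000A) and carriage
// return (U+000D).
function stripInvalidControlChars(data) {
    // Assumption: the body may arrive as a Buffer or a string, so
    // normalize it to a string before cleaning.
    var html = Buffer.isBuffer(data) ? data.toString('utf8') : String(data);
    return html.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, '');
}

// Stand-alone check with a small made-up document.
var dirty = '<body>\u0000<p>Hi\u0008</p>\n</body>';
console.log(JSON.stringify(stripInvalidControlChars(dirty)));
```

If the hook does work as described above, a function like this could be passed as the process_response callback. It will not repair structural problems such as unclosed or mis-nested tags, which would still need a more tolerant parser.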