matthewmueller / x-ray

The next web scraper. See through the <html> noise.

What causes 'write after end' errors when paginating? #267

Closed. kanethal closed this issue 5 years ago.

kanethal commented 7 years ago

Subject of the issue

x-ray sometimes continues paginating past the limit I set. My current project pulls product links from an index page on our website, then paginates for more. In this case it should return 50 links (25 per page) and then call a callback, using the following syntax:

    x(startURL, scope, targets)(cb)
        .paginate(paginate)
        .limit(limit)
        .write(writeToFile);

    function cb(e, d) {
        console.log('crawl ended', { e: e, d: d && d.length });
        if (e) { rej(e); }
        else if (!d) { rej(new Error('no data from ' + url)); }
        else { res(d); }
    }
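
For context, res, rej, and url are not defined in the snippet above; they presumably come from an enclosing Promise executor, roughly like the following sketch (the crawl wrapper and its name are assumptions, and paginate, limit, and writeToFile are taken to be defined as in the original code):

    // Assumed surrounding scope (not shown in the question): the crawl is
    // wrapped in a Promise so that res, rej, and url are in scope for cb.
    var Xray = require('x-ray');
    var x = Xray();

    function crawl(url, scope, targets) {
        return new Promise(function (res, rej) {
            function cb(e, d) {
                console.log('crawl ended', { e: e, d: d && d.length });
                if (e) { rej(e); }
                else if (!d) { rej(new Error('no data from ' + url)); }
                else { res(d); }
            }

            // Same call pattern as in the question.
            x(url, scope, targets)(cb)
                .paginate(paginate)
                .limit(limit)
                .write(writeToFile);
        });
    }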

Usually the callback is called and the scraper delivers its data gracefully, even though I can see that an additional page has been scraped after the prescribed two pages (I am using a filter that logs the current URL to the console). Sometimes, however, I get the following error:

    { error:
         Error: write after end
             at writeAfterEnd (_stream_writable.js:193:12)
             at WriteStream.Writable.write (_stream_writable.js:240:5)
             at WriteStream.Writable.end (_stream_writable.js:477:10)
             at _stream_array (D:\Dropbox\Dropbox\Apps\new scrape\node_modules\x-ray\lib\stream.js:26:16)
             at next (D:\Dropbox\Dropbox\Apps\new scrape\node_modules\x-ray\index.js:112:13)
             at D:\Dropbox\Dropbox\Apps\new scrape\node_modules\x-ray\index.js:243:7
             at D:\Dropbox\Dropbox\Apps\new scrape\node_modules\x-ray\lib\walk.js:56:12
             at callback (D:\Dropbox\Dropbox\Apps\new scrape\node_modules\batch\index.js:147:12)
             at D:\Dropbox\Dropbox\Apps\new scrape\node_modules\x-ray\lib\walk.js:49:9
             at D:\Dropbox\Dropbox\Apps\new scrape\node_modules\x-ray\index.js:232:24,
        data: undefined } }
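
For reference, Node.js raises this error whenever write() is called on a Writable stream after end() has already been called. A minimal sketch, independent of x-ray (the file name is arbitrary):

    // Minimal reproduction of 'write after end' with a plain Writable stream.
    var fs = require('fs');

    var out = fs.createWriteStream('demo.txt');
    out.on('error', function (err) {
        console.error(err.message); // "write after end"
    });

    out.write('first chunk\n');
    out.end();                   // the stream is finished here
    out.write('second chunk\n'); // too late: emits the error above

The stack trace above suggests the same thing happening inside x-ray: results from an extra page arrive after the stream backing the scrape has already been ended.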

What does the 'write after end' error signify, and how can I change my scraper syntax to avoid this problem?


Expected behaviour

The scraper should return the data from 2 pages (50 links).

Actual behaviour

It returns data from 2 pages but logs 3 pages' worth of data to the console, or it throws a 'write after end' error and does not deliver any data.

lathropd commented 5 years ago

Your callback invocation should come after the chaining of .paginate(), .limit(), etc. Also, I don't think using a callback is compatible with .write().
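
A sketch of the ordering described above, reusing the names from the question (startURL, scope, targets, paginate, limit, writeToFile, and cb are assumed to be defined as in the original snippet):

    // Chain .paginate()/.limit() first, then invoke the result with the
    // callback...
    x(startURL, scope, targets)
        .paginate(paginate)
        .limit(limit)(cb);

    // ...or keep .write() and drop the callback entirely:
    x(startURL, scope, targets)
        .paginate(paginate)
        .limit(limit)
        .write(writeToFile);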