notslang / instagram-screen-scrape

scrape public instagram data w/out API access
https://npmjs.com/package/instagram-screen-scrape
GNU General Public License v3.0

Any idea of the error? Tks! #15

Closed zeta-o closed 8 years ago

zeta-o commented 8 years ago
events.js:74
W20160705-21:23:11.568(-6)? (STDERR)         throw TypeError('Uncaught, unspecified "error" event.');
W20160705-21:23:11.569(-6)? (STDERR)               ^
W20160705-21:23:11.570(-6)? (STDERR) TypeError: Uncaught, unspecified "error" event.
W20160705-21:23:11.571(-6)? (STDERR)     at TypeError (<anonymous>)
W20160705-21:23:11.571(-6)? (STDERR)     at InstagramComments.emit (events.js:74:15)
W20160705-21:23:11.571(-6)? (STDERR)     at Stream._getCommentsPage.on.on.hasMoreComments (/.../node_modules/instagram-screen-scrape/lib/comments.js:112:22)
W20160705-21:23:11.571(-6)? (STDERR)     at Stream.emit (events.js:95:17)
W20160705-21:23:11.571(-6)? (STDERR)     at Request.<anonymous> (/.../node_modules/instagram-screen-scrape/lib/util.js:25:24)
W20160705-21:23:11.571(-6)? (STDERR)     at Request.emit (events.js:95:17)
W20160705-21:23:11.572(-6)? (STDERR)     at Request.onRequestResponse (/.../node_modules/request/request.js:977:10)
W20160705-21:23:11.572(-6)? (STDERR)     at ClientRequest.emit (events.js:95:17)
W20160705-21:23:11.572(-6)? (STDERR)     at HTTPParser.parserOnIncomingClient [as onIncoming] (http.js:1744:21)
W20160705-21:23:11.572(-6)? (STDERR)     at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:152:23)
zeta-o commented 8 years ago

It's happening when I try to access users with a lot of posts.

notslang commented 8 years ago

Thanks for the bug report! It shouldn't have anything to do with users that have a lot of posts, since this is coming from the comment scraping code. Could you post the code (or command) you're using that is causing this?

zeta-o commented 8 years ago

I'm getting info from a user that has a lot of posts, then iterating over all of the posts, and at some point getting the comments for a specific post fails. I worked around the problem by catching the error from the emitter:

streamOfPosts.on('error', function(msg1, msg2) { console.log('ERR: ' + msg1 + ' ' + msg2 ) })

As you can see, I capture it from the posts stream, but the error comes from the comments.

This is my code:

let postComplete = ( post ) => {
    return new Promise( ( resolve, reject ) => {
        var rawData = new Array();
        streamOfComments = new InstagramComments({post: post.id});
        streamOfComments.on('data', function(comment){
          rawData.push(comment);
        });
        streamOfComments.on('end', function(){
          resolve(rawData);
        });
    });
  }

  exports.callInstagramScrape = Meteor.bindEnvironment(function(findValue) {
    var allPosts = [];
    streamOfPosts = new InstagramPosts({username: findValue});
    streamOfPosts.on('data', function(post) {
      Fiber(function(){
        postComplete(post).then(function(postCompleted){
          post.usersComments = postCompleted;
          allPosts.push(post);
          return allPosts;
        });
      }).run();
    });

    streamOfPosts.on('error', function(msg1, msg2) { console.log('ERR: ' + msg1 + ' ' + msg2 ) })

    streamOfPosts.on('end', function(){
      sumarizeTagsAndUsers(allPosts); 
    });

  }, function(e){
    console.log("Error ->"+e);
  });
ibruno commented 8 years ago

Maybe this script can only get 12 posts (the ones shown before the "load more" button)?

notslang commented 8 years ago

@ibruno: No need for speculation: you can check what the script does right here: https://github.com/slang800/instagram-screen-scrape/blob/2b246ad31976c4f66a2642e25e5fbf216f199b22/lib/posts.coffee

@zeta-o: I'm not sure what Fiber is in that context, or why you're building up an array, rather than feeding a stream to sumarizeTagsAndUsers(), so the summary can be calculated while you wait for the network. Also, the reason why you're getting an uncaught exception is probably because you don't have any error handling attached to streamOfComments. Anyway, here's how I'd do it:

var BPromise, InstagramComments, InstagramPosts, getUserComments,
  getUserPostsAndComments, map, pump, pumpCb, ref

ref = require('./lib')
InstagramComments = ref.InstagramComments
InstagramPosts = ref.InstagramPosts

pumpCb = require('pump')
BPromise = require('bluebird')
map = require('through2')

pump = BPromise.promisify(pumpCb)

getUserComments = function (post) {
  var allComments = []
  return pump(new InstagramComments({
    post: post
  }), map.obj(function (comment, enc, cb) {
    allComments.push(comment)
    cb()
  })).then(function () {
    return allComments
  })
}

getUserPostsAndComments = function (username) {
  var allPosts = []
  return pump(new InstagramPosts({
    username: username
  }), map.obj(function (post, enc, cb) {
    getUserComments(post.id).then(function (comments) {
      post.usersComments = comments
      allPosts.push(post)
      cb()
    }).catch(cb)
  })).then(function () {
    return allPosts
  })
}

getUserPostsAndComments('slang800').then(
  JSON.stringify
).then(
  console.log.bind(console)
)
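
On the uncaught exception itself: `Uncaught, unspecified "error" event` is what Node's EventEmitter throws when an 'error' event is emitted and nothing is listening for it. A minimal sketch of handling that in a hand-written promise wrapper, using a plain EventEmitter as a stand-in for InstagramComments (`collectStream` and `fakeComments` are made-up names for illustration, not part of this library):

```javascript
const { EventEmitter } = require('events')

// Collect every 'data' item from a stream-like emitter into an array.
// Rejects the promise instead of crashing when the emitter reports an error.
function collectStream (stream) {
  return new Promise(function (resolve, reject) {
    const items = []
    stream.on('data', function (item) { items.push(item) })
    stream.on('error', reject) // without this listener, emit('error') throws
    stream.on('end', function () { resolve(items) })
  })
}

// Hypothetical stand-in for a comment stream that fails part-way through.
const fakeComments = new EventEmitter()
const result = collectStream(fakeComments)
fakeComments.emit('data', { text: 'first' })
fakeComments.emit('error', new Error('request failed'))

result.catch(function (err) {
  console.log('comment scrape failed: ' + err.message)
})
```

The pump-based version above gets the same effect without the manual listeners, since pump passes a stream error to its callback.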

getUserPostsAndComments should return a promise that gives the data in the exact same format as your code above, or gives an error. And with the code as it is above, scraping an account with a couple thousand posts probably will result in an error, since any one of those 10k+ network requests could fail... We should probably add a way to retry requests or something to mitigate that problem. That being said, I tested this out on an account with a couple hundred posts and it returned all of them, with comments, correctly.
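
Until the library has something built in, a retry can be layered on from the outside. This is only a rough sketch: `retry` and `flakyRequest` are hypothetical helpers, not part of instagram-screen-scrape, and it assumes the wrapped function returns a promise and is safe to call again from scratch.

```javascript
// Call fn() until it resolves, making at most `attempts` calls in total and
// waiting `delayMs` between tries. Rejects with the last error seen.
function retry (fn, attempts, delayMs) {
  return fn().catch(function (err) {
    if (attempts <= 1) throw err
    return new Promise(function (resolve) {
      setTimeout(resolve, delayMs)
    }).then(function () {
      return retry(fn, attempts - 1, delayMs)
    })
  })
}

// Hypothetical flaky operation: fails twice, then succeeds.
let calls = 0
function flakyRequest () {
  calls += 1
  return calls < 3
    ? Promise.reject(new Error('network error'))
    : Promise.resolve('page ' + calls)
}

retry(flakyRequest, 5, 10).then(function (value) {
  console.log(value + ' after ' + calls + ' calls') // page 3 after 3 calls
})
```

With the code above, that could look like `retry(function () { return getUserComments(post.id) }, 3, 1000)` inside the posts transform. Each attempt builds a fresh InstagramComments stream, so a retry starts that post's comments over rather than resuming.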

Also, the code above doesn't scrape comments in parallel, so you should probably be scraping multiple accounts at once, if you want to even get close to fully utilizing your network connection.
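
One way to scrape several accounts at once without firing every request at the same time is a small concurrency limiter. Again a sketch under assumptions: `mapLimit` and `fakeScrape` are made-up names, and `fakeScrape` only stands in for a promise-returning function like getUserPostsAndComments.

```javascript
// Run worker(item) for every item, with at most `limit` calls in flight.
// Resolves with the results in the same order as `items`.
function mapLimit (items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  function runNext () {
    if (next >= items.length) return Promise.resolve()
    const index = next++
    return Promise.resolve(worker(items[index])).then(function (value) {
      results[index] = value
      return runNext() // this lane picks up the next unclaimed item
    })
  }
  const lanes = []
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    lanes.push(runNext())
  }
  return Promise.all(lanes).then(function () { return results })
}

// Hypothetical stand-in for getUserPostsAndComments(username).
function fakeScrape (username) {
  return new Promise(function (resolve) {
    setTimeout(function () { resolve(username + ': done') }, 10)
  })
}

mapLimit(['alice', 'bob', 'carol'], 2, fakeScrape).then(function (results) {
  console.log(results)
})
```

If any worker rejects, the returned promise rejects with that error, so pairing this with some form of retry is probably a good idea for long scrapes.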

ibruno commented 8 years ago

Yeah! Using max_id, it can get more.

zeta-o commented 8 years ago

Thanks @slang800 for taking the time to help me, and of course thanks for creating this library. It's helped me a lot with my project. I'm going to change my approach and try to follow your idea.