medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License

Spider that crawls pages through ajax "click" #257

Open suntong opened 7 years ago

suntong commented 7 years ago

The first parameter of artoo.ajaxSpider is,

urlList array | function : the list of urls to request through ajax or, alternatively, a function taking as arguments the index of the iteration and the data of the last request, and returning either the desired url or false to break the spider.
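To illustrate the function form described above, here is a minimal sketch. The helper name `makeUrlList`, the `baseUrl` argument and the `p` page parameter are all assumptions made for illustration, and the sketch assumes the iteration index starts at 0.

```javascript
// Hypothetical helper: builds a urlList function for a paginated listing.
// `baseUrl` and the `p` query parameter are assumptions, not part of artoo.
function makeUrlList(baseUrl, pageCount) {
  return function (i) {
    // Assumes the iteration index `i` starts at 0.
    if (i >= pageCount) return false; // returning false breaks the spider
    return baseUrl + '&p=' + (i + 1); // otherwise return the next url
  };
}
```

The returned function could then be passed as the first argument of artoo.ajaxSpider in place of an array of urls.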

This means that artoo's ajaxSpider only follows urls that are either given in advance or computed. However, is it possible for artoo's ajaxSpider to:

  1. Find the "Next"-page url from the first page, then
  2. "Click" and follow that url onto the following pages?

The reason I'm asking is that GitHub code search only works in the browser. Nothing else works. That is, if you click or paste the following url, you will only get the "We could not perform this search" error.

https://github.com/search?utf8=%E2%9C%93&q=%22github.com%2Fgoadesign%2Fgoa%2Fdesign%2Fapidsl%22+language%3Ago&type=Code&ref=searchresults

However, if you do a GitHub code search in the browser, searching for

"github.com/goadesign/goa/design/apidsl" language:go

Then click on the second choice on the left, "Code", and you will get "We’ve found 294 code results" and a list of all the hits. If you compare the urls, you will find that this "working" url is exactly the same as the one above. Try different search terms, and try pasting the "working" url into a new browser window five minutes later: you will find that it no longer works.

This is why I need artoo's ajaxSpider to click and follow that "Next" url. Is this possible? Thanks!

suntong commented 7 years ago

Here is the script that I prepared to make it easier for you to get started,

var definition = {
  iterator: 'div.code-list > div.code-list-item',
  data: {
    FullName: {sel: 'p > a:nth-child(1)'}, // extract Text!
    FileName: {sel: 'p > a:nth-child(2)'}, // extract Text!
    UpdatedAt: {sel: 'span.updated-at > relative-time', attr: 'datetime'},
    Language: {sel: 'span.language'} // extract Text!
  }
};

artoo.ajaxSpider(
  function(i) {
    // TODO: click and follow the "href" of "div.pagination a.next_page"
  }, {
    scrape: definition,
    limit: 9,
    concat: true,
    throttle: 500,
    done: function(data) {
      artoo.log.debug('Finished retrieving data. Downloading...');
      artoo.saveCsv(data);
    }
  }
);

Yomguithereal commented 7 years ago

Hello @suntong. The ajaxSpider only uses ajax and therefore retrieves only static HTML. What you need is either to find the correct way to query GitHub (by spoofing the user-agent, etc., so the site believes you are a regular user) or to use a more complex solution such as PhantomJS or a Chrome extension that acts as an automaton. Alternatively, you can try sandcrawler. I can try to help you, but unfortunately not before next week.

Alternatively, this might be a stupid question but isn't the Github API able to give you the information you need? I expect them to have pretty good scraping & crawling defenses.

suntong commented 7 years ago

Thank you Guillaume for your reply.

isn't the Github API able to give you the information you need?

I thought so, but having failed to get it, and double-checked that I've done nothing wrong, I wrote Github a question, and here is what they replied:

(now) it's not possible to perform global code searches via the API as mentioned in this blog post:

https://developer.github.com/changes/2013-10-18-new-code-search-requirements/

We'd like to allow global code searches via the API in the future, but I can't promise when it will happen. So in the meantime, code search must be scoped to a user, organization, or repository e.g. scoping to user:james2doyle:

As for,

You can alternatively try sandcrawler to do so. I can try to help you but cannot do so before next week unfortunately.

That'd be really appreciated. I've confirmed that Scrapinghub Portia isn't able to handle it, even with the JavaScript feature turned on. Now I've confirmed that artoo can't handle it either. That is, I've exhausted all the tools I know. If I can't make sandcrawler work, I won't be sure whether it is my own limitation or whether it is inherently impossible. I'll be inclined to believe the latter. So if you can conclude that it is impossible, that would be the last nail I need.

Don't worry about time, as long as you keep it in mind and find some time to look into it later. Thanks.

Yomguithereal commented 7 years ago

Can I ask you what you are spoofing when performing your HTTP calls?

suntong commented 7 years ago

Sorry I don't quite understand the question -- are you talking about Scrapinghub, or artoo, or doing in the browser, or ...?

Yomguithereal commented 7 years ago

Using any tool really. What are the headers you send? Do you handle cookies, session etc.?

suntong commented 7 years ago

No I didn't do anything special, other than visiting them in the browser.

Yomguithereal commented 7 years ago

So you either need to watch the HTTP queries the browser makes so you can replicate them as closely as you can, or you can use artoo to build an automaton that will:

  1. Click on the next button
  2. Wait for the results to be rendered
  3. Collect the data
  4. Loop until finished
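
The four steps above could be sketched roughly as follows. This is only an outline under stated assumptions: `crawlByClicking` and its four callbacks (`collect`, `hasNext`, `clickNext`, `waitForResults`) are hypothetical names, not part of artoo, and the caller has to supply them for the page at hand.

```javascript
// Generic "click next, wait, collect, loop" automaton.
// Every callback is a hypothetical hook supplied by the caller:
//   collect()        -> array of rows scraped from the current page
//   hasNext()        -> true if a "Next" link is present
//   clickNext()      -> triggers the click on the "Next" link
//   waitForResults() -> promise resolved once new results are rendered
async function crawlByClicking({ collect, hasNext, clickNext, waitForResults, limit = 9 }) {
  const rows = [];
  for (let page = 0; page < limit; page++) {
    // Collect the data currently rendered on the page.
    rows.push(...collect());
    // Stop at the page limit or when there is no "Next" link left.
    if (page === limit - 1 || !hasNext()) break;
    // Click on the next button, then wait for the results to be rendered.
    clickNext();
    await waitForResults();
  }
  return rows;
}
```

In the browser, `collect` might wrap a call such as `artoo.scrape(definition)`, `hasNext` and `clickNext` might look up the `a.next_page` link, and `waitForResults` might poll until the result list changes. Note that this only works if the site updates the results in place; a click that loads a whole new page would discard the running script.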