Open suntong opened 7 years ago
Here is the script that I prepared to make it easier for you to get started:
var definition = {
  iterator: 'div.code-list > div.code-list-item',
  data: {
    FullName: {sel: 'p > a:nth-child(1)'}, // extract Text!
    FileName: {sel: 'p > a:nth-child(2)'}, // extract Text!
    UpdatedAt: {sel: 'span.updated-at > relative-time', attr: 'datetime'},
    Language: {sel: 'span.language'} // extract Text!
  }
};
artoo.ajaxSpider(
  function(i) {
    // TODO: click and follow url of "div.pagination a.next_page" -> "href"
  }, {
    scrape: definition,
    limit: 9,
    concat: true,
    throttle: 500,
    done: function(data) {
      artoo.log.debug('Finished retrieving data. Downloading...');
      artoo.saveCsv(data);
    }
  }
);
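The commented step in the iterator could be filled in roughly as follows. This is only a minimal sketch, not part of artoo's API: `nextPageUrl` is a hypothetical helper, the selector is adapted from the comment in the script (assuming `divp` was meant to be `div`), and it assumes the page being inspected is available as a DOM `Document`. It returns `false` when there is no "Next" link; whether the spider treats that as a stop signal should be checked against the artoo documentation.

```javascript
// Hypothetical helper: find the "Next" link on a results page and
// return its absolute URL, or false when there is no further page.
function nextPageUrl(doc, baseUrl) {
  // Selector adapted from the comment in the script above.
  var link = doc.querySelector('div.pagination a.next_page');
  if (!link) return false;

  var href = link.getAttribute('href');
  if (!href) return false;

  // Resolve relative hrefs such as "/search?p=2&q=..." against the page URL.
  return new URL(href, baseUrl).href;
}
```

In the browser, `nextPageUrl(document, location.href)` would give the URL of the second page, provided the current page actually rendered a result list.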
Hello @suntong. The ajaxSpider can only use ajax and therefore retrieves only static HTML. What you need is either to find the correct way to query GitHub (by spoofing the user-agent or similar, to make the site believe you are a regular user), or to use a more complex solution such as PhantomJS, or to build a Chrome extension to create an automaton. Alternatively, you can try sandcrawler. I can try to help you, but unfortunately not before next week.
Alternatively, this might be a stupid question but isn't the Github API able to give you the information you need? I expect them to have pretty good scraping & crawling defenses.
Thank you Guillaume for your reply.
isn't the Github API able to give you the information you need?
I thought so, but after failing to get it and double-checking that I had done nothing wrong, I wrote GitHub a question. Here is what they replied:
(now) it's not possible to perform global code searches via the API as mentioned in this blog post:
https://developer.github.com/changes/2013-10-18-new-code-search-requirements/
We'd like to allow global code searches via the API in the future, but I can't promise when it will happen. So in the meantime, code search must be scoped to a user, organization, or repository e.g. scoping to user:james2doyle:
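To illustrate the scoping restriction described in the reply, here is a minimal sketch of building a scoped query against the GitHub REST API code search endpoint. `scopedCodeSearchUrl` is a hypothetical helper name, the `user:` qualifier is the one mentioned in the reply, and the request itself (which may also need authentication) is deliberately left out.

```javascript
// Hypothetical helper: build a code-search URL for the GitHub REST API,
// scoped to a single user as the reply above requires.
function scopedCodeSearchUrl(term, user) {
  var url = new URL('https://api.github.com/search/code');
  // Qualifiers are space-separated inside the single `q` parameter;
  // searchParams takes care of the percent-encoding.
  url.searchParams.set('q', term + ' user:' + user);
  return url.href;
}
```

For example, `scopedCodeSearchUrl('apidsl language:go', 'james2doyle')` yields a URL whose `q` parameter decodes back to `apidsl language:go user:james2doyle`. The limitation remains that each call covers one user, organization, or repository, not all of GitHub.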
As for,
You can alternatively try sandcrawler to do so. I can try to help you but cannot do so before next week unfortunately.
That'd be really appreciated. I've confirmed that Scrapinghub Portia isn't able to handle it, even with the JavaScript feature turned on. Now I've confirmed that artoo can't handle it either. I.e., I've exhausted all the tools I know. If I can't make sandcrawler work, I won't know whether that is my own limitation or whether it is simply impossible, though I'd be inclined to believe the latter. So if you can conclude that it is impossible, that would be the last nail I need.
Don't worry about time, as long as you keep it in mind and find some time to look into it later. Thanks.
Can I ask you what you are spoofing when performing your HTTP calls?
Sorry I don't quite understand the question -- are you talking about Scrapinghub, or artoo, or doing in the browser, or ...?
Using any tool really. What are the headers you send? Do you handle cookies, session etc.?
No, I didn't do anything special, other than visiting them in the browser.
So you either need to watch the HTTP queries being made so that you can replicate them as closely as possible, or you can use artoo to build an automaton that will:
The first parameter of artoo.ajaxSpider is,
This means that artoo's ajaxSpider only follows URLs that are either given in advance or computed. However, it is possible for artoo's ajaxSpider to,
The reason I'm asking is that GitHub code search only works in the browser. Nothing else works. I.e., if you click or paste the following URL, you will only get the "We could not perform this search" error.
https://github.com/search?utf8=%E2%9C%93&q=%22github.com%2Fgoadesign%2Fgoa%2Fdesign%2Fapidsl%22+language%3Ago&type=Code&ref=searchresults
However, if you do github code search in browser, searching for
Then click on the second choice on the left, "Code", and you will get "We’ve found 294 code results" and a list of all the hits. If you compare the URLs, you will find that this "working" URL is exactly the same as the one above. Try different search terms, and try pasting the "working" URL into a new browser window five minutes later: you will find that it no longer works.
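For reference, the query string of the URL quoted above can be decoded with the standard `URL` class. This is just a small sketch; `describeSearchUrl` is a hypothetical helper name. It shows that the URL carries only the search terms and result type, which suggests (but does not prove) that whatever makes the search work in the browser lives outside the URL.

```javascript
// The search URL quoted above, copied verbatim.
var searchUrl = 'https://github.com/search?utf8=%E2%9C%93&q=%22github.com%2Fgoadesign%2Fgoa%2Fdesign%2Fapidsl%22+language%3Ago&type=Code&ref=searchresults';

// Hypothetical helper: pull the decoded search parameters out of the URL.
function describeSearchUrl(rawUrl) {
  var params = new URL(rawUrl).searchParams;
  return {
    query: params.get('q'),   // percent-escapes and "+" are decoded
    type: params.get('type'),
    ref: params.get('ref')
  };
}
```

Calling `describeSearchUrl(searchUrl)` gives a `query` of `"github.com/goadesign/goa/design/apidsl" language:go`, a `type` of `Code`, and a `ref` of `searchresults`.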
This is why I need artoo's ajaxSpider to click and follow that "Next" url. Is this possible? Thanks!