tidev / gittio

Search & Install all Titanium Modules and Alloy Widgets on GitHub
http://gitt.io

gitTio doesn't index anymore #116

Closed FokkeZB closed 8 years ago

FokkeZB commented 8 years ago

Since the end of August, gitTio hasn't been indexing anymore. This seems to be because GitHub doesn't like that we crawl the browser search. Since it broke, I get errors like these:

2016-09-19 17:00:54:3573 [ERROR] You have triggered an abuse detection mechanism. Please wait a few minutes before you try again. in /home/fokke/gitt.io/vendor/knplabs/github-api/lib/Github/HttpClient/HttpClient.php at 138

'a:2:{s:4:"code";i:0;s:5:"trace";s:354:"#0 /home/fokke/gitt.io/vendor/knplabs/github-api/lib/Github/HttpClient/HttpClient.php(90): Github\\HttpClient\\HttpClient->request(\'search/code\', NULL, \'GET\', Array, Array)
#1 /home/fokke/gitt.io/jobs/12_zips.php(53): Github\\HttpClient\\HttpClient->get(\'search/code\', Array)
#2 /home/fokke/gitt.io/jobs.php(214): require_once(\'/home/fokke/git...\')
#3 {main}";}'

/cc @Topener, @jasonkneen

yuchi commented 8 years ago

What’s the value for the X-RateLimit-* response headers?
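
For reference, a minimal sketch of reading those headers in Python. It assumes the response headers are already available as a plain mapping; the function name `parse_rate_limit` and the sample values are made up for illustration and are not taken from gitTio.

```python
from datetime import datetime, timezone


def parse_rate_limit(headers):
    """Extract the X-RateLimit-* values from a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "limit": int(lowered["x-ratelimit-limit"]),
        "remaining": int(lowered["x-ratelimit-remaining"]),
        # The reset value is a UTC timestamp in epoch seconds.
        "reset": datetime.fromtimestamp(
            int(lowered["x-ratelimit-reset"]), tz=timezone.utc
        ),
    }


# Hypothetical sample values, not real output from the gitTio server.
sample_headers = {
    "X-RateLimit-Limit": "30",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1474304454",
}
info = parse_rate_limit(sample_headers)
print(info["remaining"], "requests left until", info["reset"].isoformat())
```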

FokkeZB commented 8 years ago

That applies to the APIs only. Since global search had no API (at least when I wrote gitTio), I use a scraping script for that.

yuchi commented 8 years ago

So they have probably explicitly forbidden scraping 😨

FokkeZB commented 8 years ago

Yep, and search still requires a user, org or repo filter:

$ curl 'https://api.github.com/search/code?q=addClass+in:file+language:js'
{
  "message": "Validation Failed",
  "errors": [
    {
      "message": "Must include at least one user, organization, or repository",
      "resource": "Search",
      "field": "q",
      "code": "invalid"
    }
  ],
  "documentation_url": "https://developer.github.com/v3/search/"
}

FokkeZB commented 8 years ago

I've updated the user agent I send. Let's see if that fixes it, but I doubt it. It stopped working after August 31st, so I guess they rolled out new security measures on September 1st.

The only alternative is to let people report orgs/users/repos to search. But that kind of defeats the spider idea behind gitTio.

FokkeZB commented 8 years ago

Okay, so I found the issue. GitHub no longer allows you to search all repos if you are not logged in.

Here's an example of a URL the bot fetches: https://github.com/search?utf8=✓&q=moduleid+AND+guid+AND+minsdk+AND+platform+filename%3Amanifest+in%3Afile

Try it logged out and you'll see that you are required to include an org/user/repo filter.

I guess I'll have to see if I can log in.

yuchi commented 8 years ago

Did you try authenticating to test the search API requests?

FokkeZB commented 8 years ago

It isn't explicitly stated at https://developer.github.com/v3/search/#search-code but last time I tried, even authenticated requests required either a repo or an owner filter.

FokkeZB commented 8 years ago

I've tried again and I have good news, bad news and then good news.

The good news is that I (now) can search code in all repositories through the API.
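
A rough sketch of what such an authenticated code search request could look like in Python, using only the standard library. The helper name `build_search_request` and the `YOUR_TOKEN` placeholder are assumptions for illustration; the query string is the one from the browser URL earlier in this thread and may need adjusting for the API.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_CODE_URL = "https://api.github.com/search/code"


def build_search_request(query, token, page=1):
    """Build (but do not send) an authenticated search/code request."""
    params = urlencode({"q": query, "page": page})
    request = Request(f"{SEARCH_CODE_URL}?{params}")
    # Token authentication; the token itself is a placeholder here.
    request.add_header("Authorization", f"token {token}")
    request.add_header("Accept", "application/vnd.github.v3+json")
    return request


# Query copied from the browser search URL quoted in this thread.
QUERY = "moduleid AND guid AND minsdk AND platform filename:manifest in:file"
request = build_search_request(QUERY, token="YOUR_TOKEN")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the matches are expected under the `items` key of the JSON response.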

The bad news is that the result does not include the date the result was indexed, nor can I limit the search to only include results of files that changed since I last searched.

But I can use another API to get the last commit of the file for each result and use that date instead.
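
The last-commit workaround can be sketched roughly as follows. This is not gitTio's actual code (which is PHP); it assumes each search result carries `path` and `repository.full_name`, and that the commits endpoint is queried with a `path` filter. The helper names are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def last_commit_url(item):
    """URL that lists the most recent commit touching a search result's file."""
    repo = item["repository"]["full_name"]
    params = urlencode({"path": item["path"], "per_page": 1})
    return f"https://api.github.com/repos/{repo}/commits?{params}"


def last_commit_date(commits):
    """Committer date of the newest commit in a decoded response, or None."""
    if not commits:
        return None
    stamp = commits[0]["commit"]["committer"]["date"]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def changed_since(commits, since):
    """True if the file's last commit is newer than the previous indexing run."""
    date = last_commit_date(commits)
    return date is not None and date > since


# Hypothetical search result and commits response, for illustration only.
item = {"path": "manifest", "repository": {"full_name": "octocat/example"}}
commits = [{"commit": {"committer": {"date": "2016-09-10T12:00:00Z"}}}]
last_run = datetime(2016, 8, 31, tzinfo=timezone.utc)
print(last_commit_url(item), changed_since(commits, last_run))
```

Note that this costs one extra request per search result, so it counts against the regular API rate limit.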

So... I'm currently indexing all new sources since August 31!