Closed by ijt 5 years ago
Here is a first attempt at a query to order the repositories based on information already present in the repo table. The idea is to order descending by the time elapsed between the creation of the repo and its last commit. That favors long-lived repos:
sg=# select uri, created_at, updated_at from repo where not fork and created_at >= '2000-01-01 00:00:00+00' and updated_at is not null order by updated_at - created_at desc, length(uri) asc limit 40;
uri | created_at | updated_at
--------------------------------------------------+-------------------------------+-------------------------------
github.com/Financial-Times/aggregate-healthcheck | 2016-02-17 10:13:15.031351+00 | 2019-07-10 02:29:20.762954+00
github.com/unknwon/the-way-to-go_ZH_CN | 2016-03-09 15:46:45.371299+00 | 2019-07-22 10:58:11.08837+00
github.com/gorilla/handlers | 2016-03-11 22:44:34.672361+00 | 2019-07-23 09:42:08.684536+00
github.com/sourcegraph/srclib | 2016-02-17 03:32:51.463969+00 | 2019-06-29 01:00:13.743766+00
github.com/intercom/intercom-go | 2016-03-09 16:26:33.57322+00 | 2019-07-20 13:13:55.074747+00
github.com/segmentio/go-prompt | 2016-03-15 19:30:33.813837+00 | 2019-07-23 18:57:09.05524+00
github.com/caddyserver/caddy | 2016-02-25 18:46:38.438014+00 | 2019-07-02 19:01:15.123117+00
github.com/Redth/PushSharp | 2016-03-18 11:18:10.861609+00 | 2019-07-23 02:08:25.703996+00
github.com/RehabMan/OS-X-Voodoo-PS2-Controller | 2016-03-12 23:59:12.071534+00 | 2019-07-17 09:54:57.882249+00
github.com/gorilla/pat | 2016-03-09 15:41:51.70236+00 | 2019-07-13 18:12:26.135696+00
github.com/sourcegraph/annotate | 2016-02-17 08:12:15.992987+00 | 2019-06-21 03:54:10.30054+00
github.com/square/leakcanary | 2016-03-21 18:45:31.841576+00 | 2019-07-23 13:25:30.272823+00
github.com/google/gxui | 2016-03-09 15:47:00.773795+00 | 2019-07-11 09:15:44.556037+00
github.com/sourcegraph/srclib-docker | 2016-02-17 08:12:24.361623+00 | 2019-06-19 19:22:30.702075+00
github.com/joeshaw/envdecode | 2016-02-25 20:41:06.675202+00 | 2019-06-27 13:51:16.429081+00
github.com/Caliburn-Micro/Caliburn.Micro | 2016-03-18 15:05:34.098663+00 | 2019-07-18 14:15:52.604164+00
github.com/vlucas/phpdotenv | 2016-03-22 03:08:54.826629+00 | 2019-07-21 09:17:43.381621+00
github.com/h5bp/html5-boilerplate | 2016-03-21 18:50:32.472869+00 | 2019-07-20 13:18:12.525006+00
github.com/quartznet/quartznet | 2016-03-18 15:10:12.226925+00 | 2019-07-17 07:22:25.731509+00
github.com/angular-ui/bootstrap | 2016-03-21 18:50:49.597421+00 | 2019-07-20 07:57:53.391567+00
github.com/moq/moq4 | 2016-03-18 12:16:39.087041+00 | 2019-07-16 12:16:03.736697+00
github.com/clojure/clojure | 2016-03-21 18:45:39.691456+00 | 2019-07-19 16:29:45.94775+00
github.com/WP-API/WP-API | 2016-03-22 03:08:50.934113+00 | 2019-07-19 22:49:01.483873+00
github.com/zenorocha/clipboard.js | 2016-03-21 18:50:36.971436+00 | 2019-07-19 08:06:26.994637+00
github.com/bolt/bolt | 2016-03-22 03:09:22.124677+00 | 2019-07-19 03:06:12.916513+00
github.com/autofac/Autofac | 2016-03-18 09:59:22.343739+00 | 2019-07-15 06:45:35.145985+00
github.com/ReactiveCocoa/ReactiveCocoa | 2016-03-22 14:14:07.976028+00 | 2019-07-19 02:41:39.834086+00
github.com/FriendsOfPHP/Goutte | 2016-03-22 03:09:13.922934+00 | 2019-07-18 07:11:47.89508+00
github.com/gorilla/context | 2016-03-11 19:35:49.503091+00 | 2019-07-06 23:39:05.51818+00
github.com/SVProgressHUD/SVProgressHUD | 2016-03-22 14:14:10.718432+00 | 2019-07-17 07:43:44.937553+00
github.com/gorilla/mux | 2016-03-04 06:30:33.74973+00 | 2019-06-28 19:27:16.849543+00
github.com/mailgun/godebug | 2016-03-09 17:24:52.213674+00 | 2019-07-03 04:24:40.655913+00
github.com/siddontang/ledisdb | 2016-03-16 03:31:22.919918+00 | 2019-07-09 05:48:53.687457+00
github.com/EventStore/EventStore | 2016-03-18 13:01:08.838797+00 | 2019-07-11 12:01:33.068471+00
github.com/yiisoft/yii2 | 2016-03-22 03:08:58.366864+00 | 2019-07-15 01:46:31.945273+00
github.com/photonstorm/phaser | 2016-03-21 18:50:32.937426+00 | 2019-07-14 16:36:00.438966+00
github.com/sourcegraph/vcsstore | 2016-02-17 08:12:28.65143+00 | 2019-06-11 04:45:51.513217+00
github.com/walkor/Workerman | 2016-03-22 03:08:55.391829+00 | 2019-07-14 17:35:39.246456+00
github.com/fiorix/freegeoip | 2016-03-09 17:24:21.759521+00 | 2019-07-01 21:00:42.782485+00
github.com/getlantern/lantern | 2016-03-09 15:41:41.04518+00 | 2019-06-30 10:44:08.714888+00
(40 rows)
Time: 748.220 ms
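To make the ordering concrete, here is a minimal Python sketch of the same sort key applied in memory to three rows copied from the listing above. It is only an illustration of the idea, not how the ranking would be served: the function name `rank_by_lifespan` is made up, and the timestamps are truncated to whole seconds with the time zone dropped so that `datetime.fromisoformat` accepts them on older Python versions.

```python
from datetime import datetime

# Three rows from the listing above, deliberately out of order.
# Timestamps are truncated to whole seconds (an assumption made for parsing).
ROWS = [
    ("github.com/gorilla/mux", "2016-03-04 06:30:33", "2019-06-28 19:27:16"),
    ("github.com/getlantern/lantern", "2016-03-09 15:41:41", "2019-06-30 10:44:08"),
    ("github.com/Financial-Times/aggregate-healthcheck", "2016-02-17 10:13:15", "2019-07-10 02:29:20"),
]


def rank_by_lifespan(rows):
    """Sort by (updated_at - created_at) descending, then by URI length ascending."""
    def key(row):
        uri, created, updated = row
        lifespan = datetime.fromisoformat(updated) - datetime.fromisoformat(created)
        # Negate the lifespan so that the longest-lived repo sorts first.
        return (-lifespan.total_seconds(), len(uri))

    return [row[0] for row in sorted(rows, key=key)]


if __name__ == "__main__":
    for uri in rank_by_lifespan(ROWS):
        print(uri)
```

The three repos come out in the same relative order as in the query output: aggregate-healthcheck first, then gorilla/mux, then lantern.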
Some popular repositories show up this way.
sg=# create temp table toprepos as select uri, created_at, updated_at from repo where not fork and created_at >= '2000-01-01 00:00:00+00' and updated_at is not null order by updated_at - created_at desc limit 20000;
SELECT 20000
Time: 981.875 ms
sg=# select uri from toprepos where uri like '%/vim/vim';
uri
--------------------
github.com/vim/vim
(1 row)
Time: 20.985 ms
sg=# select uri from toprepos where uri like '%/torvalds/linux';
uri
---------------------------
github.com/torvalds/linux
(1 row)
Time: 29.581 ms
Funnily, sourcegraph/sourcegraph doesn't show up, but that's only because its created_at field is artificially recent.
sg=# select uri from toprepos where uri like '%/sourcegraph/sourcegraph';
uri
-----
(0 rows)
I don't think we need to worry about that though. This query is probably good enough to get started and start giving some results.
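The spot checks above amount to a suffix match on the URI. A small Python sketch of the same check follows, assuming the ranked URIs have been exported to a list. `find_by_suffix` and the `TOPREPOS` sample are hypothetical names used only for illustration, and the list holds just a few URIs that appear in this thread rather than the real 20000 rows.

```python
def find_by_suffix(uris, suffix):
    """Return the URIs ending with `suffix`, like `uri LIKE '%<suffix>'` in SQL.

    This matches LIKE only when the suffix contains no `%` or `_` wildcards.
    """
    return [uri for uri in uris if uri.endswith(suffix)]


# A tiny stand-in for the toprepos temp table (not the real contents).
TOPREPOS = [
    "github.com/vim/vim",
    "github.com/torvalds/linux",
    "github.com/gorilla/mux",
]

if __name__ == "__main__":
    for suffix in ("/vim/vim", "/torvalds/linux", "/sourcegraph/sourcegraph"):
        print(suffix, find_by_suffix(TOPREPOS, suffix))
```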
There is still some work to do on this. Mainly we need to get zoekt running on sourcegraph.com.
I ran github.com/ijt/reposize on the first 1000 repos and found that they take up about 15G of space. Based on the heuristic in scale.md, that means we should allocate about 45G for the zoekt-webserver pod. The zoekt-indexserver pod should get about 3G since that's about as big as the repositories get (for example github.com/BabylonJS/Babylon.js).
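The sizing arithmetic can be written down as a short sketch. The 3x factor below is inferred from the 15G and 45G figures in this comment and is not quoted from scale.md, so treat it as an assumption. The function names are hypothetical.

```python
def webserver_allocation_gb(repo_total_gb, factor=3.0):
    """Disk to allocate for zoekt-webserver, given the total size of the repos.

    `factor` is an assumed multiplier inferred from the 15G -> 45G example.
    """
    return repo_total_gb * factor


def indexserver_allocation_gb(largest_repo_gb):
    """Disk for the indexserver pod: roughly the size of the largest single repo."""
    return largest_repo_gb


if __name__ == "__main__":
    print(webserver_allocation_gb(15))   # total size of the first 1000 repos
    print(indexserver_allocation_gb(3))  # about the size of the largest repo
```

If the repo set grows, only the measured total passed to `webserver_allocation_gb` needs to change, and the factor can be revisited against scale.md.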
It's working now:
Currently sourcegraph.com doesn't show anything useful if you do a simple search without specifying a repo. For example, say we want to see an example of how to use "d3.selectAll". Currently sourcegraph.com gives this result:
[screenshot of the search result]
It would be more useful if this search showed some examples of using this function in some popular repositories.
Here are some steps to get there:
Estimate the disk space needed for zoekt-webserver by summing the sizes of all the repos, probably using http://github.com/ijt/reposize
Run zoekt-webserver and zoekt-sourcegraph-indexserver on sourcegraph.com
When zoekt-sourcegraph-indexserver asks for the repositories to index, give it say 20k repositories at random
When a search has no repo field, search over some plausible repos instead of serving up a suggestion box.
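The last two steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Sourcegraph or zoekt code: all three function names are hypothetical, and detecting the repo field by looking for a `repo:` token in the query is a simplification.

```python
import random


def sample_repos(all_repos, k=20000, seed=None):
    """Pick up to k repositories uniformly at random to hand to the indexer."""
    rng = random.Random(seed)  # a fixed seed gives a reproducible sample
    return rng.sample(all_repos, min(k, len(all_repos)))


def has_repo_field(query):
    """Assumed check: the query scopes itself if any token starts with 'repo:'."""
    return any(token.startswith("repo:") for token in query.split())


def repos_to_search(query, default_repos):
    """Return the default repo set for unscoped queries, else None.

    None means the query carries its own repo field and should be
    handled by the existing scoped-search path.
    """
    if has_repo_field(query):
        return None
    return default_repos


if __name__ == "__main__":
    repos = ["github.com/vim/vim", "github.com/torvalds/linux", "github.com/gorilla/mux"]
    indexed = sample_repos(repos, k=2, seed=1)
    print(repos_to_search("d3.selectAll", indexed))
    print(repos_to_search("repo:github.com/gorilla/mux Router", indexed))
```

Returning `None` for scoped queries keeps the fallback separate from the existing behaviour, so only searches without a repo field are affected.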