Closed b-j-p closed 3 years ago
Hey @b-j-p, sorry for the very late reply here! Thanks for suggesting this feature and for the detailed writeup. We're thinking about how to improve the default search experience on Sourcegraph.com, so this is very useful.
A couple questions come to mind:
componentDidMount |* js
example)?
Let me know if there's anything I'm missing about your proposal! And again, sorry for the late response.
Hello @attfarhan. Thanks for getting back to me about this. No apologies, please! I'm just glad to be able to help.
Yes. To your last point: I admit that a canonical search feature would be difficult to get right, and also that, as part of the default search experience, it would have to be gotten right. So there are some risks involved with it from a product perspective. I'll get to your other points too, but I want to go on the record as saying that I am very happy with Sourcegraph, even if canonical search dies here. I opened the issue knowing it was an impractical, extracurricular kind of thing. No matter what, I plan on adding value to Sourcegraph in the practical way, too, that is, with a PR!
I would appreciate some advice from you about the best `good first issue` for me to work on. But that's a different conversation 😅.
Thanks for the follow-up @b-j-p! I think we're on the same page as to what the difficulties here are, and it's something that we'll continue to think about and see how we can get right. Glad to hear that you're happy with Sourcegraph.
As for good first issues, we have a label that you can look through and see what interests you: https://github.com/sourcegraph/sourcegraph/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22 All contributions are valued and appreciated! 😄
To your first point, @attfarhan, about `popular npm packages`. I updated the image included with my feature request 👆 to give more context about how it looks when Sourcegraph makes this suggestion to the user. In short, yes, I tried it, and the user gets results back from a basic js query, which is pretty great. I remember this list of results being an especially cool one and looking a little bit like the one I'd want from a canonical search feature. In particular, it was fun to see what some of the projects included in this repogroup were.
However:
`popular npm packages`, I remember being swamped by the richness of the results. Where I'd like to have gotten, from querying your `js-canon`, one specimen from each project represented in the canon containing a match, I got multiple matches across an indefinitely large percentage of the projects in the repogroup. I was hoping to examine one species of little toad in the javascript ecosystem. I was querying a broad and eclectic repogroup to do that. But `componentDidMount` isn't the kind of function that is used just once in a repository. So I found it all over the place and got lost. That's not too surprising. The data that we are talking about (in the land of open source code) is so rich that it is easy to get lost in it. "Swamped" isn't quite the right word, either. It's more like a jungle out there! And I don't even know if there are toads in the jungle! Ach, I'm getting lost in my imagery. Forget about toads, what counts is that there are going to be some fundamental functions in the js ecosystem, and I was trying with Sourcegraph to look into one of them.
`popular npm packages`, but which???, sniffing around, getting bitten by all kinds of cool looking mosquitos. I would very much have preferred, in this case, to be able to do some more surfing around the results, in and between projects in `popular npm packages`. I wished for a cleaner and steadier surface provided me by a series of unique matches, one per project, which stay put when I click around in them. I never got the opportunity to ask and answer the questions: Which one of these popular npm packages employing this particular lifecycle method is my personal favorite? Which project has the nicest looking `componentDidMount` for purposes of my own project?
`popular npm packages` suggestion. Sourcegraph will not make this suggestion to me every time I make an overly basic, unfiltered query that smells like javascript. The suggestion will be somewhere in the suggested filters area, or not. But wouldn't it be cool (forgetting about the query syntax that I came up with for a second) if Sourcegraph just detected an unfiltered, basic javascript query in the primary searchbox and piped it into the `js-canon` automatically? That, in place of returning the "Whoops, our jungle is too jungley for this query" sort of response that I see all the time. Actually, if you have a `js-canon`, I think you have the ability to do that. Because in order to represent a bunch of awesome javascript projects in a canon (by which I mean a kind of virtual super-repo), I think you would need to have a grip on a good number of the basic building blocks in the js ecosystem. But I will go into this some more when I address your last point, about curating canons.
`popular npm packages` worked with just a quick click to get me some fascinating results back, I never was querying the type of thing I wish I could be querying when making this sort of search.
About the syntax, @attfarhan, your second point. I don't feel strongly about my proposal at all. As a matter of fact, I wish canonical search were the default "fast path" for Sourcegraph search on the web -- unfiltered exact match search, with no narrower default search context specified, gets piped right into the relevant canon. In which case, no new syntax is required and the syntax issue may be moot. The only problem is when you have a default narrower search context set up, as I do (I run a local instance of Sourcegraph against two private repos). Then you would need a way syntactically to jump the bounds of the narrower context and aim a query at the greater jungle of OSS.
My suggestion `[exact match pattern] |* [ecosystem abbreviation]` was added to try to mark the fact that there is a kind of break happening here. With this search the user would NOT be querying a repo or a repogroup, but some third kind of thing, and I want to call this kind of thing a canon.
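To make the proposed syntax concrete, here is a minimal sketch of how a query of the form `[exact match pattern] |* [ecosystem abbreviation]` might be split into its two parts. This is only an illustration: `CanonicalQuery` and `parse_query` are hypothetical names, and nothing here reflects how Sourcegraph actually parses queries.

```python
from typing import NamedTuple, Optional


class CanonicalQuery(NamedTuple):
    pattern: str               # exact-match pattern, e.g. "componentDidMount"
    ecosystem: Optional[str]   # canon abbreviation, e.g. "js"; None if no canon was named


def parse_query(raw: str) -> CanonicalQuery:
    """Split '<pattern> |* <ecosystem>' on the last '|*' marker (hypothetical syntax)."""
    pattern, marker, ecosystem = raw.rpartition("|*")
    if not marker:
        # No '|*' present: an ordinary query that stays in the default context.
        return CanonicalQuery(raw.strip(), None)
    pattern, ecosystem = pattern.strip(), ecosystem.strip()
    if not pattern or not ecosystem:
        raise ValueError(f"expected '<pattern> |* <ecosystem>', got {raw!r}")
    return CanonicalQuery(pattern, ecosystem)


canonical = parse_query("componentDidMount |* js")
plain = parse_query("componentDidMount")
```

Splitting on the last `|*` means a pattern that happens to contain `|*` itself would still keep its earlier occurrences intact; whether that is the right trade-off is an open question.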
In particular:
Really ashamed to have to admit that...
Ok @attfarhan, just getting back from a vacation with the fam. My plan was to respond to your question about canon curation from the beach, but I never got around to taking out my laptop. I am feeling more dull-witted than I was before I left, to say the least, so you will have to forgive the job I do on this. Hopefully, you come away with an understanding of the general direction I would go with curation/construction of canons.
THE INPUT:
Begin with a repogroup much like `popular npm packages`, let's call it `seed`. What criteria does a project have to meet in order to make it into `seed`? I don't know that for sure, and it might even be slightly different for different ecosystems. On the bright side: this is the only part of the curation process that is somewhat manual and dependent on judgement. And since the user will never be querying `seed` directly, we don't have to get it just right. If I understood you correctly, 👆, much of the grief associated with curating repogroups like `popular npm packages` stems from the fact that Sourcegraph users are going to be querying it directly. This initial repogroup is not a filter on the user's search. It is just a seed for Sourcegraph itself to query. The important thing to note about the input to the curation process is that the repogroup be put together in such a way that it can be expanded/contracted automatically, and in a principled and consistent way, depending on what happens in the course of the curation process. Here are some criteria that we might want to use:
We want to be able to ask, and get a computer to answer, this question: if push comes to shove and we have to discard one, which one of these projects do we NOT want to represent in our canon? Well, all other things being equal, maybe we will discard the least popular project from `seed`. Or this question: if we have to add another project to `seed` to improve the quality of our canonical search of this ecosystem, which project should we add? If all other things are equal, well, maybe we will add, from a list of candidate projects that we automatically generate, the one that has the largest number of contributors. I don't know.
But here is a criterion that we would certainly want to use to expand/contract `seed`:
I'll say more about this in one second 👇, but given an average richness number in the projects already in `seed`, we can set a bar over which candidate projects could enter `seed`, and also a bar below which projects in `seed` automatically become subject to being replaced/discarded.
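A rough sketch of the two bars, assuming each project already has a richness number. The factors `entry_factor` and `eviction_factor` are made-up knobs for illustration; the thread does not say where the bars should sit relative to the average.

```python
from statistics import mean
from typing import Dict, List, Tuple


def richness_bars(seed_richness: Dict[str, float],
                  entry_factor: float = 1.0,
                  eviction_factor: float = 0.5) -> Tuple[float, float]:
    """Derive both bars from the average richness of the projects in seed."""
    average = mean(seed_richness.values())
    return average * entry_factor, average * eviction_factor


def review(seed_richness: Dict[str, float],
           candidate_richness: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (candidates that clear the entry bar, seed projects under the eviction bar)."""
    entry_bar, eviction_bar = richness_bars(seed_richness)
    admit = sorted(p for p, r in candidate_richness.items() if r > entry_bar)
    at_risk = sorted(p for p, r in seed_richness.items() if r < eviction_bar)
    return admit, at_risk


# Toy numbers, invented for the example.
seed_richness = {"proj-a": 10.0, "proj-b": 6.0, "proj-c": 2.0}
candidate_richness = {"proj-x": 7.5, "proj-y": 4.0}

bars = richness_bars(seed_richness)
admit, at_risk = review(seed_richness, candidate_richness)
```

With these toy numbers the average is 6.0, so `proj-x` clears the entry bar and `proj-c` falls under the eviction bar.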
HANDLING RICHNESS
Here is a sense of what I mean by richness and calculating a richness number. It's kind of the most important part of the canonical representation of an ecosystem. If we can't do something like this, @attfarhan -- and what follows is a rough explanation -- I don't know if canonical search is possible at all.
So we have `seed`, which is a repogroup 👆, and with a repogroup comes a complete list of all symbols appearing in that repogroup. Actually, it would be REALLY great if we had something else in addition to the symbols to work with just here. But I will not discuss it, because, as far as I know, Sourcegraph does not possess that data yet. We already have the symbols, so I will rely on those. We iterate through the collection of symbols appearing in `seed` and for each symbol we query `seed` for that symbol. Of course, we want to analyze the results of each query, in a basic way, and record the results of the analysis. It's very basic stuff. We are looking for things like number and distribution of matches. If we get "too many matches", great! If we find a symbol that, while it may not appear a bunch of times, does appear in each project in `seed`, well, great. We will identify in this way some of the fundamental symbols in the ecosystem. Call this list of symbols `core`. `core` will improve as `seed` improves, but it never has to be a complete list of the core building blocks in the ecosystem, it just has to be a decent sample.
Notice that symbols like `componentDidMount` will have an entry in `core-js`. `componentDidMount` will appear in `seed-js` at least once. So we know we will be querying `seed-js` for that symbol, just as we are doing for every single symbol that appears in `seed-js`. When we do query `seed-js` looking for `componentDidMount`, we will get way too many matches. But, wait a second, that's exactly the sort of thing we want to see when querying `seed` for one of `seed`'s symbols! When this happens we will go ahead and generate a record for `componentDidMount` in `core-js`.
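The `core` step above can be sketched roughly as follows. The sketch models `seed` as a plain dictionary of project -> file -> symbol occurrences instead of running real queries, and the `too_many` threshold is an invented stand-in for "too many matches".

```python
from collections import Counter
from typing import Dict, List

# seed is modeled as: project -> file path -> symbol occurrences in that file.
Seed = Dict[str, Dict[str, List[str]]]


def build_core(seed: Seed, too_many: int = 5) -> Dict[str, Dict[str, int]]:
    """Keep a symbol if it has 'too many' matches or shows up in every project."""
    total_matches: Counter = Counter()
    projects_with: Counter = Counter()
    for files in seed.values():
        seen_in_project = set()
        for symbols in files.values():
            total_matches.update(symbols)      # number of matches across seed
            seen_in_project.update(symbols)
        projects_with.update(seen_in_project)  # distribution across projects
    return {
        symbol: {"matches": total_matches[symbol], "projects": projects_with[symbol]}
        for symbol in total_matches
        if total_matches[symbol] >= too_many or projects_with[symbol] == len(seed)
    }


# Toy seed-js, invented for the example.
seed_js: Seed = {
    "proj-a": {
        "App.js": ["componentDidMount", "render", "componentDidMount"],
        "util.js": ["leftPad"],
    },
    "proj-b": {
        "Main.js": ["componentDidMount", "render"],
        "View.js": ["componentDidMount", "componentDidMount"],
    },
}

core_js = build_core(seed_js)
```

Here `componentDidMount` gets in because it has five matches, `render` gets in because it appears in every project, and `leftPad` stays out.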
Now the next step is to iterate through `seed` going project-by-project, and, inside of each project, file-by-file. What we want is to be able to assign a richness number to each file in each project in `seed`, the richness of a file being the measure of the density of the symbols appearing in that file which have an entry in `core`. A project's richness number is the sum of the richness numbers of its files. The average richness number for a project in `seed` would then be the sum of all the richness numbers of the projects in `seed` divided by the total number of projects in it. A lot of the individual files in `seed` will have a richness number that is actually or effectively nil. But the more of those the better. The files of nil richness will not appear in the canon. We will be extracting only the richest files from these projects in `seed`. And (it's a jungle out there!) that will end up meaning that we can represent more projects in the canon that we are now curating. We simply expand `seed`. This can only make the canonical search experience better, I think. More on this in my OUTPUT post to come.
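As a sketch of the arithmetic above: file richness is taken here to be the fraction of a file's symbol occurrences that have an entry in `core`. That is one possible reading of "density", chosen for the example; the sum and average follow the description directly.

```python
from typing import Dict, List, Set, Tuple

Seed = Dict[str, Dict[str, List[str]]]  # project -> file path -> symbol occurrences


def file_richness(symbols: List[str], core: Set[str]) -> float:
    """Density of core symbols in one file (assumed: core occurrences / all occurrences)."""
    if not symbols:
        return 0.0
    return sum(1 for s in symbols if s in core) / len(symbols)


def project_richness(files: Dict[str, List[str]], core: Set[str]) -> float:
    """A project's richness is the sum of the richness numbers of its files."""
    return sum(file_richness(symbols, core) for symbols in files.values())


def average_richness(seed: Seed, core: Set[str]) -> float:
    """Sum of the project richness numbers divided by the number of projects."""
    return sum(project_richness(files, core) for files in seed.values()) / len(seed)


def rich_files(seed: Seed, core: Set[str]) -> List[Tuple[str, str]]:
    """Files of nil richness are dropped; only these would be copied into the canon."""
    return sorted(
        (project, path)
        for project, files in seed.items()
        for path, symbols in files.items()
        if file_richness(symbols, core) > 0
    )


# Toy data, invented for the example.
core = {"componentDidMount", "render"}
seed: Seed = {
    "proj-a": {
        "App.js": ["componentDidMount", "render", "helper", "helper"],
        "util.js": ["leftPad"],
    },
    "proj-b": {
        "Main.js": ["componentDidMount", "render"],
        "View.js": ["componentDidMount"],
    },
}

richness_a = project_richness(seed["proj-a"], core)
richness_b = project_richness(seed["proj-b"], core)
average = average_richness(seed, core)
kept = rich_files(seed, core)
```

In the toy data, `util.js` has nil richness and drops out, which is exactly the kind of file that would never reach the canon.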
At this point, we might want to check and see if there are projects in `canon-candidates` that are richer than projects that we have currently in `seed`. To do this we would go through the projects in `canon-candidates` and, using `core`, calculate richness numbers for each and see if `seed` could use expansion/amendment.
If we can make `seed` better, here, we do, and run the whole process over. We have a new `seed` and so we get a new `core` and new richness figures for each project. We keep the old `seed` and the old richness figures so we can compare. Maybe we merge `old-core` and `new-core` and get a `super-core` list of fundamental symbols and then measure `old-seed` and `new-seed` against each other, consult `canon-candidates`, tweak `old-seed` and `new-seed` and make a decision between them. We could even go through this process until we reached a point where an `old-seed` compared favorably to a `new-seed` in terms of its richness. But this is not going to be the last opportunity to improve `seed`.
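One way to picture the "run the whole process over" loop is the sketch below. It swaps the weakest project in `seed` for the richest one in `canon-candidates` and stops as soon as the old `seed` compares favorably to the new one. The swap strategy, the `richness_of` callback, and `max_rounds` are all assumptions made for the illustration; merging `old-core` and `new-core` is left out.

```python
from typing import Callable, Iterable, Set


def refine_seed(seed: Iterable[str],
                candidates: Iterable[str],
                richness_of: Callable[[str, Set[str]], float],
                max_rounds: int = 10) -> Set[str]:
    """Repeatedly try one swap; stop when the new seed is no richer than the old one.

    richness_of(project, seed) is expected to recompute core from the given seed
    and return the project's richness against it. seed must not be empty.
    """
    current = set(seed)
    pool = set(candidates)

    def average(projects: Set[str]) -> float:
        return sum(richness_of(p, projects) for p in projects) / len(projects)

    for _ in range(max_rounds):
        if not pool:
            break
        # sorted() only makes tie-breaking deterministic.
        weakest = min(sorted(current), key=lambda p: richness_of(p, current))
        best = max(sorted(pool), key=lambda p: richness_of(p, current))
        proposal = (current - {weakest}) | {best}
        if average(proposal) <= average(current):
            break  # the old seed compares favorably, so keep it
        pool.discard(best)
        current = proposal
    return current


# Toy richness table, invented for the example; a real callback would depend on seed.
table = {"a": 5.0, "b": 1.0, "c": 3.0, "x": 4.0, "y": 0.5}
final_seed = refine_seed({"a", "b", "c"}, {"x", "y"}, lambda p, s: table[p])
```

With the toy table, the first round swaps `b` for `x`; the second round would swap `c` for `y`, which lowers the average, so the loop stops.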
I believe it might also be a good idea to rank the symbols appearing in `core` in terms of their "importance". Importance for the ecosystem and/or importance for the canonical representation of the ecosystem. If we did that, it would be possible for a file, A, that contains just 1 match for 1 `core` entry, to have the same richness number as a file, B, which has 2 matches for 2 different `core` entries. This may not be necessary, and, anyway, it would be a lot easier to weight the symbols in `core` with that other type of data I referred to above.
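A tiny illustration of the file A / file B case, assuming richness becomes a weighted sum over `core` matches. The weights are invented; how they would actually be derived is the open question above.

```python
from typing import Dict, List


def weighted_richness(symbols: List[str], weights: Dict[str, float]) -> float:
    """Sum the importance weight of every core match; non-core symbols count as 0."""
    return sum(weights.get(symbol, 0.0) for symbol in symbols)


# Hypothetical importance weights for three core entries.
core_weights = {"componentDidMount": 2.0, "render": 1.0, "setState": 1.0}

file_a = ["componentDidMount"]            # 1 match for 1 core entry
file_b = ["render", "setState", "noop"]   # 2 matches for 2 different core entries

richness_file_a = weighted_richness(file_a, core_weights)
richness_file_b = weighted_richness(file_b, core_weights)
```

Both files come out with the same richness number, because the single match in file A is for a more important symbol.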
THE OUTPUT
Finally we arrive at the point in the curation process where we can construct a `canon`! Or, because it is so novel for Sourcegraph to be not only ranging over naturally occurring projects, but also generating highly artificial projects for itself to range over, I'm going to talk as if Sourcegraph is doing all of this, not "us".
`mkdir canon`
`canon`, for each project in `seed`, Sourcegraph makes a `representation` folder
`seed` 👆 `seed` in order to represent that project for purposes of canonical search.
`seed`, the skimmings, Sourcegraph deposits a copy of the whole file into the home project's `representation` folder inside of `canon`.
`canon-package` file that sits at the root level of `canon`. Or maybe there is a top level `canon-package` that deals with data about the projects represented in `canon`, and then, in each `representation` folder, a project level `project-package` with data about the particular files Sourcegraph is using to represent the project.
`git init`
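The construction steps above might look something like this on disk. It is only a sketch: the `.json` file names, the manifest fields, and the idea of naming each `representation` folder after its project are guesses, and the final `git init` step is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def build_canon(root: str,
                seed: Dict[str, Dict[str, str]],
                richest: Dict[str, List[str]]) -> Path:
    """Copy each project's richest files, whole, into its representation folder.

    seed:    project -> relative file path -> file contents
    richest: project -> relative paths chosen to represent that project
    """
    canon = Path(root) / "canon"
    canon.mkdir(parents=True, exist_ok=True)
    manifest = {"projects": {}}
    for project, paths in richest.items():
        representation = canon / project
        representation.mkdir(parents=True, exist_ok=True)
        for relative in paths:
            target = representation / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(seed[project][relative])  # the whole file, not a snippet
        # Project-level package: which files represent this project.
        (representation / "project-package.json").write_text(
            json.dumps({"files": sorted(paths)}, indent=2))
        manifest["projects"][project] = {"files": len(paths)}
    # Top-level package: data about the projects represented in canon.
    (canon / "canon-package.json").write_text(json.dumps(manifest, indent=2))
    return canon


# Toy input, invented for the example.
seed = {"proj-a": {"App.js": "componentDidMount() {}", "util.js": "leftPad()"}}
richest = {"proj-a": ["App.js"]}

with tempfile.TemporaryDirectory() as tmp:
    canon_dir = build_canon(tmp, seed, richest)
    created = sorted(p.relative_to(canon_dir).as_posix()
                     for p in canon_dir.rglob("*") if p.is_file())
    manifest = json.loads((canon_dir / "canon-package.json").read_text())
```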
When we are ready, we expose `canon` to the user's query, not `seed`!! Instead of `popular npm packages`, or anything like it that the user has curated by hand, it is `js`, the javascript `canon`, that gets hit with the query in a canonical search for `componentDidMount`.
`canon` is not a naturally occurring repo. Nor is it a repogroup. It's something else. It's a third kind of thing. It is another kind of search that we are enabling with `canon`. If I can steal any more time I would like to talk about what is essentially different about the search experience you enable when you provide this third kind of queryable as a target for Sourcegraph search.
But we are not ready with `canon` yet. At this point we need to test `canon` by querying it a bunch. So we have a terribly long list of exact match queries relevant to `canon` that we hit `canon` with. And we know what we want to be able to respond to many of these queries with: in the ideal case, it would be the location of one match in one file from each one of the `representation` folders in `canon`. But it will never be that good, and it doesn't have to be! Maybe we end up with 500 projects represented in `canon`. If we could return 10-15 results, each one from a different `representation` folder, hence a different project in the OSS jungle, a prime specimen from a dynamic open source project, I think we'd be happy. Results that you could click through --> first, into the whole file, and then, if you are curious about this project, --> into other matches from that `representation` folder if there were any, and ultimately into the jungle where the code lives, --> into the actual project being represented in `canon`. Yeah, I, for one, would be really happy with that. Right now, suppose `canon` is not capable of responding to enough exact match queries in this satisfactory manner. Well, we just augment `seed` with more projects from `canon-candidates`. This is when we make `seed` really great. We increase the size of `canon` thereby, and we run the spec again. Rinse and repeat. We do this until `canon` gets over the bar that we set on the canonical search experience with our spec. Only then is `canon` exposed to the Sourcegraph user's queries.
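The spec described above could be sketched like this: collapse each query's matches to one per project, then check that enough queries come back with enough distinct projects. `canon_search`, `min_projects`, and `min_pass_rate` are hypothetical; the "10-15 results" figure from above would map onto `min_projects`.

```python
from typing import Callable, Dict, List, Tuple

Match = Tuple[str, str, int]  # (project, file path, line number)


def one_per_project(matches: List[Match]) -> List[Match]:
    """Keep only the first match from each project, preserving result order."""
    first: Dict[str, Match] = {}
    for match in matches:
        first.setdefault(match[0], match)
    return list(first.values())


def passes_spec(canon_search: Callable[[str], List[Match]],
                queries: List[str],
                min_projects: int = 10,
                min_pass_rate: float = 0.8) -> bool:
    """True if enough queries return matches from at least min_projects projects."""
    passed = sum(
        1 for query in queries
        if len(one_per_project(canon_search(query))) >= min_projects
    )
    return passed / len(queries) >= min_pass_rate


# A fake canon index standing in for a real search backend.
index: Dict[str, List[Match]] = {
    "componentDidMount": [
        ("proj-a", "App.js", 10),
        ("proj-a", "App.js", 42),
        ("proj-b", "Main.js", 7),
    ],
    "leftPad": [("proj-a", "util.js", 1)],
}


def search(query: str) -> List[Match]:
    return index.get(query, [])


specimens = one_per_project(search("componentDidMount"))
lenient = passes_spec(search, ["componentDidMount", "leftPad"],
                      min_projects=2, min_pass_rate=0.5)
strict = passes_spec(search, ["componentDidMount", "leftPad"],
                     min_projects=2, min_pass_rate=0.8)
```

If the spec fails, the loop described above would augment `seed` from `canon-candidates`, rebuild `canon`, and run the same check again.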
The commitment to the user in a canonical search feature is not to the freshness of the results -- that they reflect the latest commits, and so on. We want exemplary or representative results. Sourcegraph could recompile its `canons` once a week. Do this, and over time canonical search would get better because, for example, knowledge of what symbols belong in a `core-js` improves. Knowledge of the relative weights of those symbols for purposes of canonical representation of the javascript ecosystem improves. Our richness numbers improve. We get better at extracting the richest files from the richest and most interesting open source projects. We get better at putting their riches at the fingertips of the user. We get better at making those riches more manageable. Etc. Hopefully you get the picture.
Whew. I'm done 🙌
Closing this issue as we now have Search Contexts, where you can create your own lists of repositories to search through and share with others. More info here: https://about.sourcegraph.com/blog/introducing-search-contexts/
Feature request description
I wish I could make something like the following search from the primary search page:
componentDidMount |* js
Think of this search as piping a very basic query into a "javascript canon," as defined by Sourcegraph. I am piping an exact match query in this case, but it might be a type:symbol or a regex query as well.
Such canons might be compiled for each language: a living representation of repositories containing code that is found to be exemplary in that language or ecosystem -- on the basis of the number of github stars on the project, sure, but probably on the basis of some more sophisticated curation process. I don't know how many repositories you would need represented in a canon. But let us say 50, arbitrarily, and then let that have ramifications on the kind of results I get back from my search.
If all goes perfectly well -- imagine for a second that every project in the `js-canon` is a react project -- instead of getting this response as I do currently when I search for `componentDidMount`:
I now see a list of 50 results -- familiar looking results, pretty much just as you are used to seeing when searching across a single repository or repo-group. BUT! these results are substantive references to locations in files across the `js-canon`.
In this idealized case, each item in the list of results would be a substantive reference to a single location in a single file from a different one of the fifty repositories represented in the `js-canon`.
In principle, each result from the list would serve me as a model, or exemplar, of how this particular lifecycle method looks when it is used by the best teams, in the standard way. But not only that, each result would also serve me as a hook, if and when I click through, into one of the projects in the `js-canon`, which is full of interesting and exemplary code.
Is your feature request related to a problem? If so, please describe.
Nope. No problems to report.
Describe alternatives you've considered.
Obviously, repo-groups exist! And I am very glad they do. The user is today free to put together a personally curated canonical group of repos for each language. But I am thinking of something quicker and more dynamic. Something built into the guts of Sourcegraph search and immediately at the fingertips of a new Sourcegraph user. Something smarter and more vigilant than we are, tuned to give us this type of result back.
I have considered a little how canons might be constructed, and structured, the benefits of different sizes of canons, etc. The interesting thing is that I believe this feature request is asking for one implementation of a more general kind of code search. If you conceive of canons as a sort of virtual super-repo compiled strategically behind the scenes by Sourcegraph for Sourcegraph to range over, you can begin to imagine compiling other canons for different, more specific aims. Ultimately, I wish I could come to Sourcegraph and surf the Sourcegraph canons, in addition to being able to interface in detail with this rich data as it is found "in the wild", which is what I think Sourcegraph is doing so well for me already.
Additional context
None