sourcegraph / sourcegraph-public-snapshot

canonical search #2607

Closed b-j-p closed 3 years ago

b-j-p commented 5 years ago

Feature request description

I wish I could make something like the following search from the primary search page:

componentDidMount |* js

Think of this search as piping a very basic query into a "javascript canon," as defined by Sourcegraph. I am piping an exact match query in this case, but it might be a type:symbol or a regex query as well.

Such canons might be compiled for each language: a living representation of repositories containing code that is found to be exemplary in that language or ecosystem -- on the basis of the number of github stars on the project, sure, but probably on the basis of some more sophisticated curation process. I don't know how many repositories you would need represented in a canon. But let us say 50, arbitrarily, and then let that have ramifications on the kind of results I get back from my search.

If all goes perfectly well -- imagine for a second that every project in the js-canon is a react project -- instead of getting this response as I do currently when I search for componentDidMount:

I now see a list of 50 results -- familiar looking results, pretty much just as you are used to seeing when searching across a single repository or repo-group. BUT! these results are substantive references to locations in files across the js-canon.

In this idealized case, each item in the list of results would be a substantive reference to a single location in a single file from a different one of the fifty repositories represented in the js-canon.

In principle, each result from the list would serve me as a model, or exemplar, of how this particular lifecycle method looks when it is used by the best teams, in the standard way. But not only that, each result would also serve me as a hook, if and when I click through, into one of the projects in the js-canon, which is full of interesting and exemplary code.
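A minimal sketch of the "one result per repository" behavior described above. The `Match` type, the `one_per_repo` function, and the repository names are hypothetical and only illustrate the idea; nothing here reflects how Sourcegraph actually returns results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    repo: str   # repository the match was found in
    path: str   # file path inside that repository
    line: int   # line number of the match


def one_per_repo(matches, limit=50):
    """Keep the first match seen for each repository, for at most `limit` repositories."""
    seen = {}
    for m in matches:
        if m.repo not in seen:
            seen[m.repo] = m
            if len(seen) == limit:
                break
    return list(seen.values())


# Hypothetical raw results: two hits in repo-a, one in repo-b.
matches = [
    Match("repo-a", "a.js", 10),
    Match("repo-a", "b.js", 20),
    Match("repo-b", "c.js", 5),
]
result = one_per_repo(matches)
```

With a canon of 50 repositories, the same function would cap the list at 50 entries, one per represented project.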

Is your feature request related to a problem? If so, please describe.

Nope. No problems to report.

Describe alternatives you've considered.

Obviously, repo-groups exist! And I am very glad they do. The user is today free to put together a personally curated canonical group of repos for each language. But I am thinking of something quicker and more dynamic. Something built into the guts of Sourcegraph search and immediately at the fingertips of a new Sourcegraph user. Something smarter and more vigilant than we are, tuned to give us this type of result back.

I have considered a little how canons might be constructed, and structured, the benefits of different sizes of canons, etc. The interesting thing is that I believe this feature request is asking for one implementation of a more general kind of code search. If you conceive of canons as a sort of virtual super-repo compiled strategically behind the scenes by Sourcegraph for Sourcegraph to range over, you can begin to imagine compiling other canons for different, more specific aims. Ultimately, I wish I could come to Sourcegraph and surf the Sourcegraph canons, in addition to being able to interface in detail with this rich data as it is found "in the wild", which is what I think Sourcegraph is doing so well for me already.

Additional context

None

attfarhan commented 5 years ago

Hey @b-j-p, sorry for the very late reply here! Thanks for suggesting this feature and for the detailed writeup. We're thinking about how to improve the default search experience on Sourcegraph.com, so this is very useful.

A couple questions come to mind:

Did you notice the "Popular npm packages" repogroup? This was an attempt to curate a group of JS repositories that are high quality, almost as you describe above. If you did notice and try it, what about that repogroup falls short for your purposes? If not, how could we make it more obvious for you? (if you have the chance to try it, I'd still love to hear your answer for how it may fall short)
Are you requesting the syntax for searching over curated canons/repogroups be slightly simpler (as in your componentDidMount |* js example)?
Do you have any suggestions for curating these canons? We've found that it can be challenging to create repogroups that are relevant to everyone, even people within the same community. A difficulty we've seen in the past is that when we provide a JS canon, a user could search for something 100% valid, but a bit obscure, and that may not show up in the canon. This makes the user think "well, Sourcegraph doesn't work at all" and is still a bad experience.

Let me know if there's anything I'm missing about your proposal! And again, sorry for the late response.

b-j-p commented 5 years ago

Hello @attfarhan. Thanks for getting back to me about this. No apologies, please! I'm just glad to be able to help.

Yes. To your last point: I admit that a canonical search feature would be difficult to get right, and also that, as part of the default search experience, it would have to be gotten right. So there are some risks involved with it from a product perspective. I'll get to your other points too, but I want to go on the record as saying that I am very happy with Sourcegraph, even if canonical search dies here. I opened the issue knowing it was an impractical, extracurricular kind of thing. No matter what, I'll plan on adding value to Sourcegraph in the practical way, too, that is with a PR!

I would appreciate some advice from you about the best good first issue for me to work on. But that's a different conversation 😅.

attfarhan commented 5 years ago

Thanks for the follow-up @b-j-p! I think we're on the same page as to what the difficulties here are, and it's something that we'll continue to think about and see how we can get right. Glad to hear that you're happy with Sourcegraph.

As for good first issues, we have a label that you can look through and see what interests you: https://github.com/sourcegraph/sourcegraph/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22 All contributions are valued and appreciated! 😄

b-j-p commented 5 years ago

To your first point, @attfarhan, about popular npm packages. I updated the image included with my feature request 👆 to give more context about how it looks when Sourcegraph makes this suggestion to the user. In short, yes, I tried it, and the user gets results back from a basic js query, which is pretty great. I remember this list of results being an especially cool one and looking a little bit like the one I'd want from a canonical search feature. In particular, it was fun to see what some of the projects included in this repogroup were.

However:

When I queried popular npm packages, I remember being swamped by the richness of the results. Where I'd like to have gotten from querying your js-canon one specimen from each project represented in the canon containing a match, I got multiple matches across an indefinitely large percentage of the projects in the repogroup. I was hoping to examine one species of little toad in the javascript ecosystem. I was querying a broad and eclectic repogroup to do that. But componentDidMount isn't the kind of function that is used just once in a repository. So I found it all over the place and got lost. That's not too surprising. The data that we are talking about (in the land of open source code) is so rich that it is easy to get lost in it. "Swamped" isn't quite the right word, either. It's more like a jungle out there! And I don't even know if there are toads in the jungle! Ach, I'm getting lost in my imagery. Forget about toads, what counts is that there are going to be some fundamental functions in the js ecosystem, and I was trying with Sourcegraph to look into one of them.
Related to the feeling of being bejungled by Sourcegraph: when querying this repogroup I remember feeling that I was too quickly separated from my search results. I was just being a little curious, and suddenly I find myself deep inside one of the projects in popular npm packages, but which???, sniffing around, getting bitten by all kinds of cool looking mosquitos. I would very much have preferred, in this case, to be able to do some more surfing around the results, in and between projects in popular npm packages. I wished for a cleaner and steadier surface provided me by a series of unique matches, one per project, which stay put when I click around in them. I never got the opportunity to ask and answer the questions: Which one of these popular npm packages employing this particular lifecycle method is my personal favorite? Which project has the nicest looking componentDidMount for purposes of my own project?
So I ended up doing the bare exact match search a number of times, and it seemed from the perspective of the user to be a matter of luck if Sourcegraph made the popular npm packages suggestion. Sourcegraph will not make this suggestion to me every time I make an overly basic, unfiltered query that smells like javascript. The suggestion will be somewhere in the suggested filters area, or not. But wouldn't it be cool (forgetting about the query syntax that I came up with for a second) if Sourcegraph just detected an unfiltered, basic javascript query in the primary searchbox and piped it into the js-canon automatically? That, in place of returning the "Whoops, our jungle is too jungley for this query" sort of response that I see all the time. Actually, if you have a js-canon, I think you have the ability to do that. Because in order to represent a bunch of awesome javascript projects in a canon -- by which I mean a kind of virtual super-repo -- I think you would need to have a grip on a good number of the basic building blocks in the js ecosystem. But I will go into this some more when I address your last point, about curating canons.
In the end, however interesting and well-curated it may be, a repogroup is not a canon. And no matter how well popular npm packages worked with just a quick click to get me some fascinating results back, I never was querying the type of thing I wish I could be querying when making this sort of search.
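A rough sketch of the automatic routing idea from the list above: an unfiltered, single-term query that matches a known building block of an ecosystem gets piped into that ecosystem's canon. The `route_query` function and the `building_blocks` table are hypothetical, and treating any token containing `:` as a filter is a crude assumption rather than Sourcegraph's actual query parsing.

```python
def route_query(query, building_blocks):
    """Return the ecosystem whose canon should receive `query`, or None.

    building_blocks: {ecosystem: set of known fundamental symbols} (hypothetical data).
    """
    tokens = query.split()
    # Multi-term queries and anything that looks like `key:value` go to normal search.
    if len(tokens) != 1 or ":" in tokens[0]:
        return None
    for ecosystem, symbols in sorted(building_blocks.items()):
        if tokens[0] in symbols:
            return ecosystem
    return None


building_blocks = {"js": {"componentDidMount", "render"}}
target = route_query("componentDidMount", building_blocks)
```

A query that is valid but obscure returns `None` here, so it would fall through to ordinary search instead of producing an empty canon result.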

b-j-p commented 5 years ago

About the syntax, @attfarhan, your second point. I don't feel strongly about my proposal at all. As a matter of fact, I wish canonical search were the default "fast path" for Sourcegraph search on the web -- unfiltered exact match search, with no narrower default search context specified, gets piped right into the relevant canon. In which case, no new syntax is required and the syntax issue may be moot. The only problem is when you have a default narrower search context set up, as I do -- I run a local instance of Sourcegraph against two private repos -- then you would need a way syntactically to jump the bounds of the narrower context and aim a query at the greater jungle of OSS.

My suggestion [exact match pattern] |* [ecosystem abbreviation] was added to try to mark the fact that there is a kind of break happening, here. With this search the user would NOT be querying a repo or a repogroup, but some third kind of thing, and I want to call this kind of thing a canon.

In particular:

Whatever you want to call this type of Sourcegraph queryable, this tertium quid, it is going to be the product of a process of "curation" or "canonical representation", which I will try to describe in my next comment (but this is by far the hardest thing you asked of me 😅). The result of this algorithmic curation process would be like a repo, but it could not possibly compile, for example. The code has been removed from the jungle where it lives and taken into a sort of zoo.
Canons would serve Sourcegraph users such as myself as a source of results that are representative of a particular OSS ecosystem. Sort of like a canon in literature: since you cannot possibly read all of English literature, if you want mastery and a large view of the subject, then you need to be trained on a canon.
The shameful consideration informing my syntax 😬 was that I thought it would be fun if the query of a canon sort of looked like a cannon firing a projectile.

Really ashamed to have to admit that...
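For what it is worth, the proposed `[exact match pattern] |* [ecosystem abbreviation]` form is easy to split. The sketch below is only an illustration of the proposed syntax; `parse_canon_query` is a hypothetical helper, and `|*` is not an operator that Sourcegraph supports.

```python
def parse_canon_query(query):
    """Split 'pattern |* ecosystem' into (pattern, ecosystem).

    Returns (query, None) when the query has no '|*' pipe.
    """
    pattern, sep, ecosystem = query.rpartition("|*")
    if not sep:
        return query.strip(), None
    # An empty ecosystem after the pipe is treated as "no canon requested".
    return pattern.strip(), ecosystem.strip() or None


parsed = parse_canon_query("componentDidMount |* js")
```

Splitting on the last `|*` keeps any earlier occurrence of those characters inside the pattern itself.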

b-j-p commented 5 years ago

Ok @attfarhan, just getting back from a vacation with the fam. My plan was to respond to your question about canon curation from the beach, but I never got around to taking out my laptop. I am feeling more dull-witted than I was before I left, to say the least, so you will have to forgive the job I do on this. Hopefully, you come away with an understanding of the general direction I would go with curation/construction of canons.

THE INPUT: Begin with a repogroup much like popular npm packages, let's call it seed. What criteria does a project have to meet in order to make it into seed? I don't know that for sure, and it might even be slightly different for different ecosystems. On the bright side: this is the only part of the curation process that is somewhat manual and dependent on judgement. And since the user will never be querying seed directly, we don't have to get it just right. If I understood you correctly, 👆, much of the grief associated with curating repogroups like popular npm packages stems from the fact that Sourcegraph users are going to be querying it directly. This initial repogroup is not a filter on the user's search. It is just a seed for Sourcegraph itself to query. The important thing to note about the input to the curation process is that the repogroup be put together in such a way as to be able to be expanded/contracted automatically, and in a principled and consistent way, depending on what happens in the course of the curation process. Here are some criteria that we might want to use:

popularity, or number of github stars
number of contributors
number of dependent projects in the ecosystem (or perhaps the number of important dependents??? Here there is an analogy with the original Google internet search algorithm if I am not mistaken)
level of activity in the project

We want to be able to ask, and get a computer to answer, this question: if push comes to shove and we have to discard one, which one of these projects do we NOT want to represent in our canon? Well, all other things being equal, maybe we will discard the least popular project from seed. Or this question: if we have to add another project to seed to improve the quality of our canonical search of this ecosystem, which project should we add? If all other things are equal, well, maybe we will add, from a list of candidate projects that we automatically generate, the one that has the largest number of contributors. I don't know.
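The two questions above can be put to a computer in a few lines. This is only a sketch of the "all other things being equal" tie-breaks; the `Project` record, its fields, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    stars: int          # popularity, or number of github stars
    contributors: int   # number of contributors


def pick_discard(seed):
    """All other things being equal, discard the least popular project in seed."""
    return min(seed, key=lambda p: p.stars)


def pick_addition(candidates):
    """All other things being equal, add the candidate with the most contributors."""
    return max(candidates, key=lambda p: p.contributors)


seed = [Project("proj-a", 900, 40), Project("proj-b", 120, 75)]
candidates = [Project("proj-x", 300, 12), Project("proj-y", 250, 60)]
discard = pick_discard(seed)
addition = pick_addition(candidates)
```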

But here is a criterion that we would certainly want to use to expand/contract seed:

a "richness" number that we calculate in the course of the curation process.

I'll say more about this in one second 👇, but given an average richness number in the projects already in seed, we can set a bar over which candidate projects could enter seed, and also a bar below which projects in seed automatically become subject to being replaced/discarded.
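A minimal sketch of those two bars, assuming each project already has a richness number. The function name and the two factors (`enter_factor`, `exit_factor`) are assumptions chosen for illustration; the comment above only says that both bars are set relative to the average richness in seed.

```python
from statistics import mean


def apply_richness_bars(seed, candidates, enter_factor=1.0, exit_factor=0.5):
    """seed, candidates: {project: richness number}.

    Returns (candidates allowed to enter seed, seed projects subject to replacement).
    """
    avg = mean(seed.values())
    admit = sorted(p for p, r in candidates.items() if r > avg * enter_factor)
    flag = sorted(p for p, r in seed.items() if r < avg * exit_factor)
    return admit, flag


seed = {"proj-a": 10.0, "proj-b": 8.0, "proj-c": 0.0}
candidates = {"proj-x": 9.0, "proj-y": 2.0}
admit, flag = apply_richness_bars(seed, candidates)
```

Here the average richness in seed is 6.0, so a candidate needs more than 6.0 to enter and a seed project below 3.0 becomes replaceable.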

b-j-p commented 5 years ago

HANDLING RICHNESS Here is a sense of what I mean by richness and calculating a richness number. It's kind of the most important part of the canonical representation of an ecosystem. If we can't do something like this @attfarhan -- and what follows is a rough explanation -- I don't know if canonical search is possible at all.

So we have seed, which is a repogroup 👆, and with a repogroup comes a complete list of all symbols appearing in that repogroup. Actually, it would be REALLY great if we had something else in addition to the symbols to work with just here. But I will not discuss it, because, as far as I know, Sourcegraph does not possess that data yet. We already have the symbols, so I will rely on those. We iterate through the collection of symbols appearing in seed and for each symbol we query seed for that symbol. Of course, we want to analyze the results of each query, in a basic way, and record the results of the analysis. It's very basic stuff. We are looking for things like number and distribution of matches. If we get "too many matches" great! If we find a symbol that, while it may not appear a bunch of times, does appear in each project in seed, well, great. We will identify in this way some of the fundamental symbols in the ecosystem. Call this list of symbols core. core will improve as seed improves, but it never has to be a complete list of the core building blocks in the ecosystem, it just has to be a decent sample.

Notice that symbols like componentDidMount will have an entry in core-js. componentDidMount will appear in seed-js at least once. So we know we will be querying seed-js for that symbol, just as we are doing for every single symbol that appears in seed-js. When we do query seed-js looking for componentDidMount, we will get way too many matches. But, wait a second, that's exactly the sort of thing we want to see when querying seed for one of seed's symbols! When this happens we will go ahead and generate a record for componentDidMount in core-js.
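A small sketch of how core could be derived from seed under the two rules above: a symbol enters core if it produces "too many matches" overall, or if it appears in every project in seed. The data layout (one flat list of symbol occurrences per project) and the `too_many` threshold are assumptions for illustration.

```python
from collections import Counter


def build_core(seed, too_many=5):
    """seed: {project: list of symbol occurrences}. Returns the sorted core symbols."""
    total = Counter()    # number of matches across all of seed
    spread = Counter()   # number of projects each symbol appears in
    for symbols in seed.values():
        counts = Counter(symbols)
        total.update(counts)
        spread.update(counts.keys())
    return sorted(
        s for s in total
        if total[s] >= too_many or spread[s] == len(seed)
    )


seed_js = {
    "proj-a": ["componentDidMount"] * 3 + ["render", "helperA"],
    "proj-b": ["componentDidMount"] * 3 + ["render"],
    "proj-c": ["render", "helperC"],
}
core_js = build_core(seed_js)
```

In this toy seed-js, componentDidMount qualifies by sheer number of matches and render qualifies by appearing in every project, while the one-off helpers stay out of core-js.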

Now the next step is to iterate through seed going project-by-project, and, inside of each project, file-by-file. What we want is to be able to assign a richness number to each file in each project in seed, the richness of a file being the measure of the density of the symbols appearing in that file which have an entry in core. A project's richness number is the sum of the richness numbers of its files. The average richness number for a project in seed would then be the sum of all the richness numbers of the projects in seed divided by the total number of projects in it. A lot of the individual files in seed will have a richness number that is actually or effectively nil. But the more of those the better. The files of nil richness will not appear in the canon. We will be extracting only the richest files from these projects in seed. And, it's a jungle out there!, that will end up meaning that we can represent more projects in the canon that we are now curating. We simply expand seed. This can only make the canonical search experience better, I think. More on this in my OUTPUT post to come.
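The richness arithmetic above can be sketched as follows. Reading "density" as the share of a file's symbol occurrences that have an entry in core is an assumption; the project sum and the seed average follow the description directly. All names and data are hypothetical.

```python
def file_richness(symbols, core):
    """Share of a file's symbol occurrences that have an entry in core."""
    if not symbols:
        return 0.0
    return sum(1 for s in symbols if s in core) / len(symbols)


def project_richness(files, core):
    """A project's richness is the sum of the richness numbers of its files."""
    return sum(file_richness(symbols, core) for symbols in files.values())


def average_richness(seed, core):
    """Sum of all project richness numbers divided by the number of projects."""
    return sum(project_richness(files, core) for files in seed.values()) / len(seed)


core = {"componentDidMount", "render"}
seed = {
    "proj-a": {
        "App.js": ["componentDidMount", "render", "helper", "util"],
        "util.js": ["helper"],  # nil richness, so it would not appear in the canon
    },
    "proj-b": {"index.js": ["render", "render"]},
}
richness_a = project_richness(seed["proj-a"], core)
average = average_richness(seed, core)
```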

At this point, we might want to check and see if there are projects in canon-candidates that are richer than projects that we have currently in seed. To do this we would go through the projects in canon-candidates and using core calculate richness numbers for each and see if seed could use expansion/amendment.

If we can make seed better, here, we do, and run the whole process over. We have a new seed and so we get a new core and new richness figures for each project. We keep the old seed and the old richness figures so we can compare. Maybe we merge old-core and new-core and get a super-core list of fundamental symbols and then measure old-seed and new-seed against each other, consult canon-candidates, tweak old-seed and new-seed and make a decision between them. We could even go through this process until we reached a point where an old-seed compared favorably to a new-seed in terms of its richness. But this is not going to be the last opportunity to improve seed.
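One way to picture that loop is sketched below, with heavy simplification: each round swaps the weakest seed project for the richest candidate, and the process stops once no candidate beats the weakest member. The swap rule, the `max_rounds` guard, and the `richness_of` callback are all assumptions; merging old-core and new-core is left out.

```python
def refine_seed(seed, candidates, richness_of, max_rounds=10):
    """seed, candidates: sets of project names.

    richness_of(project, seed) -> float is recomputed every round, because
    core (and so every richness number) depends on the current seed.
    """
    seed, candidates = set(seed), set(candidates)
    for _ in range(max_rounds):
        if not candidates:
            break
        weakest = min(seed, key=lambda p: (richness_of(p, seed), p))
        best = max(candidates, key=lambda p: (richness_of(p, seed), p))
        if richness_of(best, seed) <= richness_of(weakest, seed):
            break  # the old seed compares favorably, so stop here
        seed.remove(weakest)
        seed.add(best)
        candidates.remove(best)
    return seed


# Fixed, hypothetical richness numbers to keep the example small.
richness = {"proj-a": 1.0, "proj-b": 5.0, "proj-x": 3.0, "proj-y": 0.5}
new_seed = refine_seed({"proj-a", "proj-b"}, {"proj-x", "proj-y"},
                       lambda project, seed: richness[project])
```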

I believe it might also be a good idea to rank the symbols appearing in core in terms of their "importance". Importance for the ecosystem and/or importance for the canonical representation of the ecosystem. If we did that, it would be possible for file A, that contains just 1 match for 1 core entry, to have the same richness number as a file, B, which has 2 matches for 2 different core entries. This may not be necessary, and, anyway, it would be a lot easier to weight the symbols in core with that other type of data I referred to above.
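The file A / file B case can be checked with a weighted version of the richness number. The weights below are invented for the example, and the density reading of richness (weighted matches divided by symbol occurrences in the file) is an assumption.

```python
def weighted_richness(symbols, weights):
    """Sum of the importance weights of core matches, divided by the file's symbol count."""
    if not symbols:
        return 0.0
    return sum(weights.get(s, 0.0) for s in symbols) / len(symbols)


# Hypothetical importance weights for entries in core.
weights = {"componentDidMount": 2.0, "render": 1.0, "setState": 1.0}

file_a = ["componentDidMount", "x", "y", "z"]   # 1 match for 1 core entry
file_b = ["render", "setState", "x", "y"]       # 2 matches for 2 different core entries
richness_a = weighted_richness(file_a, weights)
richness_b = weighted_richness(file_b, weights)
```

Both files come out at 0.5, because the single heavily weighted match in file A counts as much as the two lighter matches in file B.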

b-j-p commented 5 years ago

THE OUTPUT Finally we arrive at the point in the curation process where we can construct a canon! Or, because it is so novel for Sourcegraph to be not only ranging over naturally occurring projects, but also generating highly artificial projects for itself to range over, I'm going to talk as if Sourcegraph is doing all of this, not "us".

Here, Sourcegraph performs the long awaited mkdir canon
Inside of canon, for each project in seed, Sourcegraph makes a representation folder
Sourcegraph scans the richness numbers it has generated for seed 👆
It extracts the n richest files (a small fraction), the files that it needs from each project in seed in order to represent that project for purposes of canonical search.
For each of these files from seed, the skimmings, Sourcegraph deposits a copy of the whole file into the home project's representation folder inside of canon.
For each file Sourcegraph retains a jungle address, all metadata from its place in the wild. Maybe Sourcegraph records this data in a sort of canon-package file that sits at the root level of canon. Or maybe there is a top level canon-package that deals with data about the projects represented in canon, and then, in each representation folder, a project level project-package with data about the particular files Sourcegraph is using to represent the project.
then git init
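The steps above can be sketched with the standard library, assuming seed is available as a directory tree and the richness numbers are already computed. The `build_canon` function, the `canon-package.json` file name, and the manifest fields are hypothetical; the final `git init` step is left out to keep the sketch free of external commands.

```python
import json
import shutil
import tempfile
from pathlib import Path


def build_canon(seed_root, richness, canon_root, n=2):
    """richness: {project: {relative file path: richness number}}."""
    canon_root = Path(canon_root)
    canon_root.mkdir(parents=True, exist_ok=True)             # mkdir canon
    manifest = {}
    for project, files in richness.items():
        rep = canon_root / project                            # representation folder
        rep.mkdir(parents=True, exist_ok=True)
        # The n richest files; files of nil richness never enter the canon.
        ranked = sorted((f for f in files if files[f] > 0),
                        key=lambda f: (-files[f], f))[:n]
        for rel in ranked:
            dest = rep / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(Path(seed_root) / project / rel, dest)  # copy the whole file
        # "Jungle address": where each file lives in the wild.
        manifest[project] = [
            {"path": rel, "richness": files[rel], "origin": f"{project}/{rel}"}
            for rel in ranked
        ]
    (canon_root / "canon-package.json").write_text(json.dumps(manifest, indent=2))
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    seed_root = Path(tmp) / "seed"
    (seed_root / "proj-a").mkdir(parents=True)
    for name in ("App.js", "util.js", "index.js"):
        (seed_root / "proj-a" / name).write_text("// " + name)
    richness = {"proj-a": {"App.js": 0.5, "util.js": 0.0, "index.js": 0.9}}
    manifest = build_canon(seed_root, richness, Path(tmp) / "canon", n=2)
    copied = sorted(p.name for p in (Path(tmp) / "canon" / "proj-a").iterdir())
```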

When we are ready, we expose canon to the user's query, not seed!! Instead of popular npm packages, or anything like it that the user has curated by hand, it is js, the javascript canon, that gets hit with the query in a canonical search for componentDidMount.

canon is not a naturally occurring repo. Nor is it a repogroup. It's something else. It's a third kind of thing. It is another kind of search that we are enabling with canon. If I can steal any more time I would like to talk about what is essentially different about the search experience you enable when you provide this third kind of queryable as a target for Sourcegraph search.

But we are not ready with canon yet. At this point we need to test canon, by querying it a bunch. So we have a terribly long list of exact match queries relevant to canon that we hit canon with. And we know what we want to be able to respond to many of these queries with: in the ideal case, it would be the location of one match in one file from each one of the representation folders in canon. But it will never be that good, and it doesn't have to be! Maybe we end up with 500 projects represented in canon. If we could return 10-15 results, each one from a different representation folder, hence a different project in the OSS jungle, a prime specimen from a dynamic open source project, I think we'd be happy. Results that you could click through --> first, into the whole file, and then, if you are curious about this project --> into other matches from that representation folder if there were any, and ultimately into the jungle where the code lives, --> into the actual project being represented in canon. Yeah I, for one, would be really happy with that. Right now, suppose canon is not capable of responding to enough exact match queries in this satisfactory manner. Well, we just augment seed with more projects from canon-candidates. This is when we make seed really great. We increase the size of canon thereby, and we run the spec again. Rinse and repeat. We do this until canon gets over the bar that we set on the canonical search experience with our spec. Only then is canon exposed to the Sourcegraph user's queries.
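A sketch of that spec, assuming canon is held in memory as project, file, and text, and that an exact match is a plain substring test. The thresholds `min_projects` and `pass_rate` are assumptions standing in for "10-15 results" and "the bar that we set"; the toy data uses a smaller threshold so it fits on screen.

```python
def projects_matching(canon, query):
    """Projects in canon with at least one file containing an exact match for `query`."""
    return {
        project
        for project, files in canon.items()
        if any(query in text for text in files.values())
    }


def passes_spec(canon, queries, min_projects=10, pass_rate=0.8):
    """True if enough queries return matches from at least `min_projects` different projects."""
    satisfied = sum(
        1 for q in queries if len(projects_matching(canon, q)) >= min_projects
    )
    return satisfied / len(queries) >= pass_rate


canon = {
    "proj-a": {"App.js": "componentDidMount() { render() }"},
    "proj-b": {"index.js": "componentDidMount() { render() }"},
    "proj-c": {"main.js": "render()"},
}
queries = ["componentDidMount", "render", "obscureThing"]
strict = passes_spec(canon, queries, min_projects=2, pass_rate=0.8)
lenient = passes_spec(canon, queries, min_projects=2, pass_rate=0.6)
```

Two of the three queries are satisfied here, so this canon fails the stricter bar and passes the lenient one; a failing canon is the signal to augment seed from canon-candidates and run the spec again.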

The commitment to the user in a canonical search feature is not to the freshness of the results -- that they reflect the latest commits, and so on. We want exemplary or representative results. Sourcegraph could recompile its canons once a week. Do this, and over time canonical search would get better because, for example, knowledge of what symbols belong in a core-js improves. Knowledge of the relative weights of those symbols for purposes of canonical representation of the javascript ecosystem improves. Our richness numbers improve. We get better at extracting the richest files from the richest and most interesting OS projects. We get better at putting their riches at the fingertips of the user. We get better at making those riches more manageable. etc. Hopefully you get the picture.

Whew. I'm done 🙌

limitedmage commented 3 years ago

Closing this issue as we now have Search Contexts, where you can create your own lists of repositories to search through and share with others. More info here: https://about.sourcegraph.com/blog/introducing-search-contexts/