Open jtibshirani opened 9 months ago
This might have implications for search jobs, where we only support searching for content right now. This is likely not a big deal but we should keep it in mind.
I just came across this as well but from a different angle while testing the 5.3.0-rc.1 image. The search zoekt sourcegraph frontend -f:grpc repo:^github\.com/sourcegraph/zoekt$ says there are "4 path" results. When I filter by that there are no results. Then when I click "move to query" there are still no results. This is quite a frustrating bug tbh. I wonder if the new filter panel is incorrectly calculating the numbers on the right? cc @camdencheek
@keegancsmith this is an interesting one! The search you linked has three terms: zoekt, sourcegraph, and frontend. We determine the count for type:path by checking whether there are any matched ranges in the path name. However, the path highlights we return only ever match two of the terms. The third (frontend) only matches in the file body. So, when we select type:path, nothing matches because only 2 of the 3 terms are ever matched in the path.
One way to interpret this is that type:file and type:path are already not actually independent types. If we go ahead with the "merge type:file and type:path" idea, this gets solved naturally.
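The mismatch described above can be sketched with a toy model. This is only an illustration under assumptions: the function names, the substring matching, and the sample files are all hypothetical and are not how the real backend computes matches.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    path: str
    content: str


def terms_in(text, terms):
    """Return the subset of terms that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return {t for t in terms if t.lower() in lowered}


def default_search(files, terms):
    """Assumed default behavior: every term must match in the path or the content."""
    wanted = set(terms)
    return [f for f in files
            if terms_in(f.path, terms) | terms_in(f.content, terms) == wanted]


def panel_path_count(results, terms):
    """Assumed panel behavior: count results with at least one match in the path."""
    return sum(1 for f in results if terms_in(f.path, terms))


def filter_type_path(files, terms):
    """Assumed type:path behavior: every term must match in the path alone."""
    wanted = set(terms)
    return [f for f in files if terms_in(f.path, terms) == wanted]


terms = ["zoekt", "sourcegraph", "frontend"]
# Hypothetical sample data: two terms match the path, the third only the body.
files = [
    FileResult("zoekt/sourcegraph/client.go", "// talks to the frontend"),
    FileResult("README.md", "zoekt is a code search engine"),
]

results = default_search(files, terms)
print(len(results))                         # results shown by default
print(panel_path_count(results, terms))     # count shown next to "path"
print(len(filter_type_path(files, terms)))  # results after selecting type:path
```

In this model the panel reports one "path" result, but selecting type:path returns nothing. If the two types were merged, the filter would use the same predicate as `default_search`, so the count and the filtered results would agree.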
This does bring up an interesting point though: should sourcegraph frontend zoekt also match against the repo name a file belongs to? And if it does, even if we merge type:file and type:path, we still have the same issue of independence between result types.
It almost feels like we need a more "hierarchical" data model for search. I don't really know what that means concretely, but our flat result types don't really make as much sense if we are searching across multiple layers of the hierarchy by default.
This does bring up an interesting point though: should sourcegraph frontend zoekt also match against the repo name a file belongs to?
That is something we want to experiment with.
And if it does, even if we merge type:file and type:path, we still have the same issue of independence between result types.
We discussed this a bit in our sync today. I can't remember exactly the outcome lol, but I do remember Julie saying she was gonna update her view point here :)
It almost feels like we need a more "hierarchical" data model for search. I don't really know what that means concretely, but our flat result types don't really make as much sense if we are searching across multiple layers of the hierarchy by default.
Agreed. Would be great if we only had 1 result type and the backend could evolve to return lots of different sources.
(side note: I've always wanted to experiment with tree-shaped results. Like, in the UI, the ability to expand repo > commit > directory > file and see aggregated counts at each level. hackathon project?)
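For what it's worth, the tree-shaped idea can be sketched in a few lines. This is a rough illustration only: the `Node` type, the helper names, and the sample results are all made up, and it assumes flat results that already carry repo, commit, directory, and file.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    count: int = 0
    children: dict = field(default_factory=dict)


def add_match(root, levels, matches):
    """Add `matches` to the root and to every node along the path given by `levels`."""
    root.count += matches
    node = root
    for name in levels:
        node = node.children.setdefault(name, Node(name))
        node.count += matches


def render(node, depth=0):
    """Return indented lines, one per node, with the aggregated count at each level."""
    lines = [f"{'  ' * depth}{node.name} ({node.count})"]
    for child in node.children.values():
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical flat results: (repo, commit, directory, file, match count).
flat = [
    ("repo-a", "HEAD", "cmd", "main.go", 3),
    ("repo-a", "HEAD", "cmd", "flags.go", 1),
    ("repo-a", "HEAD", "docs", "README.md", 2),
    ("repo-b", "HEAD", "src", "index.ts", 4),
]

root = Node("all results")
for repo, commit, directory, file_name, n in flat:
    add_match(root, [repo, commit, directory, file_name], n)

print("\n".join(render(root)))
```

Each level carries the sum of the matches beneath it, which is what an expandable repo > commit > directory > file view would need to display.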
I thought about this more and came back to my original opinion :) Here is the way I'm thinking about it. Results are like "entities" in our system. Entities have types and fields, and these are separate things. There are really only a few types:
Then each of these have fields: repositories have name, tags, and description, whereas files have file path, contents, and the name of the repository they belong to. For me, it's confusing that we consider "path" to be an entity type, whereas it would be better understood as a field. Similarly, "repo name" could just be a field on a "file" entity.
So in this ideal world
path:...
, content:...
I'll have to ponder the hierarchical data model! I suspect we could go quite far sticking with our "denormalized" model, and it matches what a lot of people expect (including GH search 😊) What do you all think??
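To make the "types vs. fields" distinction concrete, here is a minimal sketch. The class and function names are hypothetical, and the fields are just the ones mentioned above; it only illustrates the idea and is not a proposal for the actual result types.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    name: str
    tags: list
    description: str


@dataclass
class File:
    path: str
    contents: str
    repo_name: str  # denormalized: copied from the repository the file belongs to


def matching_fields(entity, term):
    """Return the names of the entity's string fields that contain `term`."""
    return [name for name, value in vars(entity).items()
            if isinstance(value, str) and term in value]


# Hypothetical sample entities.
f = File(path="cmd/server/main.go", contents="package main", repo_name="example/zoekt")
r = Repository(name="example/zoekt", tags=["search"], description="code search engine")

print(matching_fields(f, "main"))    # fields of the file matching "main"
print(matching_fields(f, "zoekt"))   # fields of the file matching "zoekt"
print(matching_fields(r, "search"))  # fields of the repository matching "search"
```

Here "path" and "repo name" are fields on a single file entity, so one result can report which of its fields matched instead of being split across separate result types.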
There are really only a few types:
I would add "commit" to this list as well, which also can be a tricky one since there are some tangled relationships between commits and files.
So in this ideal world
Fully on board. This strikes me as easy to build a mental model around.
I'll have to ponder the hierarchical data model!
No need to go too far down that rabbit hole 🙂 Honestly, I think it's more of an interesting thought experiment than a realistic path forward.
Just came across this (very old) issue in our backlog, and I thought it was interesting to share given the conversation here about hierarchical results
I agree that path and content should be the same type. The main motivation is that this is how our default search works, and indeed they are the same core entity. More motivation:
file: and content: respectively).
select:file and select:content respectively).
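Currently, type:file only matches file contents. This creates a surprising experience with the new filters panel: by default we match across file contents and name (since we search both type:file and type:path), but then when you select "Code" we only match contents.
It'd make sense to update type:file to also search the file path. The reasoning is that searching for files should consider all data associated with the file, including important fields like its name.
/cc @sourcegraph/search-platform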
to also search the file path. The reasoning is that searching for files should consider all data associated with the file, including important fields like its name./cc @sourcegraph/search-platform