Search: `type:file` filter should also match file paths

jtibshirani commented 9 months ago

Currently, type:file only matches file contents. This creates a surprising experience with the new filters panel: by default we match across file contents and name (since we search both type:file and type:path), but then when you select "Code" we only match contents.

It'd make sense to update type:file to also search the file path. The reasoning is that searching for files should consider all data associated with the file, including important fields like its name.

/cc @sourcegraph/search-platform

stefanhengl commented 9 months ago

This might have implications for search jobs, where we only support searching for content right now. This is likely not a big deal but we should keep it in mind.

keegancsmith commented 9 months ago

I just came across this as well but from a different angle while testing the 5.3.0-rc.1 image. The query `zoekt sourcegraph frontend -f:grpc repo:^github\.com/sourcegraph/zoekt$` says there are "4 path" results. When I filter by that, there are no results. Then when I click "move to query", there are still no results. This is quite a frustrating bug tbh. I wonder if the new filter panel is incorrectly calculating the numbers on the right? cc @camdencheek

camdencheek commented 9 months ago

@keegancsmith this is an interesting one! The search you linked has three terms: zoekt, sourcegraph, and frontend. We determine the count for type:path by checking whether there are any matched ranges in the path name. However, the path highlights we return only ever match two of the terms. The third (frontend) only matches in the file body. So, when we select type:path, nothing matches because only 2 of the 3 terms are ever matched in the path.
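To make the mismatch concrete, here is a minimal Python sketch. It is not Sourcegraph's actual implementation: it assumes plain substring matching, AND semantics across terms, and a hypothetical `FileResult` type. The function names, the path, and the contents are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    """Hypothetical stand-in for a file result; not the real API type."""
    path: str
    contents: str


def matches_default(result, terms):
    # Default search: every term must match somewhere, in the path OR the contents.
    return all(t in result.path or t in result.contents for t in terms)


def counts_toward_path_facet(result, terms):
    # Facet count as described above: the result matched the default search
    # and at least one term has a matched range in the path.
    return matches_default(result, terms) and any(t in result.path for t in terms)


def matches_path_only(result, terms):
    # Selecting type:path requires every term to match in the path itself.
    return all(t in result.path for t in terms)


terms = ["zoekt", "sourcegraph", "frontend"]
result = FileResult(
    path="sourcegraph/zoekt/search.go",            # hypothetical path
    contents="// called by the frontend service",  # hypothetical contents
)

print(counts_toward_path_facet(result, terms))  # True: counted under "path"
print(matches_path_only(result, terms))         # False: dropped once filtered
```

The count and the filter disagree because they ask different questions: "does any term hit the path?" versus "do all terms hit the path?".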

One way to interpret this is that type:file and type:path are already not actually independent types. If we go ahead with the "merge type:file and type:path", this gets solved naturally.

camdencheek commented 9 months ago

This does bring up an interesting point though: should sourcegraph frontend zoekt also match against the repo name a file belongs to? And if it does, even if we merge type:file and type:path, we still have the same issue of independence between result types.

It almost feels like we need a more "hierarchical" data model for search. I don't really know what that means concretely, but our flat result types don't really make as much sense if we are searching across multiple layers of the hierarchy by default.

keegancsmith commented 9 months ago

> This does bring up an interesting point though: should sourcegraph frontend zoekt also match against the repo name a file belongs to?

That is something we want to experiment with.

> And if it does, even if we merge type:file and type:path, we still have the same issue of independence between result types.

We discussed this a bit in our sync today. I can't remember the exact outcome lol, but I do remember Julie saying she was gonna update her viewpoint here :)

> It almost feels like we need a more "hierarchical" data model for search. I don't really know what that means concretely, but our flat result types don't really make as much sense if we are searching across multiple layers of the hierarchy by default.

Agreed. Would be great if we only had 1 result type and the backend could evolve to return lots of different sources.

camdencheek commented 9 months ago

(side note: I've always wanted to experiment with tree-shaped results. Like, in the UI, the ability to expand repo > commit > directory > file and see aggregated counts at each level. hackathon project?)
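As a rough illustration of the aggregated counts, here is a small Python sketch. It is only an assumption about what such a tree could look like: the input shape (a dict from repo and file path to match count) is hypothetical, and the commit level is left out for brevity.

```python
from collections import Counter


def aggregate_counts(matches):
    """Roll per-file match counts up to every ancestor level.

    `matches` maps (repo, file path) to a match count. The returned Counter is
    keyed by tuples: (repo,), (repo, dir), ..., (repo, dir, ..., file).
    """
    totals = Counter()
    for (repo, path), count in matches.items():
        parts = [repo] + path.split("/")
        # Add this file's count to the file itself and each of its ancestors.
        for depth in range(1, len(parts) + 1):
            totals[tuple(parts[:depth])] += count
    return totals


# Hypothetical match counts, for illustration only.
matches = {
    ("sourcegraph/zoekt", "cmd/main.go"): 3,
    ("sourcegraph/zoekt", "cmd/flags.go"): 1,
    ("sourcegraph/zoekt", "README.md"): 2,
}
totals = aggregate_counts(matches)
print(totals[("sourcegraph/zoekt",)])        # 6
print(totals[("sourcegraph/zoekt", "cmd")])  # 4
```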

jtibshirani commented 9 months ago

I thought about this more and came back to my original opinion :) Here is the way I'm thinking about it. Results are like "entities" in our system. Entities have types and fields, and these are separate things. There are really only a few types:

- Repositories
- Files
- Symbol definitions

Then each of these has fields: repositories have name, tags, and description, whereas files have file path, contents, and the name of the repository they belong to. For me, it's confusing that we consider "path" to be an entity type, whereas it would be better understood as a field. Similarly, "repo name" could just be a field on a "file" entity.

So in this ideal world:

- We'd only search one type at a time, so search results can easily be compared to each other and we can craft a great search experience for each type
- We'd search files by default, and match across all relevant fields (contents, path, repo name)
- If you want to only match on paths or content, you would use the "field filter" syntax like path:..., content:...
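A minimal Python sketch of this "entity with fields" idea follows. It is an illustration under assumptions, not a proposal for the real data model: `FileEntity`, `matches`, and the sample values are hypothetical, matching is plain substring containment, and the `field` argument stands in for a field filter such as path:... or content:....

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileEntity:
    """Hypothetical 'file' entity; repo, path, and contents are its fields."""
    repo: str
    path: str
    contents: str


FIELDS = ("repo", "path", "contents")


def matches(entity: FileEntity, terms, field: Optional[str] = None) -> bool:
    # With no field filter, each term may match in any field of the file.
    # With a field filter (e.g. "path"), every term must match in that field.
    fields = (field,) if field else FIELDS
    return all(any(t in getattr(entity, f) for f in fields) for t in terms)


# Hypothetical file, for illustration only.
f = FileEntity(
    repo="github.com/sourcegraph/zoekt",
    path="cmd/webserver/main.go",
    contents="package main // grpc server",
)

print(matches(f, ["zoekt", "webserver", "grpc"]))          # True: terms spread across fields
print(matches(f, ["zoekt", "webserver", "grpc"], "path"))  # False: only "webserver" is in the path
print(matches(f, ["webserver"], "path"))                   # True
```

Because there is a single entity type, the default search and the field-restricted search return the same kind of result and differ only in which fields are consulted.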

I'll have to ponder the hierarchical data model! I suspect we could go quite far sticking with our "denormalized" model, and it matches what a lot of people expect (including GH search 😊) What do you all think??

camdencheek commented 9 months ago

> There are really only a few types:

I would add "commit" to this list as well, which also can be a tricky one since there are some tangled relationships between commits and files.

> So in this ideal world

Fully on board. This strikes me as easy to build a mental model around.

> I'll have to ponder the hierarchical data model!

No need to go too far down that rabbit hole 🙂 Honestly, I think it's more of an interesting thought experiment than a realistic path forward.

camdencheek commented 9 months ago

Just came across this (very old) issue in our backlog, and I thought it was interesting to share given the conversation here about hierarchical results

keegancsmith commented 7 months ago

I agree that path and content should be the same type. The main motivation is that this is how our default search works, and indeed they are the same core entity. More motivation:

- We have ways to only match path or content (file: and content: respectively).
- We have a way to transform the entity into path or content (select:file and select:content respectively).
- Our clients highlight the paths via the query, not via what our API returns. I.e., we work around the fact that path matches are weird in our current API.

sourcegraph / sourcegraph-public-snapshot

Search: `type:file` filter should also match file paths #60338