kythe / kythe

Kythe is a pluggable, (mostly) language-agnostic ecosystem for building tools that work with code.
https://kythe.io
Apache License 2.0

Support branch/ref/revision in cross-references #4181

Open salguarnieri opened 4 years ago

salguarnieri commented 4 years ago

Add support for xrefs inside a particular branch or revision. One possible solution is to add a revision field to the vname.

creachadair commented 4 years ago

Adding revision to the vname is probably not the right approach: Then the vname of an object will differ based on which revisions it appears in. You could postprocess that out to collapse the graph back to what we have now, but adding this to every vname in the system only to throw it away could be a big data-size issue. VNames are probably the majority of the graph data, prior to normalization.
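To make the identity concern concrete, here is a minimal sketch. The five fields `signature`, `corpus`, `root`, `path`, and `language` mirror Kythe's VName; the `revision` field, the `collapse` helper, and all sample values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VName:
    signature: str
    corpus: str
    root: str
    path: str
    language: str
    # Hypothetical extra field; Kythe's VName has no revision component.
    revision: Optional[str] = None


def collapse(v: VName) -> VName:
    """Drop the hypothetical revision so the same symbol compares equal again."""
    return replace(v, revision=None)


# The same function as seen at two revisions (made-up values).
f_at_r1 = VName("pkg.F", "example", "", "pkg/f.go", "go", revision="r1")
f_at_r2 = replace(f_at_r1, revision="r2")

# With the revision in the name, one symbol becomes two graph identities.
distinct_before = len({f_at_r1, f_at_r2})
# Postprocessing the revision away merges them back into one.
distinct_after = len({collapse(f_at_r1), collapse(f_at_r2)})
```

Every node would carry the extra field until the collapse step runs, which is where the data-size concern comes from.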

I believe we already have information about build configuration associated with file nodes. Revision and branch metadata would be good candidates for similar treatment: That way, you can filter xref anchors to restrict the view, without screwing up the association between semantic objects.

This approach doesn't solve queries like "show me relationships that only existed in the graph at this revision", but to do that you basically have to partition the whole graph anyway. I speculate that this query is not all that interesting to users, as compared to "show me only the references that existed in files at this revision".
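A rough sketch of the file-level alternative, under simplifying assumptions: revision membership is recorded once per file, and references are filtered by the file their anchor lives in. The `file_revisions` mapping, the `references` list, and `refs_at` are hypothetical names, and the model ignores that a file's contents (and so its anchors) can differ between revisions.

```python
# Hypothetical per-file metadata: file path -> revisions in which it was indexed.
file_revisions = {
    "pkg/f.go": {"r1", "r2"},
    "pkg/g.go": {"r2"},
}

# (file containing the anchor, semantic node referenced) pairs.
# The semantic node's identity carries no revision at all.
references = [
    ("pkg/f.go", "pkg.F"),
    ("pkg/g.go", "pkg.F"),
]


def refs_at(node: str, revision: str) -> list:
    """Return files with a reference to `node` that exist at `revision`."""
    return [
        path
        for path, target in references
        if target == node and revision in file_revisions.get(path, set())
    ]
```

Here `refs_at("pkg.F", "r1")` only consults file metadata, so the association between anchors and `pkg.F` is untouched; the revision acts purely as a view filter.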

jaysachs commented 4 years ago

> Adding revision to the vname is probably not the right approach

I wouldn't say "probably not", but I'd agree with "needs significant discussion".

We already have a relatively baked-in assumption that along the time axis of a single branch, "unchanging" symbols retain their identity. A philosophical question: should the same assumption hold across branches? Are branches really more like separate "corpora" namespaces?

> This approach doesn't solve queries like "show me relationships that only existed in the graph at this revision", but to do that you basically have to partition the whole graph anyway. I speculate that this query is not all that interesting to users, as compared to "show me only the references that existed in files at this revision".

I agree with your suggestion that data across branches are (generally) not going to be commingled. (That's not to say I can't imagine interesting use cases for it.)

One difference between build configuration and branches is that the "interesting" queries were "show me all references for all build configurations, tagged/grouped by config". The canonical case for this is decorations for C++ files with multiple build configurations exposed on cs.chromium.org; decorations from all build configurations should be shown. For branches, the use cases we're looking at would be that branch would always be a "filter". I'm not saying this suggests one representation over another, but wanted to get requirements out there so we have an informed discussion.

We should also probably mention another possibility thrown around, which would cement the "distinctness" across branches: encode the branch in a structured corpus name. If this was the general approach taken, I'd definitely prefer a separate field to that.
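For comparison, a small sketch of what a structured corpus name might look like. The `@` separator and the `encode_corpus` / `decode_corpus` helpers are assumptions made for illustration; nothing in this thread or in Kythe defines such a format.

```python
from typing import Optional, Tuple

SEP = "@"  # hypothetical separator between corpus and branch


def encode_corpus(corpus: str, branch: str) -> str:
    """Fold the branch into the corpus label, e.g. 'corpus@branch'."""
    return f"{corpus}{SEP}{branch}"


def decode_corpus(label: str) -> Tuple[str, Optional[str]]:
    """Split a label back into (corpus, branch); branch is None if absent."""
    base, sep, branch = label.rpartition(SEP)
    if not sep:
        return label, None
    return base, branch


main_label = encode_corpus("github.com/kythe/kythe", "main")
release_label = encode_corpus("github.com/kythe/kythe", "release")
```

Every consumer that wants the branch has to know and parse the convention, and a corpus or branch name that itself contains the separator becomes ambiguous. A separate field avoids both problems, which is one reading of the preference stated above.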

creachadair commented 4 years ago

> We already have a relatively baked-in assumption that along the time axis of a single branch, "unchanging" symbols retain their identity. A philosophical question: should the same assumption hold across branches? Are branches really more like separate "corpora" namespaces?

That's an important design question. In my view, yes. In that case I would fold revision information into the corpus label if it's relevant, or possibly into the root, depending on the layout. The point is that if the identity does depend on the revision, it's really not the same corpus as a different revision of the same files, even if they are superficially similar.

I think in most cases that's not really what users expect, though: The idea that F isn't the same function just because I changed something about G has historically confused people.

> I agree with your suggestion that data across branches are (generally) not going to be commingled. (That's not to say I can't imagine interesting use cases for it.)

Agreed.

> One difference between build configuration and branches is that the "interesting" queries were "show me all references for all build configurations, tagged/grouped by config". The canonical case for this is decorations for C++ files with multiple build configurations exposed on cs.chromium.org; decorations from all build configurations should be shown. For branches, the use cases we're looking at would be that branch would always be a "filter". I'm not saying this suggests one representation over another, but wanted to get requirements out there so we have an informed discussion.

Yes. And for C++ in particular, compiler flags may have wide-ranging effects on the shape of the graph, in a way that isn't particularly common to other languages.

> We should also probably mention another possibility thrown around, which would cement the "distinctness" across branches: encode the branch in a structured corpus name. If this was the general approach taken, I'd definitely prefer a separate field to that.

It depends, I think, on whether the two are ever really separable. If you can't meaningfully change the revision without essentially updating all your code, you might as well consider it a separate corpus of files. The fact that it's stored in the same repository is just a compression scheme at that point.

In practice, though, I think many (most?) revision changes aren't that substantial.

In any case, while it certainly is possible to add version tags to the VName, I would recommend being very cautious about going down that road, as the effects are pretty substantial.

shahms commented 4 years ago

It's also possible to introduce a new node kind (or subkind), or incorporate revision into the file path, rather than the corpus.

creachadair commented 4 years ago

> It's also possible to introduce a new node kind (or subkind), or incorporate revision into the file path, rather than the corpus.

Yes, and anything that is item-per-file has much less space overhead than item-per-vname. So if that serves the queries you care about, it's probably a better solution. Whether it's a separate node of its own (in which case maybe you could query for "everything with a version tag that looks kind of like this") or a property of the file is probably up to how much preprocessing you want to do before serving it.
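A back-of-the-envelope sketch of the overhead difference. All numbers below (file count, vnames per file, the tag string) are made up for illustration and are not measurements of any real Kythe graph; `tag_overhead_bytes` is a hypothetical helper.

```python
def tag_overhead_bytes(item_count: int, tag: str) -> int:
    """Raw bytes added by attaching `tag` once to each of `item_count` items."""
    return item_count * len(tag.encode("utf-8"))


# Illustrative assumptions only.
files = 1_000
vnames_per_file = 200
tag = "refs/heads/main@0123abcd"

# One tag per file node versus one tag per vname in those files.
per_file = tag_overhead_bytes(files, tag)
per_vname = tag_overhead_bytes(files * vnames_per_file, tag)
ratio = per_vname // per_file
```

Under these assumptions the per-vname cost is larger by exactly the average number of vnames per file, before any compression or normalization is taken into account.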