regexident / cargo-modules

Visualize/analyze a Rust crate's internal structure
Mozilla Public License 2.0
926 stars 47 forks source link

False positive for circularity #307

Open pdh11 opened 1 month ago

pdh11 commented 1 month ago

In this example:

pub mod foo {
    pub struct Foo {}
}

pub mod bar {
    use crate::foo::Foo;
    pub struct Fnord;

    pub trait Bar {
        fn send(&self, n: Fnord);
    }

    impl Bar for Foo {
        fn send(&self, _: Fnord) {}
    }
}

it seems pretty clear that the reasonable assessment is that module "bar" depends on module "foo", but not vice versa. Yet cargo modules dependencies --no-traits --no-types --no-fns --layout circo shows a cyclic "uses"-dependency: deps

Without the --no-traits --no-types --no-fns it's clear what happened: the trait in mod bar has a function which uses type bar::Fnord -- and that function is depicted as "owned by" foo just because it's implemented on a struct which is owned by foo: deps2

This is unhelpful, IMO: the example (distilled from a much larger one, of course) has in fact been written with all due principles of levelisability and non-circularity in mind, but cargo-modules (v0.16.6) produces a "false positive" for circularity.

regexident commented 1 month ago

Hi @pdh11 thanks for the bug report!

Without the --no-traits --no-types --no-fns it's clear what happened: the trait in mod bar has a function which uses type bar::Fnord -- and that function is depicted as "owned by" foo just because it's implemented on a struct which is owned by foo

You are absolutely right in that this is unexpected and feels wrong. This is caused by cargo-modules being somewhat impl-blind. Which is to say that while it does scan impls it does not represent them in the generated graph. As such like you already outlined methods end up getting incorrectly attributed to whatever module the self-type is declared in, instead of the actual module where the method's containing impl is found.

This is definitely a bug!

That said adding impl nodes to the graph isn't quite as simple as it might seem and for somewhat surprising reasons: testing.

The project heavily relies on snapshot testing as rust-analyzer's APIs are basically a black box and really hard to write any other kinds of tests against, while still keeping things maintainable.

Both, the snapshot tests, as well as GraphViz require nodes to have unique ids assigned to them. While effectively every object from rust-analyzer provides an identifier they tend to change between executions, which goes against out snapshotting needs. As such we currently use the node's fully qualified path Rust as its identifier, which already comes with a couple of weaknesses (e.g. not being able to distinguish between crate versions). But impls (and several other potentially interesting objects from rust-analyzer) don't even have an unambiguous path. There can even be multiple impl Foo {} within the same module, making it rather tricky to distinguish between them (especially when trying to avoid frequently changing snapshot files, such as if one was to include the entity's actual source code range in its id). As such I've so far avoided dealing with impl nodes.

For this to work properly —afaict— one would have to first add impl nodes (and find a stable way to identify impls in snapshot tests) and would then probably have to either generate a layered graph (where each layer would describe different semantics) and allow for selecting specific layers for output, or introduce more edge types, such as implements. Either way one would have to then introduce further or extend existing filtering capabilities to avoid not being able to see the ~forest~ graph for all the newly added ~trees~ edges. It's thorny business.

pdh11 commented 1 month ago

Sheer spitballing here -- I confess I haven't looked at how either cargo-modules or rust-analyzer actually work -- but wouldn't, say, the identifier full-path-of-trait +"-for-"+ full-path-of-struct be unique and stable? AIUI it's not legit to implement the same trait for the same struct more than once in the same program -- there could be multiple ones gated by different cfg features, but cargo-modules already works on just one feature-set at a time.

regexident commented 1 month ago

Trait impls aren't the real culprit here, impl Foo {} is, of which there can be arbitrarily many. You basically have to use the impl's source code location into account when disambiguating. Which now means you need to deal with file-paths (and if the impl was generated by a macro, then there actually isn't one, other than say the macro's call site, which in turn could have produced multiple impls, re-introducing ambiguity). And even if you're lucky and there is a code source with a corresponding file-path, then you need to obtain the relative path, as otherwise the test suite would generate different snapshots for each person running the tests. And in case of impls from extern crates the file will most-likely actually be in the global ~/.cargo/… cache.

I'm not saying it's impossible, but getting stable identifiers for nodes and edges has been one of the thorniest bits of cargo-modules. And that's without impl.

All the HIR types provided by rust-analyzer (and wrapped by Node) are Eq + Hash, so one could simply create ids for them by adding them to a id_map: HashMap<Node, NodeId> and deriving the corresponding NodeId from the current NodeId(id_map.len()). But that would mean that every time you run cargo modules dependencies … you get a completely different textual output (while it would —fingers crossed— remain visually stable), since each value of NodeId would be non-deterministic. This would make it impossible to use end-to-end snapshot testing. But without a thorough and extensive test suite a project like cargo-modules becomes unmaintainable very quickly. This became apparent very quickly early on.

Since the main problem is with the identifiers one simply use the id-map approach outlined above for the .dot output and could try adding a (probably undocumented) --test-snapshot CLI flag for testing, which instead of producing .dot (which requires unique IDs) returns a snapshot representation of certain graph properties (e.g. list of node labels, list of node in-degrees, list of node out-degrees, list of edge labels) without identifiers. This should result in stable snapshots, at the cost of losing some information in the snapshots, which might be acceptable though. There will be certain transformations however that simply won't be able to get caught with this compromise though.

The one main drawback from a user's perspective with this workaround is that diffing output of cargo modules dependencies … becomes a lost cause due to the non-determinism of ids and thus node/edge ordering.