rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
94.9k stars 12.23k forks source link

Cannot detect glob re-export's shadowed items in rustdoc JSON #111338

Open obi1kenobi opened 1 year ago

obi1kenobi commented 1 year ago

I tried this code:

mod inner {
    pub struct Foo {};
}

mod other {
    struct Foo;
}

pub use inner::*;

// Adding either of the following two lines
// would shadow `inner::Foo` and
// hide the name `Foo` from the public API.
//
// (1)
// struct Foo;
//
// (2)
// use other::Foo;

I expected to see this happen: when either line (1) or (2) is added, rustdoc JSON files should offer some way to detect that the name Foo is no longer part of the crate's public API.

Instead, this happened:

Just like using a private type in (1), the same problem can be caused by using a #[doc(hidden)] public type, which will also not be emitted in rustdoc JSON unless --document-hidden-items is set.

This means that shadowing glob re-exported items is in general currently impossible to reliably detect via rustdoc JSON, even with --document-private-items and --document-hidden-items. As a result, cargo-semver-checks cannot detect breaking changes caused by such shadowing, which have happened in real life: around a year ago, the opencl3 crate seems to have suffered a regression when a large number of its items were accidentally shadowed.

Relatedly, in #111336 I'm arguing that such shadowing should trigger a lint since it's essentially never what one wants: anything it can productively accomplish can be better accomplished in another way.

I plan to publish a blog post with more details about this in a few hours, it will be available at the following link: https://predr.ag/blog/breaking-semver-in-rust-by-adding-private-type-or-import/

Meta

rustc --version --verbose:

rustc 1.71.0-nightly (39c6804b9 2023-04-19)
binary: rustc
commit-hash: 39c6804b92aa202369e402525cee329556bc1db0
commit-date: 2023-04-19
host: x86_64-unknown-linux-gnu
release: 1.71.0-nightly
LLVM version: 16.0.2

That version of rustc emits rustdoc JSON format version v24.

Possible solution

One option I thought of would be to tweak the rustdoc format to include a list of shadowed names in glob re-exports. This would allow proper name resolution in all cases, and without requiring --document-private-items and --document-hidden-items to be set.

Another option would be to include non-pub imports when --document-private-items is set. This only allows proper name resolution in rustdoc JSON generated with both --document-private-items and --document-hidden-items set.

Perhaps there are other options as well.

Personally, I'd prefer the former option over the latter, since it seems reasonable to expect that rustdoc JSON shouldn't require flags that document explicitly non-public-API components in order to fully describe the public API.

obi1kenobi commented 1 year ago

@rustbot label +T-rustdoc +A-rustdoc-json

obi1kenobi commented 1 year ago

Having done an in-depth exploration of alternatives, requiring rustdoc consumers to do name resolution is extremely not practical. I firmly believe that making glob reexports include a list of shadowed names for each namespace (specifically, "these names look like they should be imported but they aren't usable" which includes imported-but-ambiguous names) is the right design option here.

Properly resolving shadowing is a job for rustc. Otherwise one has to reimplement a substantial chunk of rustc to get it right:

Here's an example that requires all of the above: equivalent playground example

mod first {
    mod a {
        pub(crate) struct Foo(
            // Change `pub(super) bool` to `pub(crate) bool`
            // to cause a breaking change.
            //
            // Just swap which line is commented out and which is included:
            pub(super) bool  // this is fine
            // pub(crate) bool  // this is a breaking change
        );
    }

    mod b {
        pub(crate) struct Foo{}
    }

    pub(crate) use a::*;
    pub(crate) use b::*;  
}

mod second {
    pub struct Foo();
}

use first::*;  // *** NOT A `pub use` ***
pub use second::*;

This crate exports Foo in the values namespace, as the implicit constructor function of second::Foo — and that's it, all other names are unusable due to ambiguity.

If the field at first::a::Foo becomes pub(crate), then:

Even if rustdoc included info about private imports (necessary to determine shadowing), the logic is obscenely difficult to get right. cargo-semver-checks has made me very comfortable with the rustdoc format; even so, I've spent two weeks trying and failing to get a working implementation that nails all the edge cases here.

In my proposed solution, the rustdoc entry for the pub use second::*; item would include a list field that effectively says "type name Foo is not usable here" when the implicit constructor is imported and usable. When the implicit constructor is ambiguous and unusable, the field would list both "type name Foo" and "value name Foo" as unusable. I'm agnostic to the specific naming or structure of these records, so I didn't propose any so we don't accidentally start bikeshedding.

jyn514 commented 1 year ago

IIUC there are actually two separate problems:

  1. Finding the ambiguity requires knowledge about private imports
  2. Finding the ambiguity requires intimate knowledge about how namespaces work (e.g. the existence of Ctor).

So a fix in rustdoc-json actually requires not just a list of the private imports ("unusable names"), but also the namespace they're in — and for that to be useful, you also need to know the namespace of the public imports.

I have several thoughts:

  1. I agree we should fix this in the default format so it doesn't require passing --document-private-items.
  2. Exposing the namespace makes me a little nervous; I don't think we do anything like that today. That said I imagine you're working around it already by hard-coding struct => type ns or something like that. I wonder if we should put that directly in rustdoc_json_types (not the JSON format itself) so each consumer of the library doesn't have to reimplement the mapping. That won't help for "unusable names", we'll still need to expose the namespace (because I don't want to expose the kind for private items), but it should make at least the public side easier.
  3. I am not sure "unusable names" is the right way to address this. I'd rather avoid exposing glob exports in the output altogether, and instead of have a list of only the unambiguous exports - basically the work you're already doing in semver-checks, but without giving you access to private imports.
obi1kenobi commented 1 year ago

Thanks for helping think through the options here, I appreciate it!

re: 2., I'm worried it could be difficult to do — though that makes it all the more valuable if it can be done! The mapping isn't as straightforward as "struct => type ns". For example: unit structs' names go both into the type and the value ns, and tuple structs always go into the type ns but if their Ctor is visible they also go into the value ns. And whether the constructor is visible requires knowing about private fields, etc. If there's a reasonable way to help consumers not have to implement this logic, I'm all for it!

re: 3, a key concern is the size of the rustdoc JSON file. There are crates that when scanned by cargo-semver-checks produce ~220MB of JSON by themselves, without dependencies. Glob exports and unusable names are a way to keep file size down, at the expense of more complex logic for tracking what's exported.

In my proposal, unusable names wouldn't contain names of private items unless they shadow (or make ambiguous) a pub glob's imports. That means each unusable name appears at most once per local definition (either import or new item), and at most once per module where a glob import that causes ambiguity. This is O(n) total and I believe it is the minimum possible.

Perhaps I've misunderstood the details of listing unambiguous exports is, so please excuse me if I'm off base here. As I understand it, I think listing unambiguous exports is O(n^2). Consider a program like:

pub mod a {
    pub mod b {
        pub mod c {
            pub struct Foo;

            pub enum Bar {
                First,
                Second,
            }

            pub use Bar::*;
        }

        pub use c::*;
    }

    pub use b::*;
}

pub use a::*;

and see how many times Foo or First would appear here.

Also consider cyclic imports that create infinite importable paths, like here. To avoid infinite loops in such cases, the unambiguous path listing could use a rule like "not listing paths that visit the same module more than once" (as cargo-semver-checks does), but then it'd be strange that Rust is able to successfully resolve paths that the unambiguous path listing doesn't include.

While my two examples may seem contrived, both cyclic imports and fully public multi-layered glob chains like that are used in popular Rust crates today.

jyn514 commented 1 year ago

re: 3, a key concern is the size of the rustdoc JSON file. There are crates that when scanned by cargo-semver-checks produce ~220MB of JSON by themselves, without dependencies. Glob exports and unusable names are a way to keep file size down, at the expense of more complex logic for tracking what's exported.

hmm, this is a very good point. I want to look into exactly why it's so large at some point (I suspect a lot of it will go away when we get rid of paths), but it's true that getting rid of globs means the output is fundamentally larger. I think if we only refer to items by their ID, without inlining, it might not be much larger in practice? I don't have data to back that up, though.

For example: unit structs' names go both into the type and the value ns, and tuple structs always go into the type ns but if their Ctor is visible they also go into the value ns

Yes, agreed this is a tricky case. I'm not sure what our output looks like for that currently; do we only emit the struct once no matter how many namespaces it's in? If so we probably do need at least to emit directly in the JSON whether the Ctor is private for the helper function to work. That seems doable though, and in the spirit of the existing data we expose, I don't have any reservations about adding it.

obi1kenobi commented 1 year ago

I don't think there's any inlining left anymore. Foreign traits implemented by local types are no longer inlined, and crate items are only emitted once even if they are exported by multiple namespaces, leaving the name resolution portion to the user. This is obviously a bit tricky on the user side, but IMHO is the right choice given the constraints.

I don't think paths contributes significantly to file size, to be honest. If optimizing for size, I'd start by leaning more heavily into omitting default representation like None and [] and {} and assuming that something like serde will fill in the blanks. The 220MB JSON output is just because that crate (aws-sdk-ec2) is gigantic — 240k autogenerated items, ~17x more than syn with the full feature.

I think exporting unusable names would make emitting whether the Ctor is public unnecessary (though I'm not opposed to it!). Ctor visibility resolution is only necessary to resolve possible glob import shadowing or ambiguity, because there the flavors of restricted visibility matter as in my example two posts ago. For merely public / non-public levels, it's enough to look at the type's field visibilities which doesn't actually require private items: if fields were stripped, the Ctor is private too.

jyn514 commented 1 year ago

I think inlining is the wrong word - I was imagining a single JSON object for all imports in the same module, with kind: use and a list of the IDs that were imported. That shouldn't be too much larger than the glob imports I think.

I agree we should either do my ctor visibility idea or your unusable imports idea but not both.

obi1kenobi commented 1 year ago

Oh I see, interesting idea. I'll need to think about it a bit more.

On Sat, May 13, 2023, 2:40 PM jyn @.***> wrote:

I think inlining is the wrong word - I was imagining a single JSON object for all imports in the same module, with kind: use and a list of the IDs that were imported. That shouldn't be too much larger than the glob imports I think.

I agree we should either do my ctor visibility idea or your unusable imports idea but not both.

— Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/111338#issuecomment-1546727986, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAR5MSXC2AZ3WAI2FZ3EEN3XF7IRPANCNFSM6AAAAAAXZNSXEY . You are receiving this because you authored the thread.Message ID: @.***>

obi1kenobi commented 1 year ago

A few points worth chatting about that I came up with: