Better document key selection

Manishearth commented 2 years ago

We cover this a little bit in @robertbastian's new data management tutorial, but we probably should improve this: Most users will want to select a subset of keys, and keyextract is one way of doing so, but sometimes you need to slice in ways not covered by keyextract, or are operating in a situation where keyextract won't work (like a wasm build).

We probably should explain keys in more detail somewhere (maybe in the same tutorial, or a separate advanced key management page), and we should also probably document which keys are needed on constructors, including caveats like "you only need calendar-specific keys if you are using that calendar" and "you only need key X if you are planning on using option Y"

cc @zbraniecki @sffc

zbraniecki commented 2 years ago

Additionally, we should improve DX by providing notice information to datagen keyextract on gotcha-scenarios.

I imagine notices like your binary will generate list of keys including all calendar systems. or Your binary will generate only gregorian calendar data. Are you sure your application will never need other calendars?. Your binary only instruments to generate cardinal plural rules. Are you sure you won't need ordinals? etc.

robertbastian commented 2 years ago

Key extraction works fine for wasm.

Manishearth commented 2 years ago

@robertbastian it can't do it based on the JS apis actually being called, no?

sffc commented 2 years ago

A few things here:

With keyextract, there are no "gotchas"; that is the beauty of it. It will not try to remove calendar systems or plural variants if they are reachable by any code path.
Doing this in JS is possible, but it requires a few extra steps: first you need to DCE your Diplomat-generated JS code, then you need to rebuild the WASM file to export only the reachable entrypoints, and then you can run keyextract on that. We don't currently have a tutorial on how to do this, but it should be doable using pre-existing tooling. (Rebuilding the WASM file should be doable via LTO without having to touch Rust code, just like in C.)
We need to think a bit harder about how to support the forced removal of a subset of calendar systems. Currently we support "one or all"; there are a few ways we can think about supporting "two" or "three" or "all but one".
There should not be cases any more where users need to think about "you only need key X if you are planning on using option Y". This is why we have multiple constructors.

zbraniecki commented 2 years ago

With keyextract, there are no "gotchas"; that is the beauty of it. It will not try to remove calendar systems or plural variants if they are reachable by any code path.

Those are gotchas and they can go both ways. I can have my app use AnyDateTimeFormatter and it will package all calendars. I may not know that - a notice to inform me about it, and reference docs that allow me to make an informed decision to tailor it to only Gregorian if I'm certain I only need that should be there. I can have my app only bundle plurals/cardinal keys because I haven't yet add ordinals when I extracted keys. An information about that may help me make a more informed decision.

There should not be cases any more where users need to think about "you only need key X if you are planning on using option Y". This is why we have multiple constructors.

My point is that people may not understand that the choice of constructors impacts the payload size in such detail. We can help them understand that in edge cases.

sffc commented 2 years ago

Yep. Okay. The optimal use of keyextract assumes that developers have written their code in a certain way, utilizing smaller classes (liked TypeDateTimeFormatter) and constructors (like try_new_cardinal) when available. This improves both key selection and code size in general. We should have a tutorial that better explains why one would choose to use the other constructors.

The warning should therefore be something along the lines of, "All calendar systems are being bundled, likely due to the use of DateTimeFormatter. This produces the highest quality i18n. However, to reduce code and data size, consider using TypedDateTimeFormatter."

sffc commented 2 years ago

We should also consider making this a code lint. There's no reason we need to wait until datagen time. We could add some annotation to "big data" functions we export, and the lint would warn when people are using those functions in their code.

robertbastian commented 2 years ago

Ah the code I was looking at compiles a Rust example to WASM instead of using the FFI.

I fully agree with Shane: key extraction guarantees that all data that is mentioned by the compiled code is included. If any of these code paths cannot be reached at runtime, the code should be changed to eliminate unreachable paths. It's a lot easier for clients to change their usage of ICU4X to use more appropriate APIs, than for them to reason about code paths in our code to figure out what keys will actually be required at runtime. We can do this reasoning and provide appropriate APIs, and also keep them stable (users cannot rely on the internal behaviour of our code)[^1].

Now for DTF we have only one or all constructors, and I think that's fine for now; we can add more if/when someone needs them. We should document the code/data size hit on DTF's constructor, and add a lint message. The #[deprecated] attribute seems to be the only way to trigger a lint for just using a method:

#[deprecated(note = "This method is not code- or data-size optimized; it will pull in code paths and
data for all calendar systems. Consider using `TypedDateTimeFormatter` if you can statically determine
which calendar will be used.")]

I think manually specifying keys should be documented as extremely brittle and hacky, and should not be advertised in a tutorial.

[^1]: Stability of key-extract actually requires some thought. It should extract exactly the keys that are declared on a constructor's bounds, even if they are unused. Otherwise data packets using key-extract are not forward compatible. We could have a separate unstable mode that only includes actually used keys but makes the data non-forward compatible.

Manishearth commented 2 years ago

Doing this in JS is possible, but it requires a few extra steps: first you need to DCE your Diplomat-generated JS code, then you need to rebuild the WASM file to export only the reachable entrypoints, and then you can run keyextract on that. We don't currently have a tutorial on how to do this, but it should be doable using pre-existing tooling. (Rebuilding the WASM file should be doable via LTO without having to touch Rust code, just like in C.)

My understanding is that JS tree shaking tooling is somewhat coarse and imperfect. It works well to reduce the burden on the runtime when compared with the expected default of "loading everything", but it doesn't have to be perfect because it's an optimization. On the other hand the expected default of data loading ought to be "loading only what is necessary"; so I would consider any missed opportunity a big.

My general reaction to stuff like this echoes the sentiment in "I Bet That Almost Works": I think this can work, I also think there are plenty of situations where it'll almost work, and I'd rather have the experience of that to still be good.

We should also consider making this a code lint. There's no reason we need to wait until datagen time. We could add some annotation to "big data" functions we export, and the lint would warn when people are using those functions in their code.

This seems unlikely to be the kind of lint clippy would accept, fwiw. (I can explain the reasoning, but in general clippy has a particular bar for lints around 3rd party crates)

The #[deprecated] attribute seems to be the only way to trigger a lint for just using a method

Yeah, this is the only way to do custom lints right now; and the tricky thing is that we don't want to deprecate this.

I think manually specifying keys should be documented as extremely brittle and hacky, and should not be advertised in a tutorial.

So I think at the very least we should give people enough support to verify that keyextract has done the right thing (and a way to work around issues in more complex situations like over WASM where the tooling is imperfect). So we still ought to document key deps somewhere, and perhaps have a separate documentation page about this telling people to avoid manually specifying keys, but also explaining what manually specifying keys is like, and documenting key dependencies.

sffc commented 1 year ago

Discuss with:

@sffc
@Manishearth
@robertbastian

Optional:

@zbraniecki

Manishearth commented 7 months ago

My current position is that we should:

Document what keys are in power user documentation
Document how to manually find out the needed keys for a call (including non-unstable calls)

I no longer think we need to document keys used on a function, as long as the documentation above is sufficient. For code that may "pick" a key, like AnyCalendar, we perhaps should document it a little bit on the function, linking to the main key docs.

In the intervening time since this discussion we've made plans around Dart, etc where our keyextract tooling is not going to be sufficient; exactly the "I Bet That Almost Works" problem I was worried about in the past. We are working closely with the Dart people so this is not knowledge they absolutely need documented, but it would be nice, and it's clear that this is a use case.

robertbastian commented 6 months ago

@robertbastian - we can use binary analysis on the treeshaken dylib in Dart, alternatively we hardcode a lookup table in the datagen binary
@manishearth - we can use reachability analysis tools to calculate this table automatically (I'll investigate this)
@robertbastian - this seems less brittle than static analysis
@mosuem - we can also add any metadata to Dart's ResourceIdentifier
@robertbastian - this would need Diplomat support, which we kind of want to keep independent of ICU4X
@manishearth - we could add a resource_identifier attribute to Diplomat and then check that it's in sync with the table
@sffc - We have keyextract and it works. If it breaks we can fix it then. We don't have to fix it now because it isn't broken and there isn't a timeline for it to break.
@robertbastian - we currently don't notice if it breaks, we need a way to notice.

Conclusion:

Generate data table by running static analysis tool on synthetic binaries containing one FFI symbol each
Have CI to verify the Rust compiler doesn't change behaviour (i.e. the table doesn't change)
Use this table in Dart data filtering
If the compiler breaks or if we want to be careful about the compiler not throwing away keys, we can add a cfg(printf_all_data_keys) (in icu_provider) or something that forces reachability
In the medium term a separate reachability analysis tool may be nice to write if the libraries exist (manish to talk to fitzgen)

unicode-org / icu4x

Better document key selection #2685