unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

C/C++ FFI for BakedDataProvider #2743

Closed makotokato closed 1 year ago

makotokato commented 2 years ago

icu_capi cannot handle BakedDataProvider directly. Is there a way to handle it without modifying icu_capi?

Of course, we can handle it if we modify icu_capi like the following.

#[cfg(feature = "any_provider")]
include!(concat!(env!("CARGO_MANIFEST_DIR"), "/data/mod.rs"));
#[cfg(feature = "any_provider")]
include!(concat!(env!("CARGO_MANIFEST_DIR"), "/data/any.rs"));

#[diplomat::rust_link(BakedDataProvider, Struct)]
pub fn create_baked() -> Box<ICU4XDataProvider> {
    #[cfg(not(feature = "any_provider"))]
    panic!("Requires feature 'any_provider'");

    #[cfg(feature = "any_provider")]
    convert_any_provider(BakedDataProvider)
sffc commented 2 years ago

CC @Manishearth @robertbastian. We were discussing this.

Manishearth commented 2 years ago

Yeah so there are two ways to do this:

The second is the preferred option. In general users are likely to want to depend on a shim crate like icu_capi_staticlib instead of directly on icu_capi, and you can put stuff in manually there.

In the long run, we would like datagen to autogenerate this code and the headers for it[^1], but it doesn't right now

[^1]: Just for the baked provider construction API; it should assume the existence of the other headers.

sffc commented 2 years ago

A few sub-issues:

  1. Document status quo
  2. Implement better solution (start with C)
  3. Integrate with Diplomat

Requires knowledge of how our FFI works.

sffc commented 1 year ago

@hsivonen had some more ideas here, such as a nullptr data provider indicating that data should instead be loaded from a BakedDataProvider that never itself gets exposed over FFI.

hsivonen commented 1 year ago

I think the key observations are:

  1. If the app knows it's going to only use a BakedDataProvider, the provider reference doesn't need to cross FFI. Instead, the generated part of icu_capi could materialize a reference to the BakedDataProvider at the points where a provider is passed to the Rust side's _unstable constructor.
  2. If the first point happens and also the app uses its own C++ types for spans and strings, the app needs its own C++ header generation run anyway, at which point it would make sense to be able to request that Diplomat omit the provider argument in the generated C++.
  3. Even if the first and second point happen, it might still make sense to keep a meaningless provider argument on the C layer. This way, it would be possible for the app to incorporate third-party code that uses the ICU4X FFI and assumes the upstream ICU4X C ABI. (These uses would go to the BakedDataProvider per the first point regardless of what the third-party code passes as the provider.)
sffc commented 1 year ago

I think either myself or @robertbastian should take this issue.

robertbastian commented 1 year ago

Without diving into this, I don't think a null pointer is the right API design in 2022.

I would like to explore extern "Rust" to separate icu_capi from the baked data.

Manishearth commented 1 year ago
  • If the app knows it's going to only use a BakedDataProvider, the provider reference doesn't need to cross FFI. Instead, the generated part of icu_capi could materialize a reference to the BakedDataProvider at the points where a provider is passed to the Rust side's _unstable constructor.

So this seems pretty invasive to me. I think a very important thing to consider is how we can do this in a way that makes sense generally from Diplomat's POV.

Perhaps via a #[diplomat::attr(inject-mode, inject(some rust expression))] where we annotate each provider parameter with it, and when diplomat is run with inject-mode it generates APIs without the parameter, and instead initializes them from the injected code.

Note that from Diplomat's POV it would still need to be an instance of ICU4XDataProvider, we can't get rid of that indirection that easily. I guess it could be inject(some rust expression, some rust type) where the type is also replaced.

You'd also still need to support a way to inject the baked data into icu_capi. I suspect it can be done with some CFGs that require the baked data be found in some folder or the other.

In general I'm wary of jumping towards a solution in Diplomat for stuff like this, because it often feels kinda niche and while I definitely want to add Diplomat features I want them to be generally useful so that we can design them right.


There's another solution which I find to be cleaner: What if we didn't change Diplomat at all, but icu_capi's ICU4XDataProvider had a fourth Baked option, that when enabled requires baked data to exist in some folder (which we need to do anyway!)? Then users can disable any/buffer, and just set up the baked variant. This still involves the baked data provider crossing FFI, however it doesn't matter since if all other features are disabled, the impl of DataProvider for ICU4XDataProviderInner will always hit the baked provider (especially if we get rid of the Empty variant too), and this will all monomorphize correctly.

This can be done purely in CFG magic, and we'd add a configurable ICU4XDataProvider::new_baked()¹. The main tricky thing is figuring out how best to let people inject their own baked code (CFG + include!() + env!()?) but that's a problem inherent to this problem space.

¹ A thing I do want to do is make it possible for Diplomat to also respect CFGs and not generate headers in these cases.

Manishearth commented 1 year ago

@robertbastian

I would like to explore extern "Rust" to separate icu_capi from the baked data.

What do you mean by this?

Manishearth commented 1 year ago

Oh, also, a thing that does not solve the monomorphization problem, but does solve the problem of using baked providers, is cross-crate Diplomat support, which I also want to get.

Manishearth commented 1 year ago

Technically we could implement the injection thing by doing this, btw:

fn try_new(#[cfg(not(baked))] provider: &..., options: ... ) -> ... {
    #[cfg(baked)] let provider = BakedProvider;

}

with the caveat that Diplomat does not yet support CFGs (but I plan to make that happen)

sffc commented 1 year ago

What if we didn't change Diplomat at all, but icu_capi's ICU4XDataProvider had a fourth Baked option, that when enabled requires baked data to exist in some folder

+1

Manishearth commented 1 year ago

@sffc @robertbastian and I decided today to go with the model in https://github.com/unicode-org/icu4x/pull/2992 (as opposed to https://github.com/unicode-org/icu4x/pull/2947), because it maintains the same model across Rust and FFI.

The basic idea is that we have a globaldata (name TBD) crate that contains a bunch of Rust code in macros, covering all baked testdata. Individual components use a macro to import the Rust code for the relevant keys, and expose try_new_with_globaldata() (name TBD) functions, behind a globaldata feature. FFI does something similar. Ideally components only macro-include the data for their own component, and if they need data from other components we can either doc(hidden) export LocalBakedData from each component and fork the provider, or we can use try_new() functions for that module. It's all internal, the precise strategy can be figured out.

globaldata can be replaced with a custom baked rust source using an env var.


There is one other model that we explored a bit but decided not to go for at the moment. It will be possible to do in the future and we can do it when someone needs it.

The main benefit it has is that it gives FFI the flexibility of having multiple subsets of baked data: with the globaldata model you either have all data, or you have one defined subset that's defined for the entire compilation tree. You can't do multiple.

However, a model we can also have is one where datagen is capable of generating baked data that additionally:

The final staticlib must depend on all such crates for compilation to work (this is not a problem if you're doing rlib-to-rlib-to-native linking the way Blaze does; you can continue doing that). But this gives FFI the level of flexibility pure Rust code already has when it comes to baked data.

Writing this down so that in the future if this need crops up, we have a path we can take.

robertbastian commented 1 year ago

The approach of using macros to store tokens doesn't seem to work. Paths in macros are validated somehow, so even if the macro is unused, the compiler complains about use of undeclared crate or module.

Manishearth commented 1 year ago

Can you push up some code? I wonder if this has to do with hygeine. We may need to pass the crate names in as macro arguments and reexport them from the root.

robertbastian commented 1 year ago

Yes we'll need to pass all crate names to the macro, which we can actually do because we have the CrateEnv, however the ergonomics are a bit shit.

robertbastian commented 1 year ago

Code in #3449

sffc commented 1 year ago

Due to crates.io size limits, we need multiple smaller globaldata crates, so I think we should start with one data crate per component crate.

I put minimal value on ergonomics that primarily impact ICU4X engineers.

robertbastian commented 1 year ago

I don't think the size limit is something that should affect our design, it's straightforward to get an exception. I don't want to end up with 15 data crates that all need to be replaced individually if they are vendored.

sffc commented 1 year ago

It's not just the crates.io limit. Smaller globaldata crates means that people who only want to build individual components of ICU4X don't need to download/vendor all data for all components, which is going to get quite large. It might also have a marginal improvement on build times, since the compiler doesn't need to tokenize all the macros we're not using.

(I thought we were already aligned on the idea of multiple smaller crates)

robertbastian commented 1 year ago

False alarm! The errors were coming from the backward-compatibility macro invocation that databake generates. I've updated the POC.

robertbastian commented 1 year ago

Name bikeshed!

Manishearth commented 1 year ago

It's also global as in global state I think :smile:

robertbastian commented 1 year ago

Note: All modern locales in one crate is 5.1MB for stable keys, 6.3MB for stable + experimental.

robertbastian commented 1 year ago

I'm going to close this in favour of #2945 as it's basically discussing the same things, and the solution we're working on there trivially extends to FFI.