Closed makotokato closed 1 year ago
CC @Manishearth @robertbastian. We were discussing this.
Yeah so there are two ways to do this:
icu_capi_staticlib
, that contains this code. It does not use Diplomat, instead it uses a manual extern "C"
function that returns a *mut ICU4XDataProvider
. You can recoup the ICU4XDataProvider C++ object on the other side using the ICU4XDataProvider
constructor from a capi::ICU4XDataProvider *
. The second is the preferred option. In general users are likely to want to depend on a shim crate like icu_capi_staticlib instead of directly on icu_capi, and you can put stuff in manually there.
In the long run, we would like datagen to autogenerate this code and the headers for it[^1], but it doesn't right now
[^1]: Just for the baked provider construction API; it should assume the existence of the other headers.
A few sub-issues:
Requires knowledge of how our FFI works.
@hsivonen had some more ideas here, such as a nullptr data provider indicating that data should instead be loaded from a BakedDataProvider that never itself gets exposed over FFI.
I think the key observations are:
BakedDataProvider
, the provider reference doesn't need to cross FFI. Instead, the generated part of icu_capi could materialize a reference to the BakedDataProvider
at the points where a provider is passed to the Rust side's _unstable
constructor.BakedDataProvider
per the first point regardless of what the third-party code passes as the provider.)I think either myself or @robertbastian should take this issue.
Without diving into this, I don't think a null pointer is the right API design in 2022.
I would like to explore extern "Rust"
to separate icu_capi
from the baked data.
- If the app knows it's going to only use a
BakedDataProvider
, the provider reference doesn't need to cross FFI. Instead, the generated part of icu_capi could materialize a reference to theBakedDataProvider
at the points where a provider is passed to the Rust side's_unstable
constructor.
So this seems pretty invasive to me. I think a very important thing to consider is how we can do this in a way that makes sense generally from Diplomat's POV.
Perhaps via a #[diplomat::attr(inject-mode, inject(some rust expression))]
where we annotate each provider parameter with it, and when diplomat is run with inject-mode
it generates APIs without the parameter, and instead initializes them from the injected code.
Note that from Diplomat's POV it would still need to be an instance of ICU4XDataProvider, we can't get rid of that indirection that easily. I guess it could be inject(some rust expression, some rust type)
where the type is also replaced.
You'd also still need to support a way to inject the baked data into icu_capi
. I suspect it can be done with some CFGs that require the baked data be found in some folder or the other.
In general I'm wary of jumping towards a solution in Diplomat for stuff like this, because it often feels kinda niche and while I definitely want to add Diplomat features I want them to be generally useful so that we can design them right.
There's another solution which I find to be cleaner: What if we didn't change Diplomat at all, but icu_capi's ICU4XDataProvider had a fourth Baked option, that when enabled requires baked data to exist in some folder (which we need to do anyway!)? Then users can disable any/buffer, and just set up the baked variant. This still involves the baked data provider crossing FFI, however it doesn't matter since if all other features are disabled, the impl of DataProvider
for ICU4XDataProviderInner
will always hit the baked provider (especially if we get rid of the Empty variant too), and this will all monomorphize correctly.
This can be done purely in CFG magic, and we'd add a configurable ICU4XDataProvider::new_baked()
¹. The main tricky thing is figuring out how best to let people inject their own baked code (CFG + include!()
+ env!()
?) but that's a problem inherent to this problem space.
¹ A thing I do want to do is make it possible for Diplomat to also respect CFGs and not generate headers in these cases.
@robertbastian
I would like to explore
extern "Rust"
to separateicu_capi
from the baked data.
What do you mean by this?
Oh, also, a thing that does not solve the monomorphization problem, but does solve the problem of using baked providers, is cross-crate Diplomat support, which I also want to get.
Technically we could implement the injection thing by doing this, btw:
fn try_new(#[cfg(not(baked))] provider: &..., options: ... ) -> ... {
#[cfg(baked)] let provider = BakedProvider;
}
with the caveat that Diplomat does not yet support CFGs (but I plan to make that happen)
What if we didn't change Diplomat at all, but icu_capi's ICU4XDataProvider had a fourth Baked option, that when enabled requires baked data to exist in some folder
+1
@sffc @robertbastian and I decided today to go with the model in https://github.com/unicode-org/icu4x/pull/2992 (as opposed to https://github.com/unicode-org/icu4x/pull/2947), because it maintains the same model across Rust and FFI.
The basic idea is that we have a globaldata
(name TBD) crate that contains a bunch of Rust code in macros, covering all baked testdata. Individual components use a macro to import the Rust code for the relevant keys, and expose try_new_with_globaldata()
(name TBD) functions, behind a globaldata
feature. FFI does something similar. Ideally components only macro-include the data for their own component, and if they need data from other components we can either doc(hidden)
export LocalBakedData
from each component and fork the provider, or we can use try_new()
functions for that module. It's all internal, the precise strategy can be figured out.
globaldata
can be replaced with a custom baked rust source using an env var.
There is one other model that we explored a bit but decided not to go for at the moment. It will be possible to do in the future and we can do it when someone needs it.
The main benefit it has is that it gives FFI the flexibility of having multiple subsets of baked data: with the globaldata
model you either have all data, or you have one defined subset that's defined for the entire compilation tree. You can't do multiple.
However, a model we can also have is one where datagen is capable of generating baked data that additionally:
icu_capi
ICU4XDataProvider
constructed from the baked dataThe final staticlib must depend on all such crates for compilation to work (this is not a problem if you're doing rlib-to-rlib-to-native linking the way Blaze does; you can continue doing that). But this gives FFI the level of flexibility pure Rust code already has when it comes to baked data.
Writing this down so that in the future if this need crops up, we have a path we can take.
The approach of using macros to store tokens doesn't seem to work. Paths in macros are validated somehow, so even if the macro is unused, the compiler complains about use of undeclared crate or module
.
Can you push up some code? I wonder if this has to do with hygeine. We may need to pass the crate names in as macro arguments and reexport them from the root.
Yes we'll need to pass all crate names to the macro, which we can actually do because we have the CrateEnv
, however the ergonomics are a bit shit.
Code in #3449
Due to crates.io size limits, we need multiple smaller globaldata crates, so I think we should start with one data crate per component crate.
I put minimal value on ergonomics that primarily impact ICU4X engineers.
I don't think the size limit is something that should affect our design, it's straightforward to get an exception. I don't want to end up with 15 data crates that all need to be replaced individually if they are vendored.
It's not just the crates.io limit. Smaller globaldata crates means that people who only want to build individual components of ICU4X don't need to download/vendor all data for all components, which is going to get quite large. It might also have a marginal improvement on build times, since the compiler doesn't need to tokenize all the macros we're not using.
(I thought we were already aligned on the idea of multiple smaller crates)
False alarm! The errors were coming from the backward-compatibility macro invocation that databake generates. I've updated the POC.
Name bikeshed!
data
, which Manish pointed out is very generic.builtin-data
is a bit more specific, but isn't actually correct if the user provides their own data by environment variablemodern-data
is another suggestion, as it contains the modern
CLDR set, but has the same problem with BYODno-provider
somewhat describes the type of constructors it unlocks, but it's a negative so weird as a feature namebaked-data
is my favourite I think, it's a more precise version of builtin-data
that conveys only the format, not where the data actually comes fromIt's also global as in global state I think :smile:
Note: All modern locales in one crate is 5.1MB for stable keys, 6.3MB for stable + experimental.
I'm going to close this in favour of #2945 as it's basically discussing the same things, and the solution we're working on there trivially extends to FFI.
icu_capi
cannot handleBakedDataProvider
directly. Is there a way to handle it without modifyingicu_capi
?Of course, we can handle it if we modify
icu_capi
like the following.