Data generation part of intro needs more clarity about icuexport

hsivonen commented 2 years ago

Currently, the text at https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/intro.md#generating-data talks about CLDR but doesn't explain what ICU-exported data is. It should explain that ICU-exported data covers Unicode Database data and CLDR data for collation. That is, the reader shouldn't assume that all CLDR-originating data is bundled for ICU4X use via --cldr-tag.

robertbastian commented 2 years ago

Do you think the new tutorial does this better?

How much do you think the user needs to know? I'd say anything beyond --icuexport-tag=latest is pretty advanced.

hsivonen commented 2 years ago

The new tutorial is much better. Thank you! Three points:

1. It's not linked from https://github.com/unicode-org/icu4x/blob/main/docs/README.md
2. The intro sentence suggests that icu_testdata might be appropriate for an app. Instead, it should make the case more clearly that icu_testdata is for demonstration only, since its set of locales is picked to exercise interesting things.
3. This is more of a datagen issue than a doc issue, but following the doc results in generating data for the search collations. This makes the export needlessly large, because we don't have a search API. In the absence of a search API, it's generally a bad idea to generate the search collation data. (Only a Web browser might want that data, and only for theoretical compatibility with the oddity that search collations are at present exposed via a non-search API.)

hsivonen commented 2 years ago

Filed #2708 about the search collation data.

Data generation part of intro needs more clarity about icuexport #2677