Open markusicu opened 3 months ago
Discussion in CLDR/ICU design meeting 20240506:
Options
TODO(markus): Talk with ESR to see whether a simplified emoji sort order, without the << distinctions, would be acceptable so that we can move this into the DUCET. This would remove concerns about data size, and it would likely avoid problems with the sifter tool that generates the DUCET data.
Possible simplified sort order: copied from https://github.com/unicode-org/cldr/blob/main/common/collation/root.xml#L953, with the contractions for most of the ZWJ sequences and the expansions for people-holding-hands removed. I kept the keycaps and flags.
I did this manually, for discussion, so it may not be 100% right. And I am not sure about some contractions that make emoji with U+FE0F VARIATION SELECTOR-16 (VS16) sort the same as those without. My hacking is probably inconsistent there.
In the end, I also kept some ZWJ sequence contractions for things like lime, broken link, etc., assuming that we can support a small-ish number of them. (Will need some work in the sifter tool.)
Once we agree on an approach, we will need to modify the generator code and get the real thing. (And once we agree on that, we need to get it into the DUCET input format, unidata.txt.)
For trying this out, either build an ICU RuleBasedCollator for the rules, or paste them into the "Append rules" box of the ICU Collation Demo.
& [last primary ignorable]<<*🦰🦱🦳🦲🏻🏼🏽🏾🏿
& [before 1]\uFDD1€
<*😀😃😄😁😆😅🤣😂🙂🙃🫠😉😊😇
<*🥰😍🤩😘😗☺😚😙🥲
<*😋😛😜🤪😝🤑
<*🤗🤭🫢🫣🤫🤔🫡
<*🤐🤨😐😑😶🫥
< 😶🌫
<*😏😒🙄😬
< 😮💨
<*🤥🫨
< 🙂↔
< 🙂↕
<*😌😔😪🤤😴
<*😷🤒🤕🤢🤮🤧🥵🥶🥴😵
< 😵💫
<*🤯
<*🤠🥳🥸
<*😎🤓🧐
<*😕🫤😟🙁☹😮😯😲😳🥺🥹😦😧😨😰😥😢😭😱😖😣😞😓😩😫🥱
<*😤😡😠🤬😈👿💀☠
<*💩🤡👹👺👻👽👾🤖
<*😺😸😹😻😼😽🙀😿😾
<*🙈🙉🙊
<*💌💘💝💖💗💓💞💕💟❣💔
< ❤🔥 = ❤️🔥
< ❤🩹 = ❤️🩹
<*❤🩷🧡💛💚💙🩵💜🤎🖤🩶🤍
<*💋💯💢💥💫💦💨🕳💬
< 👁🗨 = 👁️🗨
<*🗨🗯💭💤
<*👋🤚🖐✋🖖🫱🫲🫳🫴🫷🫸
<*👌🤌🤏✌🤞🫰🤟🤘🤙
<*👈👉👆🖕👇☝🫵
<*👍👎✊👊🤛🤜
<*👏🙌🫶👐🤲🤝🙏
<*✍💅🤳
<*💪🦾🦿🦵🦶👂🦻👃🧠🫀🫁🦷🦴👀👁👅👄🫦
<*👶🧒👦👧🧑👱👨🧔
<*👩
<*🧓👴👵
<*🙍
<*🙎
<*🙅
<*🙆
<*💁
<*🙋
<*🧏
<*🙇
<*🤦
<*🤷
<*👮
<*🕵
<*💂
<*🥷👷
<*🫅🤴👸👳
<*👲🧕🤵
<*👰
<*🤰🫃🫄🤱
<*👼🎅🤶
<*🦸
<*🦹
<*🧙
<*🧚
<*🧛
<*🧜
<*🧝
<*🧞
<*🧟
<*🧌
<*💆
<*💇
<*🚶
<*🧍
<*🧎
<*🏃
<*💃🕺🕴👯
<*🧖
<*🧗
<*🤺🏇⛷🏂🏌
<*🏄
<*🚣
<*🏊
<*⛹
<*🏋
<*🚴
<*🚵
<*🤸
<*🤼
<*🤽
<*🤾
<*🤹
<*🧘
<*🛀🛌
<*💏
<*💑
<*🗣👤👥🫂👪
<*👣
<*🦰🦱🦳🦲
<*🐵🐒🦍🦧🐶🐕🦮
< 🐕🦺
<*🐩🐺🦊🦝🐱🐈
< 🐈⬛
<*🦁🐯🐅🐆🐴🫎🫏🐎🦄🦓🦌🦬🐮🐂🐃🐄🐷🐖🐗🐽🐏🐑🐐🐪🐫🦙🦒🐘🦣🦏🦛🐭🐁🐀🐹🐰🐇🐿🦫🦔🦇🐻
< 🐻❄
<*🐨🐼🦥🦦🦨🦘🦡🐾
<*🦃🐔🐓🐣🐤🐥🐦🐧🕊🦅🦆🦢🦉🦤🪶🦩🦚🦜🪽
< 🐦⬛
<*🪿
< 🐦🔥
<*🐸
<*🐊🐢🦎🐍🐲🐉🦕🦖
<*🐳🐋🐬🦭🐟🐠🐡🦈🐙🐚🪸🪼
<*🐌🦋🐛🐜🐝🪲🐞🦗🪳🕷🕸🦂🦟🪰🪱🦠
<*💐🌸💮🪷🏵🌹🥀🌺🌻🌼🌷🪻
<*🌱🪴🌲🌳🌴🌵🌾🌿☘🍀🍁🍂🍃🪹🪺🍄
<*🍇🍈🍉🍊🍋
< 🍋🟩
<*🍌🍍🥭🍎🍏🍐🍑🍒🍓🫐🥝🍅🫒🥥
<*🥑🍆🥔🥕🌽🌶🫑🥒🥬🥦🧄🧅🥜🫘🌰🫚🫛
< 🍄🟫
<*🍞🥐🥖🫓🥨🥯🥞🧇🧀🍖🍗🥩🥓🍔🍟🍕🌭🥪🌮🌯🫔🥙🧆🥚🍳🥘🍲🫕🥣🥗🍿🧈🧂🥫
<*🍱🍘🍙🍚🍛🍜🍝🍠🍢🍣🍤🍥🥮🍡🥟🥠🥡
<*🦀🦞🦐🦑🦪
<*🍦🍧🍨🍩🍪🎂🍰🧁🥧🍫🍬🍭🍮🍯
<*🍼🥛☕🫖🍵🍶🍾🍷🍸🍹🍺🍻🥂🥃🫗🥤🧋🧃🧉🧊
<*🥢🍽🍴🥄🔪🫙🏺
<*🌍🌎🌏🌐🗺🗾🧭
<*🏔⛰🌋🗻🏕🏖🏜🏝🏞
<*🏟🏛🏗🧱🪨🪵🛖🏘🏚🏠🏡🏢🏣🏤🏥🏦🏨🏩🏪🏫🏬🏭🏯🏰💒🗼🗽
<*⛪🕌🛕🕍⛩🕋
<*⛲⛺🌁🌃🏙🌄🌅🌆🌇🌉♨🎠🛝🎡🎢💈🎪
<*🚂🚃🚄🚅🚆🚇🚈🚉🚊🚝🚞🚋🚌🚍🚎🚐🚑🚒🚓🚔🚕🚖🚗🚘🚙🛻🚚🚛🚜🏎🏍🛵🦽🦼🛺🚲🛴🛹🛼🚏🛣🛤🛢⛽🛞🚨🚥🚦🛑🚧
<*⚓🛟⛵🛶🚤🛳⛴🛥🚢
<*✈🛩🛫🛬🪂💺🚁🚟🚠🚡🛰🚀🛸
<*🛎🧳
<*⌛⏳⌚⏰⏱⏲🕰🕛🕧🕐🕜🕑🕝🕒🕞🕓🕟🕔🕠🕕🕡🕖🕢🕗🕣🕘🕤🕙🕥🕚🕦
<*🌑🌒🌓🌔🌕🌖🌗🌘🌙🌚🌛🌜🌡☀🌝🌞🪐⭐🌟🌠🌌☁⛅⛈🌤🌥🌦🌧🌨🌩🌪🌫🌬🌀🌈🌂☂☔⛱⚡❄☃⛄☄🔥💧🌊
<*🎃🎄🎆🎇🧨✨🎈🎉🎊🎋🎍🎎🎏🎐🎑🧧🎀🎁🎗🎟🎫
<*🎖🏆🏅🥇🥈🥉
<*⚽⚾🥎🏀🏐🏈🏉🎾🥏🎳🏏🏑🏒🥍🏓🏸🥊🥋🥅⛳⛸🎣🤿🎽🎿🛷🥌
<*🎯🪀🪁🔫🎱🔮🪄🎮🕹🎰🎲🧩🧸🪅🪩🪆♠♥♦♣♟🃏🀄🎴
<*🎭🖼🎨🧵🪡🧶🪢
<*👓🕶🥽🥼🦺👔👕👖🧣🧤🧥🧦👗👘🥻🩱🩲🩳👙👚🪭👛👜👝🛍🎒🩴👞👟🥾🥿👠👡🩰👢🪮👑👒🎩🎓🧢🪖⛑📿💄💍💎
<*🔇🔈🔉🔊📢📣📯🔔🔕
<*🎼🎵🎶🎙🎚🎛🎤🎧📻
<*🎷🪗🎸🎹🎺🎻🪕🥁🪘🪇🪈
<*📱📲☎📞📟📠
<*🔋🪫🔌💻🖥🖨⌨🖱🖲💽💾💿📀🧮
<*🎥🎞📽🎬📺📷📸📹📼🔍🔎🕯💡🔦🏮🪔
<*📔📕📖📗📘📙📚📓📒📃📜📄📰🗞📑🔖🏷
<*💰🪙💴💵💶💷💸💳🧾💹
<*✉📧📨📩📤📥📦📫📪📬📭📮🗳
<*✏✒🖋🖊🖌🖍📝
<*💼📁📂🗂📅📆🗒🗓📇📈📉📊📋📌📍📎🖇📏📐✂🗃🗄🗑
<*🔒🔓🔏🔐🔑🗝
<*🔨🪓⛏⚒🛠🗡⚔💣🪃🏹🛡🪚🔧🪛🔩⚙🗜⚖🦯🔗
< ⛓💥 = ⛓️💥
<*⛓🪝🧰🧲🪜
<*⚗🧪🧫🧬🔬🔭📡
<*💉🩸💊🩹🩼🩺🩻
<*🚪🛗🪞🪟🛏🛋🪑🚽🪠🚿🛁🪤🪒🧴🧷🧹🧺🧻🪣🧼🫧🪥🧽🧯🛒
<*🚬⚰🪦⚱🧿🪬🗿🪧🪪
<*🏧🚮🚰♿🚹🚺🚻🚼🚾🛂🛃🛄🛅
<*⚠🚸⛔🚫🚳🚭🚯🚱🚷📵🔞☢☣
<*⬆↗➡↘⬇↙⬅↖↕↔↩↪⤴⤵🔃🔄🔙🔚🔛🔜🔝
<*🛐⚛🕉✡☸☯✝☦☪☮🕎🔯🪯
<*♈♉♊♋♌♍♎♏♐♑♒♓⛎
<*🔀🔁🔂▶⏩⏭⏯◀⏪⏮🔼⏫🔽⏬⏸⏹⏺⏏🎦🔅🔆📶🛜📳📴
<*♀♂⚧
<*✖➕➖➗🟰♾
<*‼⁉❓❔❕❗〰
<*💱💲
<*⚕♻⚜🔱📛🔰⭕✅☑✔❌❎➰➿〽✳✴❇©®™
< '#⃣' = '#️⃣'
< '*⃣' = '*️⃣'
< 0⃣ = 0️⃣
< 1⃣ = 1️⃣
< 2⃣ = 2️⃣
< 3⃣ = 3️⃣
< 4⃣ = 4️⃣
< 5⃣ = 5️⃣
< 6⃣ = 6️⃣
< 7⃣ = 7️⃣
< 8⃣ = 8️⃣
< 9⃣ = 9️⃣
<*🔟
<*🔠🔡🔢🔣🔤🅰🆎🅱🆑🆒🆓ℹ🆔Ⓜ🆕🆖🅾🆗🅿🆘🆙🆚🈁🈂🈷🈶🈯🉐🈹🈚🈲🉑🈸🈴🈳㊗㊙🈺🈵
<*🔴🟠🟡🟢🔵🟣🟤⚫⚪🟥🟧🟨🟩🟦🟪🟫⬛⬜◼◻◾◽▪▫🔶🔷🔸🔹🔺🔻💠🔘🔳🔲
<*🏁🚩🎌🏴🏳
< 🏳🌈 = 🏳️🌈
< 🏳⚧ = 🏳️⚧
< 🏴☠
<*🇦🇧🇨🇩🇪🇫🇬🇭🇮🇯🇰🇱🇲🇳🇴🇵🇶🇷🇸🇹🇺🇻🇼🇽🇾🇿
< 🏴
< 🏴
< 🏴
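For readers without ICU at hand, here is a rough sketch of how the simplified rules above are meant to behave. It is a toy model in plain Python, not ICU and not the sifter: it only understands `<*` lists, single `<` elements, and `=` alternatives, it ignores resets (`&`) and `<<` lines, and it assigns one primary rank per element. The function names `parse_rules` and `sort_key` and the tiny `RULES` sample are made up for illustration.

```python
# Toy model only: NOT ICU. Primary level only; resets (&) and "<<" lines are ignored.

def parse_rules(rules: str) -> dict[str, int]:
    """Map each rule element to a primary rank, in order of appearance."""
    ranks: dict[str, int] = {}
    next_rank = 0
    for line in rules.splitlines():
        line = line.strip()
        if line.startswith("<*"):
            # "<*abc" is shorthand for "< a < b < c": one rank per code point.
            for ch in line[2:]:
                ranks[ch] = next_rank
                next_rank += 1
        elif line.startswith("<") and not line.startswith("<<"):
            # "< x = y": x and every "=" alternative share a single rank.
            for element in line[1:].split("="):
                ranks[element.strip().strip("'")] = next_rank
            next_rank += 1
    return ranks


def sort_key(text: str, ranks: dict[str, int]) -> list[int]:
    """Build a list of primary ranks using greedy longest-match lookup."""
    longest = max(map(len, ranks), default=1)
    key: list[int] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so multi-code-point elements
        # (contractions such as a ZWJ sequence) win over their first code point.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in ranks:
                key.append(ranks[chunk])
                i += size
                break
        else:
            # Unlisted code point: sort after all listed elements, by code point.
            key.append(len(ranks) + ord(text[i]))
            i += 1
    return key


# Small sample in the style of the rules above (ZWJ and VS16 written as escapes).
RULES = """
<*😀😃😄
<*🍇🍈🍉🍊🍋
< 🍋\u200d🟩
<*🍌🍍
< 0\u20e3 = 0\ufe0f\u20e3
"""

ranks = parse_rules(RULES)
print(sorted(["🍌", "🍋\u200d🟩", "🍋", "😀"], key=lambda s: sort_key(s, ranks)))
```

In this sample the lime ZWJ sequence sorts right after the lemon and before the banana, and the keycap with VS16 gets the same key as the one without. That is the behavior the retained contractions are supposed to give; whether the sifter can produce it is a separate question.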
Goals for this issue:
The DUCET could in principle sort symbols arbitrarily, for example by code point. However, it defines a bespoke sort order: https://www.unicode.org/charts/collation/chart_General-Symbol.html
The DUCET sort order of emoji generally does not group similar emoji together unless they have adjacent code points.
At least one Unicode member organization has bug reports about the sort order of emoji.
UTS51 has long defined a grouping and sort order for emoji:
& [before 1]€
FDD1 20AC; [0D 8A 02, 05, 05] # CURRENCY first primary
CLDR has long included a collation tailoring for this (see above), but it is hard to use.
CLDR has ticket CLDR-10745 “Merge emoji into CLDR root”. If the emoji sort order were built into the default sort order, then it would be always available.
We want the DUCET and CLDR root default sort orders to be the same.
If we agree to move the UTS51 emoji sort order into both default sort orders, then the cleanest way to do so is to modify the DUCET input data file and to extend the code that parses this file and outputs the actual sort order file, so that it handles whatever this requires beyond what it already supports.
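As a small aid for discussing the data side, the weight line quoted in the goals above (`FDD1 20AC; [0D 8A 02, 05, 05] # CURRENCY first primary`) can be split into its parts as follows. This is only a sketch: the layout (hex code points, a semicolon, a bracketed list of comma-separated weight groups, an optional `#` comment) is inferred from that single line, and `parse_line` is a hypothetical helper, not part of any existing tool.

```python
import re

# Assumed layout, inferred from one quoted line only:
#   <hex code points> ; [<group>, <group>, <group>] # optional comment
LINE_RE = re.compile(
    r"^(?P<cps>[0-9A-Fa-f ]+);\s*\[(?P<weights>[^\]]*)\]\s*(?:#\s*(?P<comment>.*))?$"
)


def parse_line(line: str):
    """Return (code points, weight groups, comment) for one data line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    code_points = [int(cp, 16) for cp in m.group("cps").split()]
    # Each comma-separated group is a run of hex values (assumed: one group per level).
    groups = [[int(v, 16) for v in group.split()]
              for group in m.group("weights").split(",")]
    return code_points, groups, (m.group("comment") or "").strip()


print(parse_line("FDD1 20AC; [0D 8A 02, 05, 05] # CURRENCY first primary"))
```

The real input format that the generator consumes may well differ, so treat this as a way to poke at individual lines rather than as a description of the parser.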