unicode-org / unicodetools

home of unicodetools and https://util.unicode.org JSPs
https://util.unicode.org

emoji sort order in the DUCET #773

Open markusicu opened 3 months ago

markusicu commented 3 months ago

Goals for this issue:

  1. Agree on whether to move the UTS51 emoji sort order into the DUCET
  2. If agreed: Figure out how to do it

The DUCET could in principle sort symbols arbitrarily, for example by code point. However, it defines a bespoke sort order: https://www.unicode.org/charts/collation/chart_General-Symbol.html

The DUCET sort order of emoji generally does not group similar emoji together unless they have adjacent code points.

At least one Unicode member organization has bug reports about the sort order of emoji.

UTS51 has long defined a grouping and sort order for emoji:

CLDR has long included a collation tailoring for this (see above), but it is hard to use.

CLDR has ticket CLDR-10745 “Merge emoji into CLDR root”. If the emoji sort order were built into the default sort order, then it would be always available.

We want the DUCET and CLDR root default sort orders to be the same.

If we agree to move the UTS51 emoji sort order into both default sort orders, then the cleanest way to do so is to modify the DUCET input data file, and to extend the code that parses this file and outputs the actual sort order file so that it handles whatever this requires beyond what it already supports.

markusicu commented 3 months ago

Discussion in CLDR/ICU design meeting 20240506:

Options

  1. Sort emoji as in the ESR ordering in root=DUCET
    1. Concern: The ordering has many contractions which could noticeably increase the size of the root collation data
  2. Sort emoji as in the ESR ordering in an optional variant of root=DUCET
    1. Goal: Mitigate size increase of root for people who are more sensitive to size than emoji order
    2. Orthogonal to the existing optional variant of unihan (implicit vs. radical-stroke)
    3. Would require either a build-time switch that also builds potentially different tailoring binaries for different root tables, or engineering to make sure that pre-built tailorings work with any of the root variants (e.g., affecting disjoint sets of characters)
  3. Sort emoji in root=DUCET with a space-optimized sort order (e.g., singer sorts as “person” followed by “microphone”)
    1. Work to create a new, compromise sort order with fewer contractions, then get that into root=DUCET
  4. No change – keep it in a hard-to-use tailoring

markusicu commented 3 months ago

TODO(markus): Talk with ESR, see if it would be acceptable to use a simplified emoji sort order without the << distinctions in order to move this into the DUCET? This would remove concerns about data size, and it would likely avoid problems with the sifter tool that generates the DUCET data.
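To make option 3 from the list above concrete, here is a minimal Python sketch of what "singer sorts as person followed by microphone" means when a ZWJ sequence has no contraction of its own. The ranks are hypothetical and only illustrate relative order, `sort_key` is an illustrative helper (not an ICU or unicodetools API), and the sketch assumes that ZWJ and VS16 carry no weight.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16
IGNORED = {ZWJ, VS16}  # assumption: these contribute nothing to the key

PERSON = "\U0001F9D1"      # person
MAN = "\U0001F468"         # man
WOMAN = "\U0001F469"       # woman
MICROPHONE = "\U0001F3A4"  # microphone

# Hypothetical primary ranks; only the relative order matters here.
RANK = {ch: i for i, ch in enumerate([PERSON, MAN, WOMAN, MICROPHONE])}


def sort_key(text):
    """Build a key from the per-character ranks, skipping ignored characters."""
    return tuple(RANK[ch] for ch in text if ch not in IGNORED)


SINGER = PERSON + ZWJ + MICROPHONE       # person ZWJ microphone
WOMAN_SINGER = WOMAN + ZWJ + MICROPHONE  # woman ZWJ microphone

ordered = sorted([WOMAN_SINGER, WOMAN, SINGER, PERSON], key=sort_key)
print(ordered == [PERSON, SINGER, WOMAN, WOMAN_SINGER])  # prints True
```

Each singer lands directly after its base person emoji without any extra data for the sequence itself. The cost is that singers are no longer grouped together, which is the compromise option 3 describes.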

markusicu commented 2 months ago

Possible simplified sort order. Copied from https://github.com/unicode-org/cldr/blob/main/common/collation/root.xml#L953 and then removed contractions for most of the ZWJ sequences, and expansions for people-holding-hands. I kept the keycaps and flags.

I did this manually, for discussion, so it may not be 100% right. And I am not sure about some contractions that make emoji with U+FE0F VARIATION SELECTOR-16 (VS16) sort the same as those without. My hacking is probably inconsistent there.

In the end, I also kept some ZWJ sequence contractions for things like lime, broken link, etc., assuming that we can support a small-ish number of them. (Will need some work in the sifter tool.)
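To get a rough feel for how many contractions a candidate rule set still needs, one could count the rule entries that consist of more than one code point. The following is a minimal Python sketch under simplifying assumptions: `contraction_entries` is a hypothetical helper, it only understands the `<*`, `<` and `=` forms used in this proposal, and the three sample lines are a small excerpt written with escapes so that the invisible characters are explicit.

```python
def contraction_entries(lines):
    """Return rule entries that contain more than one code point."""
    found = []
    for line in lines:
        line = line.strip()
        # Starred lists hold single characters; "&" lines are resets.
        if line.startswith("<*") or not line.startswith("<"):
            continue
        for entry in line[1:].split("="):
            entry = entry.strip().strip("'")  # drop quotes around syntax chars
            if len(entry) > 1:  # len() counts code points in Python 3
                found.append(entry)
    return found


sample = [
    "< \U0001F415\u200d\U0001F9BA",   # dog ZWJ safety vest (service dog)
    "< 0\u20e3 = 0\ufe0f\u20e3",      # keycap 0, without and with VS16
    "<*\U0001F429\U0001F43A",         # poodle, wolf: single characters only
]

print(len(contraction_entries(sample)))  # prints 3
```

Note that each `=` alternative with VS16 counts as its own entry, so the VS16 variants roughly double the number for the sequences that have them.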

Once we agree on an approach, we will need to modify the generator code and get the real thing. (And once we agree on that, we need to get it into the DUCET input format, unidata.txt.)

For trying this out, either build an ICU RuleBasedCollator for the rules, or paste them into the "Append rules" box of the ICU Collation Demo.

& [last primary ignorable]<<*🦰🦱🦳🦲🏻🏼🏽🏾🏿
& [before 1]\uFDD1€
<*😀😃😄😁😆😅🤣😂🙂🙃🫠😉😊😇
<*🥰😍🤩😘😗☺😚😙🥲
<*😋😛😜🤪😝🤑
<*🤗🤭🫢🫣🤫🤔🫡
<*🤐🤨😐😑😶🫥
< 😶‍🌫
<*😏😒🙄😬
< 😮‍💨
<*🤥🫨
< 🙂‍↔
< 🙂‍↕
<*😌😔😪🤤😴
<*😷🤒🤕🤢🤮🤧🥵🥶🥴😵
< 😵‍💫
<*🤯
<*🤠🥳🥸
<*😎🤓🧐
<*😕🫤😟🙁☹😮😯😲😳🥺🥹😦😧😨😰😥😢😭😱😖😣😞😓😩😫🥱
<*😤😡😠🤬😈👿💀☠
<*💩🤡👹👺👻👽👾🤖
<*😺😸😹😻😼😽🙀😿😾
<*🙈🙉🙊
<*💌💘💝💖💗💓💞💕💟❣💔
< ❤‍🔥 = ❤️‍🔥
< ❤‍🩹 = ❤️‍🩹
<*❤🩷🧡💛💚💙🩵💜🤎🖤🩶🤍
<*💋💯💢💥💫💦💨🕳💬
< 👁‍🗨 = 👁️‍🗨
<*🗨🗯💭💤
<*👋🤚🖐✋🖖🫱🫲🫳🫴🫷🫸
<*👌🤌🤏✌🤞🫰🤟🤘🤙
<*👈👉👆🖕👇☝🫵
<*👍👎✊👊🤛🤜
<*👏🙌🫶👐🤲🤝🙏
<*✍💅🤳
<*💪🦾🦿🦵🦶👂🦻👃🧠🫀🫁🦷🦴👀👁👅👄🫦
<*👶🧒👦👧🧑👱👨🧔
<*👩
<*🧓👴👵
<*🙍
<*🙎
<*🙅
<*🙆
<*💁
<*🙋
<*🧏
<*🙇
<*🤦
<*🤷
<*👮
<*🕵
<*💂
<*🥷👷
<*🫅🤴👸👳
<*👲🧕🤵
<*👰
<*🤰🫃🫄🤱
<*👼🎅🤶
<*🦸
<*🦹
<*🧙
<*🧚
<*🧛
<*🧜
<*🧝
<*🧞
<*🧟
<*🧌
<*💆
<*💇
<*🚶
<*🧍
<*🧎
<*🏃
<*💃🕺🕴👯
<*🧖
<*🧗
<*🤺🏇⛷🏂🏌
<*🏄
<*🚣
<*🏊
<*⛹
<*🏋
<*🚴
<*🚵
<*🤸
<*🤼
<*🤽
<*🤾
<*🤹
<*🧘
<*🛀🛌
<*💏
<*💑
<*🗣👤👥🫂👪
<*👣
<*🦰🦱🦳🦲
<*🐵🐒🦍🦧🐶🐕🦮
< 🐕‍🦺
<*🐩🐺🦊🦝🐱🐈
< 🐈‍⬛
<*🦁🐯🐅🐆🐴🫎🫏🐎🦄🦓🦌🦬🐮🐂🐃🐄🐷🐖🐗🐽🐏🐑🐐🐪🐫🦙🦒🐘🦣🦏🦛🐭🐁🐀🐹🐰🐇🐿🦫🦔🦇🐻
< 🐻‍❄
<*🐨🐼🦥🦦🦨🦘🦡🐾
<*🦃🐔🐓🐣🐤🐥🐦🐧🕊🦅🦆🦢🦉🦤🪶🦩🦚🦜🪽
< 🐦‍⬛
<*🪿
< 🐦‍🔥
<*🐸
<*🐊🐢🦎🐍🐲🐉🦕🦖
<*🐳🐋🐬🦭🐟🐠🐡🦈🐙🐚🪸🪼
<*🐌🦋🐛🐜🐝🪲🐞🦗🪳🕷🕸🦂🦟🪰🪱🦠
<*💐🌸💮🪷🏵🌹🥀🌺🌻🌼🌷🪻
<*🌱🪴🌲🌳🌴🌵🌾🌿☘🍀🍁🍂🍃🪹🪺🍄
<*🍇🍈🍉🍊🍋
< 🍋‍🟩
<*🍌🍍🥭🍎🍏🍐🍑🍒🍓🫐🥝🍅🫒🥥
<*🥑🍆🥔🥕🌽🌶🫑🥒🥬🥦🧄🧅🥜🫘🌰🫚🫛
< 🍄‍🟫
<*🍞🥐🥖🫓🥨🥯🥞🧇🧀🍖🍗🥩🥓🍔🍟🍕🌭🥪🌮🌯🫔🥙🧆🥚🍳🥘🍲🫕🥣🥗🍿🧈🧂🥫
<*🍱🍘🍙🍚🍛🍜🍝🍠🍢🍣🍤🍥🥮🍡🥟🥠🥡
<*🦀🦞🦐🦑🦪
<*🍦🍧🍨🍩🍪🎂🍰🧁🥧🍫🍬🍭🍮🍯
<*🍼🥛☕🫖🍵🍶🍾🍷🍸🍹🍺🍻🥂🥃🫗🥤🧋🧃🧉🧊
<*🥢🍽🍴🥄🔪🫙🏺
<*🌍🌎🌏🌐🗺🗾🧭
<*🏔⛰🌋🗻🏕🏖🏜🏝🏞
<*🏟🏛🏗🧱🪨🪵🛖🏘🏚🏠🏡🏢🏣🏤🏥🏦🏨🏩🏪🏫🏬🏭🏯🏰💒🗼🗽
<*⛪🕌🛕🕍⛩🕋
<*⛲⛺🌁🌃🏙🌄🌅🌆🌇🌉♨🎠🛝🎡🎢💈🎪
<*🚂🚃🚄🚅🚆🚇🚈🚉🚊🚝🚞🚋🚌🚍🚎🚐🚑🚒🚓🚔🚕🚖🚗🚘🚙🛻🚚🚛🚜🏎🏍🛵🦽🦼🛺🚲🛴🛹🛼🚏🛣🛤🛢⛽🛞🚨🚥🚦🛑🚧
<*⚓🛟⛵🛶🚤🛳⛴🛥🚢
<*✈🛩🛫🛬🪂💺🚁🚟🚠🚡🛰🚀🛸
<*🛎🧳
<*⌛⏳⌚⏰⏱⏲🕰🕛🕧🕐🕜🕑🕝🕒🕞🕓🕟🕔🕠🕕🕡🕖🕢🕗🕣🕘🕤🕙🕥🕚🕦
<*🌑🌒🌓🌔🌕🌖🌗🌘🌙🌚🌛🌜🌡☀🌝🌞🪐⭐🌟🌠🌌☁⛅⛈🌤🌥🌦🌧🌨🌩🌪🌫🌬🌀🌈🌂☂☔⛱⚡❄☃⛄☄🔥💧🌊
<*🎃🎄🎆🎇🧨✨🎈🎉🎊🎋🎍🎎🎏🎐🎑🧧🎀🎁🎗🎟🎫
<*🎖🏆🏅🥇🥈🥉
<*⚽⚾🥎🏀🏐🏈🏉🎾🥏🎳🏏🏑🏒🥍🏓🏸🥊🥋🥅⛳⛸🎣🤿🎽🎿🛷🥌
<*🎯🪀🪁🔫🎱🔮🪄🎮🕹🎰🎲🧩🧸🪅🪩🪆♠♥♦♣♟🃏🀄🎴
<*🎭🖼🎨🧵🪡🧶🪢
<*👓🕶🥽🥼🦺👔👕👖🧣🧤🧥🧦👗👘🥻🩱🩲🩳👙👚🪭👛👜👝🛍🎒🩴👞👟🥾🥿👠👡🩰👢🪮👑👒🎩🎓🧢🪖⛑📿💄💍💎
<*🔇🔈🔉🔊📢📣📯🔔🔕
<*🎼🎵🎶🎙🎚🎛🎤🎧📻
<*🎷🪗🎸🎹🎺🎻🪕🥁🪘🪇🪈
<*📱📲☎📞📟📠
<*🔋🪫🔌💻🖥🖨⌨🖱🖲💽💾💿📀🧮
<*🎥🎞📽🎬📺📷📸📹📼🔍🔎🕯💡🔦🏮🪔
<*📔📕📖📗📘📙📚📓📒📃📜📄📰🗞📑🔖🏷
<*💰🪙💴💵💶💷💸💳🧾💹
<*✉📧📨📩📤📥📦📫📪📬📭📮🗳
<*✏✒🖋🖊🖌🖍📝
<*💼📁📂🗂📅📆🗒🗓📇📈📉📊📋📌📍📎🖇📏📐✂🗃🗄🗑
<*🔒🔓🔏🔐🔑🗝
<*🔨🪓⛏⚒🛠🗡⚔💣🪃🏹🛡🪚🔧🪛🔩⚙🗜⚖🦯🔗
< ⛓‍💥 = ⛓️‍💥
<*⛓🪝🧰🧲🪜
<*⚗🧪🧫🧬🔬🔭📡
<*💉🩸💊🩹🩼🩺🩻
<*🚪🛗🪞🪟🛏🛋🪑🚽🪠🚿🛁🪤🪒🧴🧷🧹🧺🧻🪣🧼🫧🪥🧽🧯🛒
<*🚬⚰🪦⚱🧿🪬🗿🪧🪪
<*🏧🚮🚰♿🚹🚺🚻🚼🚾🛂🛃🛄🛅
<*⚠🚸⛔🚫🚳🚭🚯🚱🚷📵🔞☢☣
<*⬆↗➡↘⬇↙⬅↖↕↔↩↪⤴⤵🔃🔄🔙🔚🔛🔜🔝
<*🛐⚛🕉✡☸☯✝☦☪☮🕎🔯🪯
<*♈♉♊♋♌♍♎♏♐♑♒♓⛎
<*🔀🔁🔂▶⏩⏭⏯◀⏪⏮🔼⏫🔽⏬⏸⏹⏺⏏🎦🔅🔆📶🛜📳📴
<*♀♂⚧
<*✖➕➖➗🟰♾
<*‼⁉❓❔❕❗〰
<*💱💲
<*⚕♻⚜🔱📛🔰⭕✅☑✔❌❎➰➿〽✳✴❇©®™
< '#⃣' = '#️⃣'
< '*⃣' = '*️⃣'
< 0⃣ = 0️⃣
< 1⃣ = 1️⃣
< 2⃣ = 2️⃣
< 3⃣ = 3️⃣
< 4⃣ = 4️⃣
< 5⃣ = 5️⃣
< 6⃣ = 6️⃣
< 7⃣ = 7️⃣
< 8⃣ = 8️⃣
< 9⃣ = 9️⃣
<*🔟
<*🔠🔡🔢🔣🔤🅰🆎🅱🆑🆒🆓ℹ🆔Ⓜ🆕🆖🅾🆗🅿🆘🆙🆚🈁🈂🈷🈶🈯🉐🈹🈚🈲🉑🈸🈴🈳㊗㊙🈺🈵
<*🔴🟠🟡🟢🔵🟣🟤⚫⚪🟥🟧🟨🟩🟦🟪🟫⬛⬜◼◻◾◽▪▫🔶🔷🔸🔹🔺🔻💠🔘🔳🔲
<*🏁🚩🎌🏴🏳
< 🏳‍🌈 = 🏳️‍🌈
< 🏳‍⚧ = 🏳️‍⚧
< 🏴‍☠
<*🇦🇧🇨🇩🇪🇫🇬🇭🇮🇯🇰🇱🇲🇳🇴🇵🇶🇷🇸🇹🇺🇻🇼🇽🇾🇿
< 🏴󠁧󠁢󠁥󠁮󠁧󠁿
< 🏴󠁧󠁢󠁳󠁣󠁴󠁿
< 🏴󠁧󠁢󠁷󠁬󠁳󠁿
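For reading the rule lines above without ICU at hand, the following Python sketch approximates what they express: `<*abc` is shorthand for `< a < b < c`, and `< x = y` makes `y` sort the same as `x`. This is only a stdlib illustration with a hypothetical `parse_rules` helper. It skips `&` reset lines, assigns a single rank per entry, and is not a substitute for building a real RuleBasedCollator. The sample is a short excerpt of the rules, written with escapes.

```python
def parse_rules(lines):
    """Map each rule entry to a rank that reflects its position in the rules."""
    rank = {}
    next_rank = 0
    for line in lines:
        line = line.strip()
        if line.startswith("<*"):
            # Starred list: every character gets its own, increasing rank.
            for ch in line[2:]:
                rank[ch] = next_rank
                next_rank += 1
        elif line.startswith("<"):
            # One entry, optionally with "=" alternatives sharing the rank.
            for entry in line[1:].split("="):
                rank[entry.strip().strip("'")] = next_rank
            next_rank += 1
        # Lines starting with "&" (resets) are skipped in this sketch.
    return rank


HEART_FIRE = "\u2764\u200d\U0001F525"             # heart ZWJ fire
HEART_FIRE_VS16 = "\u2764\ufe0f\u200d\U0001F525"  # same, with VS16
LEMON = "\U0001F34B"
LIME = "\U0001F34B\u200d\U0001F7E9"               # lemon ZWJ green square
BANANA = "\U0001F34C"

rules = [
    "< " + HEART_FIRE + " = " + HEART_FIRE_VS16,
    "<*\U0001F34A" + LEMON,        # tangerine, lemon
    "< " + LIME,
    "<*" + BANANA + "\U0001F34D",  # banana, pineapple
]

rank = parse_rules(rules)
ordered = sorted([BANANA, LIME, LEMON], key=rank.__getitem__)
print(ordered == [LEMON, LIME, BANANA])  # prints True
```

The lime sequence sorts between lemon and banana only because it has its own entry. Without that contraction it would fall back to the weights of its parts.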