Open Ms2ger opened 8 years ago
@hsivonen is proposing to replace Gecko's encoding converters by a library written in Rust: https://docs.google.com/document/d/13GCbdvKi83a77ZcKOxaEteXp1SOGZ_9Fmztb9iX22v0/edit.
He identifies some arguments against using rust-encoding in its current form. Since I believe it would be great to have the wider Rust ecosystem as well as Servo and Gecko share the same library, I'd appreciate your thoughts on the proposal and whether it would be an option to adjust rust-encoding to the proposed design (keeping the current API in place as much as possible).
@Ms2ger Thank you for the pointer. I haven't read the whole thing, but I will reply to the particular sections mentioning the shortcomings (or design differences) of Encoding here:
- UTF-16 support (the StringWriter etc. is already generic, with a slight preference to UTF-8 though).
- Allocation-free operation (e.g. no_std environments). This is also the reason that I want to separate the low-level API (finer-grained streaming and allocation) from the mid-level API (streaming only), which is the main feature of the long-delayed 0.3.0.

Ultimately, I won't release 1.0.0 without those cases addressed (or explicitly rejected). I have had no energy to work on them for some months (fortunately for me, Encoding seems to be a very stable library), but I can definitely say that they are mostly within reach, and if someone is interested I can always grant the required permissions.
I'm open to any suggestion to further reduce the data table without a significant overhead
I only briefly looked at one example, but that should be replaceable with an FST, since you're using it as a map. (An FST will do compression on both the keys and values, lookup is proportional to the key size, and initializing an FST in memory that has already been created is basically free and trivial to do with lazy_static!.)
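For illustration, a minimal sketch of that approach using the fst and lazy_static crates; the embedded file name and the choice of UTF-8 bytes as keys are assumptions for the sketch, not something rust-encoding does today:

```rust
use fst::Map;
use lazy_static::lazy_static;

// Hypothetical pre-built FST bytes, generated offline and embedded in the
// binary. The file name is made up for this sketch.
static INDEX_FST_BYTES: &[u8] = include_bytes!("index_jis0208.fst");

lazy_static! {
    // "Initializing" the FST is just a header check over the embedded bytes;
    // no decompression or copying happens here.
    static ref INDEX_MAP: Map<&'static [u8]> =
        Map::new(INDEX_FST_BYTES).expect("embedded FST is well-formed");
}

// Map a code point (keyed by its UTF-8 bytes) to the table index stored as
// the FST value.
fn lookup(c: char) -> Option<u64> {
    let mut buf = [0u8; 4];
    INDEX_MAP.get(c.encode_utf8(&mut buf).as_bytes())
}
```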
@Ms2ger:
Since I believe it would be great to have the wider Rust ecosystem as well as Servo and Gecko share the same library, I'd appreciate your thoughts on the proposal and whether it would be an option to adjust rust-encoding to the proposed design (keeping the current API in place as much as possible).
First of all, I think it's completely reasonable from @lifthrasiir's perspective not to make changes and not allow me to make changes because of what I'd like to have for Gecko. In general, from the perspective of a package maintainer, it's not a good idea to let some person on the Internet rewrite the package to their requirements or request that the package be rewritten to their requirements. It probably makes sense for everyone to wait and see if I actually get some code written according to my design.
As for reconciling what I'd like to have for Gecko with having such a framework expose the current rust-encoding API, I think the fundamental API-level mismatches are:
- I'm seeking to have converter internals that write to a caller-provided buffer by treating it as contiguous memory rather than going through a ByteWriter or a StringWriter, so reconciling the approaches would invite the API user to potentially use the API in a way that'd be harmful for performance. It wouldn't be cool to set up such a gotcha for the API user, so if the internals were the kind of internals that I think make sense for Gecko, the API compatibility should probably be limited to String and Vec<u8> instead of any StringWriter or ByteWriter.
- The error reporting model of DecoderTrap. There exists a use case (Validator.nu!) for decoders that indicate which bytes were in error and are, nonetheless, streaming. Having both correct identification of the entire malformed byte sequence and having a streaming API (appropriate for Gecko) is a combination that's just asking for a bad balance between complexity and benefit. (java.nio tries to have that combination, fails to document it properly, and once you've figured out how it works, it's annoying to use--especially if all you want to do is to treat the errors as fatal.) For Gecko, I want decoders that have a reasonable API that doesn't try to indicate which bytes were in error, even if it means that some super-cool View Source error reporting is forever impossible.

@lifthrasiir:
This is also the reason that I want to separate the low-level API (finer-grained streaming and allocation) from the mid-level API (streaming only), which is the main feature of long-delayed 0.3.0.
Is there a design doc saying what this would look like?
Is there a data size comparison?
The only way to get accurate numbers is to build the alternative and compare. It's possible to estimate the table sizes, but that doesn't take into account the size of the more complex lookup code.
But for starters, I'd get rid of all the encode-side index data structures and search the decode-optimized data structures in reverse, which should get rid of at least half of the footprint.
By compression I don't mean stuff like Deflate or Brotli. I mean things like storing only a quarter of Windows-1252 as a lookup table and three quarters as offsets that happen to be zero and omitting ranges of uninteresting index slots as seen in the Gecko Big5 data.
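As a toy illustration of the encode-by-searching-the-decode-table idea described above (the table data is a placeholder, not any real index):

```rust
/// A decode-oriented table: index (pointer) -> code point, with 0 marking
/// unmapped slots. Placeholder data, not a real encoding index.
static DECODE_TABLE: [u16; 6] = [0x4E00, 0x4E8C, 0x4E09, 0, 0x4E94, 0x516D];

/// Decode: a single table lookup.
fn decode(pointer: usize) -> Option<char> {
    DECODE_TABLE
        .get(pointer)
        .copied()
        .filter(|&c| c != 0)
        .and_then(|c| char::from_u32(c as u32))
}

/// Encode: no separate encode-side table; linearly search the decode table.
/// Slower, but costs no extra data.
fn encode(c: char) -> Option<usize> {
    let cp = c as u32;
    if cp > 0xFFFF {
        return None; // this sketch only holds BMP code points
    }
    DECODE_TABLE.iter().position(|&v| u32::from(v) == cp)
}
```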
By compression I don't mean stuff like Deflate or Brotli.
I don't either. :-) An FST does "compression" by virtue of representing a map as a finite state machine and minimizing the machine. Just throwing it out there; I certainly don't know whether it fares better or worse than something more domain-specific, as you suggest.
@hsivonen @Ms2ger @BurntSushi
Okay, I've done some research into the table sizes in my spare time. (The API question is a bit harder to research, and while I have a rough design idea and some code from past months, it will take some more weeks until I can come up with concrete code or a proposal.)
Specifically, I've tested three methods for table "compression":
1. Remapping. The original forward (index-to-char) mapping has some big holes to simplify the decoder description; remapping code has been introduced for the euc-kr, jis0208 and jis0212 indices to reduce the gaps.
2. Remapping plus linear search for the lower part of the backward (char-to-index) mapping trie. The linear search is enabled when there are at most min(4, 2^(4-w)) entries per group of 2^w entries, and is represented by one control word (with #entries * w bits for indices, hence the 2^(4-w) limit) and multiple data words.
3. Remapping plus aggressive linear search for consecutive data words. Similar to method 2, but this time a group of 2^w entries is composed of multiple blocks: either a gap-data block (1+n words) or a gap-run block (2 words). w is valid up to 8, but I stuck with w = 6 to avoid too much linear search.

A crude modification to gen_index.py is available here. I've also tried probably dozens of other methods, but stuck with these as they were easy to implement in the Rust code and gave a good balance between the code and the size.
The results are as follows (all sizes in bytes). For comparison I've also included the sizes of Gecko's current mapping tables:
Index | Gecko | Original | Method 1 | Method 2 | Method 3 |
---|---|---|---|---|---|
ibm866 | 570 | 1276 | 1276 | 948 | 694 |
iso-8859-2 | 656 | 636 | 636 | 602 | 452 |
iso-8859-3 | 534 | 640 | 640 | 522 | 396 |
iso-8859-4 | 702 | 622 | 622 | 604 | 431 |
iso-8859-5 | 154 | 842 | 842 | 649 | 552 |
iso-8859-6 | 166 | 550 | 550 | 463 | 335 |
iso-8859-7 | 266 | 1068 | 1068 | 780 | 578 |
iso-8859-8 | 194 | 834 | 834 | 710 | 544 |
iso-8859-10 | 630 | 898 | 898 | 645 | 655 |
iso-8859-13 | 700 | 898 | 898 | 771 | 682 |
iso-8859-14 | 494 | 1168 | 1168 | 1050 | 600 |
iso-8859-15 | 202 | 838 | 838 | 647 | 562 |
iso-8859-16 | 578 | 1132 | 1132 | 778 | 650 |
koi8-r | 744 | 1372 | 1372 | 1164 | 714 |
koi8-u | 856 | 1404 | 1404 | 1167 | 744 |
macintosh | 804 | 3418 | 3418 | 1919 | 2475 |
windows-874 | 190 | 902 | 902 | 777 | 559 |
windows-1250 | 882 | 1098 | 1098 | 911 | 742 |
windows-1251 | 446 | 1034 | 1034 | 787 | 632 |
windows-1252 | 302 | 1204 | 1204 | 788 | 599 |
windows-1253 | 460 | 1140 | 1140 | 785 | 621 |
windows-1254 | 412 | 1268 | 1268 | 784 | 616 |
windows-1255 | 504 | 1236 | 1236 | 792 | 619 |
windows-1256 | 730 | 1364 | 1364 | 921 | 685 |
windows-1257 | 940 | 1098 | 1098 | 911 | 748 |
windows-1258 | 684 | 1332 | 1332 | 1094 | 663 |
x-mac-cyrillic | 476 | 1320 | 1320 | 996 | 663 |
big5 | N/A | 151232 | 151232 | 111578 | 92544 |
big5 (*) | 37856 | 37680 | 37680 | 37680 | 37680 |
euc-kr | 94356 | 118860 | 105736 | 86830 | 66588 |
gb18030 | 94374 | 106760 | 106760 | 103560 | 81782 |
jis0208 (#) | 95070 | 71464 | 65144 | 53834 | 41512 |
jis0212 | N/A | 59130 | 57646 | 46368 | 33180 |
jis0212 (*) | 12784 | 14206 | 12722 | 12722 | 12722 |
Total | N/A | 538038 | 517110 | 425135 | 338571 |
Total (*) | 348716 | 379558 | 358630 | 317591 | 263249 |
(*) Gecko has no or a very slow encoder implementation for these indices. The corresponding backward mappings in Encoding have also been omitted.
(#) Gecko has a duplicate sjis/euc-jp mapping, so it has been penalized here.
Gecko surely beats Encoding, but not by that much, especially given that it lacks encoding tables for Big5 and JIS X 0212. When those are taken into account, Encoding is only ~10% bigger than Gecko, and much more compression is possible if encoding to legacy encodings doesn't need to be that fast (I think that is the case, given that Big5 encoding is sooooo slow, but I would be glad to be corrected).
It should also be noted that Gecko's current data structure is not very well optimized. Gecko does a binary search through blocks of consecutive data values (which may be optimized for one lone entry or for values with a constant difference from their indices), while Encoding relies on a simple lookup table (plus a bitset for Big5) for the forward mapping and a two-level trie for the backward mapping. Since the constant-difference case is relatively common in CJK encodings, Gecko should theoretically be superior (and that explained ~35% of the original Encoding overhead), but in reality it wasn't.
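For readers unfamiliar with the structure being compared here, a minimal sketch of a two-level-trie backward lookup; the constants, layout and data below are made up and do not match Encoding's generated tables:

```rust
/// Toy two-level trie for the backward (char -> index) mapping: the upper
/// bits of the code point pick a block, the lower bits pick a slot in it.
/// 0xFFFF marks "unmapped"; all data here is a placeholder.
const LOW_BITS: u32 = 6;
static TOP: [u16; 4] = [0, 0xFFFF, 1, 0xFFFF]; // block number per group, 0xFFFF = absent
static BLOCKS: [[u16; 1 << LOW_BITS]; 2] = [[0xFFFF; 1 << LOW_BITS]; 2];

fn backward(c: char) -> Option<u16> {
    let cp = c as u32;
    let block = *TOP.get((cp >> LOW_BITS) as usize)?;
    if block == 0xFFFF {
        return None; // the whole block is unmapped
    }
    let slot = (cp & ((1 << LOW_BITS) - 1)) as usize;
    let index = BLOCKS[block as usize][slot];
    (index != 0xFFFF).then_some(index)
}
```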
Given all this data, I feel that the table size concern is mostly a myth (fortunately!) and it is more a matter of how much to penalize encoding to legacy encodings. What do you think?
Given all this data, I feel that the table size concern is mostly a myth (fortunately!)
How does comparing to current Gecko result in that conclusion? Except for Big5, Gecko currently has tables for both encode and decode. My proposal is to have only decode tables in Gecko in the future (as seen with Big5 today) and to search the decode table (linearly) on the encoder side. This should, trivially, save the size of the encoder tables (while making encode slower), i.e. make the resulting decode-only tables smaller than the encode+decode tables in either current Gecko or rust-encoding.
Since the constant-difference case is relatively common in CJK encodings, Gecko should theoretically be superior (and that explained ~35% of the original Encoding overhead), but in reality it wasn't.
I don't have my analysis script on this computer, but when I looked at consecutive index slots that map to consecutive code points in CJK a couple of weeks back, I noticed that there aren't massive such ranges. (This surprised me, since I had expected Unicode to have taken non-Kanji/Hanja parts of old Japanese and Korean standards more directly.) Therefore, there don't seem to be big wins available from doing offsets in CJK, since the bookkeeping data grows large.
I haven't done a proper analysis of the single-byte encodings yet, but I see more opportunity there: split the 7 bits (of the upper, non-ASCII half) into a 3-bit prefix and a 4-bit suffix, and use the 3-bit prefix to index into a table of 8 16-bit values. The most-significant bit of each value tags it either as a code point offset (add the 4-bit suffix to the offset and use the result as the code point) or as a lookup-index offset (add the 4-bit suffix to the offset and use the result to index into a lookup table common across all single-byte encodings to find the code point).
It's unclear to me if this is worth it, since the single-byte data isn't that large to begin with.
To clarify the thinking behind my single-byte idea:
It obviously adds an overhead of 432 bytes of data plus some additional instruction bytes. What it would save is stuff like the lookup tables for A0 upwards for windows-1252 and 9F downwards for various ISO-8859-* encodings, as well as stuff like E0 to EF for ISO-8859-8 and various similar ranges in the Cyrillic encodings. Additionally, the common lookup table would compress away some duplicate lookup table ranges between similar encodings.
Again, I'm not sure if the savings are worth the trouble.
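A rough sketch of the decode step under that scheme; only the 3-bit/4-bit indexing follows the description above, and all table contents are placeholders. (A per-encoding table of 8 16-bit values is 16 bytes, which is where the 432 bytes for the 27 single-byte encodings in the table above comes from.)

```rust
/// Per-encoding table: 8 entries, one per 3-bit prefix. The most-significant
/// bit tags each entry: set = code-point offset, clear = offset into a shared
/// lookup table. All values here are placeholders, not real encoding data.
static PREFIX_TABLE: [u16; 8] = [
    0x8000 | 0x00A0, // prefix 0: code point = 0x00A0 + suffix
    0x0000,          // prefix 1: shared lookup table, starting at slot 0
    0, 0, 0, 0, 0, 0, // remaining prefixes elided in this sketch
];

/// Lookup table shared across all single-byte encodings (placeholder data).
static SHARED_LOOKUP: [u16; 16] = [0x20AC; 16];

/// Decode one non-ASCII byte (0x80..=0xFF) under this hypothetical scheme.
fn decode_upper_half(byte: u8) -> Option<char> {
    let seven = byte & 0x7F;
    let prefix = (seven >> 4) as usize;   // 3-bit prefix
    let suffix = u16::from(seven & 0x0F); // 4-bit suffix
    let entry = PREFIX_TABLE[prefix];
    let code_point = if entry & 0x8000 != 0 {
        // Code-point offset: offset + suffix is the code point itself.
        u32::from((entry & 0x7FFF) + suffix)
    } else {
        // Lookup-index offset: offset + suffix indexes the shared table.
        u32::from(SHARED_LOOKUP[usize::from(entry + suffix)])
    };
    char::from_u32(code_point)
}
```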
@hsivonen
(To avoid any confusion, the preceding comment was made without fully reading the proposal itself. I have since read it and understand why Big5 encoding is that slow and why comparing against current Gecko is a bit pointless there.)
I now have a working (fully tested and benchmarked) version of Encoding with smaller backward mappings. The mapping function is now at most 50x to 150x slower than the original version (the worst case is for unsupported characters), while the actual encoding process (which is what it is actually tuned for) is only about 10x slower, because the sample texts have no errors and contain many ASCII characters as well. As a ballpark figure, the corresponding results for the latter are as follows:
ORIGINAL
japanese::eucjp_tests::decode 4,382 ns/iter (+/- 2,150) = 137 MB/s
japanese::eucjp_tests::encode 3,797 ns/iter (+/- 2,945) = 237 MB/s
japanese::iso2022jp_tests::decode 4,538 ns/iter (+/- 3,892) = 141 MB/s
japanese::iso2022jp_tests::encode 4,442 ns/iter (+/- 1,234) = 202 MB/s
japanese::windows31j_tests::decode 5,404 ns/iter (+/- 5,289) = 111 MB/s
japanese::windows31j_tests::encode 4,717 ns/iter (+/- 3,802) = 190 MB/s
korean::windows949_tests::decode 5,148 ns/iter (+/- 1,876) = 123 MB/s
korean::windows949_tests::encode 4,116 ns/iter (+/- 4,521) = 217 MB/s
simpchinese::gb18030_tests::decode 5,098 ns/iter (+/- 4,435) = 142 MB/s
simpchinese::gb18030_tests::encode 3,979 ns/iter (+/- 3,264) = 273 MB/s
simpchinese::gbk_tests::encode 4,686 ns/iter (+/- 1,338) = 232 MB/s
simpchinese::hz_tests::decode 5,380 ns/iter (+/- 5,011) = 135 MB/s
simpchinese::hz_tests::encode 5,064 ns/iter (+/- 3,877) = 215 MB/s
tradchinese::bigfive2003_tests::decode 6,367 ns/iter (+/- 2,400) = 114 MB/s
tradchinese::bigfive2003_tests::encode 4,607 ns/iter (+/- 1,453) = 236 MB/s
REDUCED (OLD)
japanese::eucjp_tests::decode 5,226 ns/iter (+/- 1,289) = 115 MB/s
japanese::eucjp_tests::encode 53,703 ns/iter (+/- 15,902) = 16 MB/s
japanese::iso2022jp_tests::decode 5,682 ns/iter (+/- 4,556) = 112 MB/s
japanese::iso2022jp_tests::encode 54,233 ns/iter (+/- 24,292) = 16 MB/s
japanese::windows31j_tests::decode 5,230 ns/iter (+/- 3,271) = 115 MB/s
japanese::windows31j_tests::encode 52,661 ns/iter (+/- 23,542) = 17 MB/s
korean::windows949_tests::decode 5,207 ns/iter (+/- 3,161) = 122 MB/s
korean::windows949_tests::encode 49,466 ns/iter (+/- 6,506) = 18 MB/s
simpchinese::gb18030_tests::decode 5,393 ns/iter (+/- 5,565) = 134 MB/s
simpchinese::gb18030_tests::encode 113,724 ns/iter (+/- 38,308) = 9 MB/s
simpchinese::gbk_tests::encode 110,458 ns/iter (+/- 68,267) = 9 MB/s
simpchinese::hz_tests::decode 4,770 ns/iter (+/- 4,911) = 152 MB/s
simpchinese::hz_tests::encode 118,813 ns/iter (+/- 33,223) = 9 MB/s
tradchinese::bigfive2003_tests::decode 6,032 ns/iter (+/- 1,880) = 120 MB/s
tradchinese::bigfive2003_tests::encode 70,489 ns/iter (+/- 53,987) = 15 MB/s
REDUCED (NEW)
japanese::eucjp_tests::decode 5,793 ns/iter (+/- 5,017) = 104 MB/s
japanese::eucjp_tests::encode 52,997 ns/iter (+/- 26,588) = 16 MB/s
japanese::iso2022jp_tests::decode 6,132 ns/iter (+/- 4,216) = 104 MB/s
japanese::iso2022jp_tests::encode 48,047 ns/iter (+/- 11,639) = 18 MB/s
japanese::windows31j_tests::decode 5,449 ns/iter (+/- 4,586) = 110 MB/s
japanese::windows31j_tests::encode 48,734 ns/iter (+/- 14,315) = 18 MB/s
korean::windows949_tests::decode 5,713 ns/iter (+/- 2,281) = 111 MB/s
korean::windows949_tests::encode 42,401 ns/iter (+/- 18,249) = 21 MB/s
simpchinese::gb18030_tests::decode 5,144 ns/iter (+/- 3,920) = 141 MB/s
simpchinese::gb18030_tests::encode 74,920 ns/iter (+/- 81,547) = 14 MB/s
simpchinese::gbk_tests::encode 74,454 ns/iter (+/- 26,760) = 14 MB/s
simpchinese::hz_tests::decode 5,179 ns/iter (+/- 4,108) = 140 MB/s
simpchinese::hz_tests::encode 75,950 ns/iter (+/- 66,161) = 14 MB/s
tradchinese::bigfive2003_tests::decode 6,232 ns/iter (+/- 203) = 116 MB/s
tradchinese::bigfive2003_tests::encode 64,160 ns/iter (+/- 35,266) = 16 MB/s
The table sizes are as follows (the parenthesized numbers are the sizes with the reduced backward mappings):
[omitted]
generating index big5... (cached) 37680 + 113552 (14694) = 151232 (52374) bytes.
generating index euc-kr... (cached) 34376 + 71360 (7638) = 105736 (42014) bytes.
generating index gb18030... (cached) 47880 + 57216 (8556) = 105096 (56436) bytes.
generating index jis0208... (cached) 15888 + 49256 (5698) = 65144 (21586) bytes.
generating index jis0212... (cached) 12722 + 44924 (618) = 57646 (13340) bytes.
generating index gb18030-ranges... (cached) 832 + 832 (832) = 1664 (1664) bytes.
total 517110 (194326) bytes.
For the single-byte indices a plain linear search is applied, because there is not much point in optimizing a 128-entry linear search. (The corresponding optimized mappings took only 20KB of tables, after all.) For the multi-byte indices the traditional two-level trie is used, where the lower layer contains a list of index ranges s <= ... < e that may correspond to a given char value. The total number of index entries scanned is limited to 256. I've also experimented with 512, 1024 and 4096 and concluded that they are not really worth it. For example, 512 reduces the table by 9KB at the expense of a 2--3x slowdown.
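Roughly, the reduced backward lookup can be pictured like this; the range layout and data below are made up for illustration and do not match the generated tables:

```rust
/// Forward (index -> code point) table; 0 marks unmapped slots. Placeholder data.
static FORWARD: [u16; 8] = [0x4E00, 0x4E8C, 0x4E09, 0, 0x4E94, 0x516D, 0x4E03, 0x516B];

/// The lower trie layer: for the group selected by the upper bits of the
/// char, a list of index ranges s..e that may map into that group.
/// Placeholder: a single group whose candidates are all eight indices.
fn candidate_ranges(_c: char) -> &'static [(u16, u16)] {
    &[(0, 8)]
}

/// Encode one char by scanning the forward table over the candidate ranges;
/// the generator bounds the total number of entries scanned (256 above).
fn backward(c: char) -> Option<u16> {
    let cp = c as u32;
    for &(s, e) in candidate_ranges(c) {
        for i in s..e {
            if u32::from(FORWARD[i as usize]) == cp {
                return Some(i);
            }
        }
    }
    None
}
```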
EDIT: I've special-cased single-entry ranges so that the table size does not change but cache misses are reduced a lot (I hadn't noticed this earlier because I originally focused on the impact on the data size). This optimization improves GB 18030 a lot.
Does this trade-off look fine to you?
Why does the size reduction slow down decode, too?
As for what tradeoff level is right for encode, it's hard to say. The way I'm looking at it, it's good to reduce the size legacy encodings take in Gecko as much as is possible without making non-UTF-8 pages that have unescaped non-ASCII in URLs take a noticeable perf hit or non-UTF-8 form submission with normal amounts of the sort of text that the user would realistically enter into the form noticeably slow.
The main reason why I care about size in Gecko is that engineering is blocked from advancing non-legacy i18n correctness (e.g. collation) on Android due to total app size considerations, so making legacy stuff that's not perf-sensitive take as little space as possible seems reasonable. Of course, everything is perf-sensitive if perf becomes user-perceivably awful.
If it's easy to parametrize the tradeoff at compile time, one option would be zapping encode table totally on Android and letting there be some encode acceleration elsewhere.
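Hypothetically, such a compile-time switch could be a Cargo feature gating the encode-side tables; the feature name and tables below are made up for this sketch (lifthrasiir later pushed an actual no-optimized-legacy-encoding feature, mentioned below):

```rust
// Hypothetical compile-time switch. With the feature off (e.g. on Android),
// no encode-side table is compiled in and encode falls back to scanning the
// decode-oriented table.

#[cfg(feature = "encode-accel")]
mod encode_impl {
    // In a real build this table would be generated by gen_index.py.
    static BACKWARD: &[(u16, u16)] = &[(0x4E00, 0), (0x4E8C, 1)];

    pub fn encode(c: char) -> Option<u16> {
        let cp = c as u32 as u16; // BMP-only sketch
        BACKWARD.iter().find(|&&(k, _)| k == cp).map(|&(_, v)| v)
    }
}

#[cfg(not(feature = "encode-accel"))]
mod encode_impl {
    static FORWARD: &[u16] = &[0x4E00, 0x4E8C]; // placeholder decode table

    pub fn encode(c: char) -> Option<u16> {
        let cp = c as u32 as u16; // BMP-only sketch
        FORWARD.iter().position(|&v| v == cp).map(|i| i as u16)
    }
}

pub use encode_impl::encode;
```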
@hsivonen
Why does the size reduction slow down decode, too?
Partly because the "reduced (old)" and "reduced (new)" benchmarks were not run at the same time: decoding performance was mostly the same between them, while encoding performance improved as I've indicated. If you'd like, I'll update the results, but keep in mind that cargo bench is not very reliable at producing reproducible results.
Partly because the forward mapping table is "compressed" (actually, remapped) to avoid large gaps (as you have suggested in the proposal): the Japanese and Korean encodings are affected and slowed down by about 25%. This corresponds to the size difference between "original" and "method 1" above (about 20KB). I'm not sure whether taking this remapping out would improve or degrade performance.
The main reason why I care about size in Gecko is that engineering is blocked from advancing non-legacy i18n correctness (e.g. collation) on Android due to total app size considerations, so making legacy stuff that's not perf-sensitive take as little space as possible seems reasonable.
Does the uncompressed app size matter here? Otherwise you can make the table data more compressible (by LZ77, obviously) by simply storing a difference value instead, i.e. storing x - f(x) instead of f(x). While you are correct that there are no "big" consecutive regions in those tables, I was able to exploit smaller strips of consecutive regions (e.g. KS X 1001 Hangul, the GB 18030 GBK compat mapping), so that may be a good short-term solution.
I also wonder how much you can get with (say) 200KB of data and code. (That's the entire size of the backward mappings in Gecko, FYI.) Any pointer would be appreciated.
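A toy illustration of the x - f(x) idea with a made-up table; each consecutive strip turns into a run of identical words, which is exactly what a byte-oriented compressor can exploit:

```rust
/// Hypothetical forward mapping f(x): index -> code point, with small
/// consecutive strips (placeholder data).
static FORWARD: [u16; 6] = [0xAC00, 0xAC01, 0xAC02, 0xAC10, 0xAC11, 0xAC12];

/// Store x - f(x) (wrapping) instead of f(x): within a consecutive strip the
/// delta is constant, so the stored words repeat and compress well.
fn to_deltas(table: &[u16]) -> Vec<u16> {
    table
        .iter()
        .enumerate()
        .map(|(x, &fx)| (x as u16).wrapping_sub(fx))
        .collect()
}

/// Recover f(x) at lookup time from the delta form.
fn forward_from_deltas(deltas: &[u16], x: usize) -> u16 {
    (x as u16).wrapping_sub(deltas[x])
}
```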
If it's easy to parametrize the tradeoff at compile time, one option would be zapping encode table totally on Android and letting there be some encode acceleration elsewhere.
gen_index.py takes about 30 seconds on my laptop, so integrating it into the build process is easy (I probably need to expose some options though). At the moment the backward mapping tables weigh 38KB, and it is possible to optimize that further at the expense of further performance degradation.
If you agree to this plan I can commit a version of Encoding with a customizable gen_index.py soon (probably tomorrow or so; I've got a one-week vacation until the end of the year :).
(Just to be clear, I've pushed the no-optimized-legacy-encoding feature as 726d6dae7c910a3342a187e5618b8d7e128d03ff.)
@lifthrasiir What's the unit of your table? Byte Or Bit?
@lygstate IEEE 1541-2002 standardizes on B as an abbreviation of byte.
@lifthrasiir
[quoted table-size comparison comment above omitted]
Tell me, where is the B
@lygstate Wait, wasn't that about this comment? Anyway, the unit in that table is bytes (implied in other ways, but to be clear); I would never use bits as a unit there.
@lifthrasiir
Whoa. I had written a reply already. I have no idea why the submission has apparently failed. Sorry. Trying again:
Does the uncompressed app size matter here?
Yes. Storage on cheap devices is sometimes cited as a problem, but in any case, libxul.so is not compressed in the APK (IIRC to make it faster to access at run-time), so the code is delivered over the network and stored on the device as uncompressed. (Yes, it's sad.)
I also wonder how much you can get with (say) 200KB of data and code. (That's the entire size of backward mappings in Gecko, FYI.) Any pointer would be appreciated.
Bug 864843 was the original bug where landing of ICU (converters excluded already) was blocked on Android over size concerns despite the non-landing causing badness in various places in the codebase because we can't categorically get rid of Netscape-era legacy code, because Android uses the legacy code instead of ICU for collation, etc. The bug is now marked as FIXED, because the bug morphed in scope to cover b2gdroid only. The remains of the original scope moved to bug 1215247.
As can be seen in those bugs, ICU is over a 3 MB increase, so a 200 KB decrease cannot buy enough space for ICU. On one hand, there is no single 3 MB decrease to be found, so finding 200 KB in one place is already a lot. On the other hand, if it's not enough, then there's a possibility that the saving will go to waste (i.e. the space gets taken by random careless stuff).
I expect there's not going to be a 3 MB decrease, but I was planning on arguing that we should take ICU after making an effort to make some real savings, after documenting more thoroughly the badness of having to keep around Netscape-era non-ICU code paths just for Android, and after Web apps start to expect the ICU-enabled features to be there in Chrome for Android.
As for whether it makes sense to have something faster than linear search for the encoder side at a relatively small space cost, I don't have an idea of how much space cost is appropriate for what speedup. My current thinking is that if linear search is good enough for the use cases, then any optimization is YAGNI. My current thinking may well be misguided and I haven't proven the "good enough" part.
Yesterday's conversation with @hsivonen on #rust convinced me of the lowest-level redesign. I had already agreed that the half-filled output is problematic (that is partly the reason the current stateful decoder is complex), but wasn't sure about the other reasons (which turned out to be an excessive number of virtual calls). Relevant logs:
<Yurume> hsivonen: ah, I think I have asked about the specific API design of your encoding-rs proposal. pasting the log:
<Yurume> ii) I don't understand exactly why self-growing buffer (besides from no UTF-16) is bad. is it just a callback avoidance, or is there any more important reason?
<hsivonen> Yurume: if you allow custom self-growing buffers, the write to the buffer is a virtual call, right?
<Yurume> in the current Encoding API it would be.
<hsivonen> Yurume: also, it prevents the caller from allocating one buffer, e.g. one memory page, and recycling it
<hsivonen> Yurume: thirdly, when the caller is OK with segmenting the output across multiple buffers, it causes useless copying
<hsivonen> Yurume: for example, in Gecko, the HTML parser has a linked list of UTF-16 buffers
<hsivonen> Yurume: so it doesn't want the decoder to realloc. Instead, it wants to add a buffer to its linked list on its own
<Yurume> virtual call concern is what I have indeed missed (thank you for the answer!). I think a careful API design may solve recycling situation and multiple buffers, but that would give more burden to the caller.
<Yurume> hsivonen: make sure that any identifier that is free in the macro body is lexically available at the position of macro definition itself
<hsivonen> Yurume: aside: it looked to me like your single-byte decoder had a virtual call in the inner loop
<Yurume> I think so
<Yurume> the current implementation is not really very tuned
<Yurume> mostly because, well, I didn't have a real-world situation that pushes the boundary
C:\CI-Cor\ks\iconv-full>babel-node generation\fm-dbcs.js
Need space for big5 is 62246 Bytes, table count: 2
Need space for gbk is 50014 Bytes, table count: 2
Need space for jis0208 is 33930 Bytes, table count: 3
Need space for jis0212 is 28108 Bytes, table count: 1
Need space for euc-kr is 48994 Bytes, table count: 1
DBCS encodings regenerated.
Now I have investigated a new way to do the encoding conversion with a succinct data structure. It's even better than method 3 presented before. There is no compression of the data. The time complexity for each table is O(log N) + 4, where the 4 comes from log(65536)/log(log(65536)) and N is the size of the whole compressed array.
The main idea comes from http://arxiv.org/pdf/0902.1038.pdf. Because all the code point counts are less than 32768 (N <= 32768, so log N <= 15), the space needed is (N log N + N) bits = 16N bits = 2N bytes for the whole permutation.
@lygstate The figure looks interesting, but please note that method 3 above was a proof of concept that was much slower than the current implementation; you should look at this comment to see the current figures.
Your scheme seems to have 223K of total data, compared to 486K optimal (which really equals "method 1" above) and 185K traded-off (10x slower); it wouldn't match @hsivonen's intended use case anyway (the required encoding perf there is bounded by network I/O, so a 10x slowdown is probably acceptable) and would result in a significant slowdown in the "optimized" encoding process. [1] Even if it does not impact decoding performance (and it should!), it may incur too much complexity for the size reduction. Ultimately, that approach requires actual Rust code that can be tested and benchmarked; I'll gladly review and test such a PR.
[1] My wild guess is that your scheme will be 3--5x slower depending on the trade-off; the current trie-based encoding is essentially two memory ops plus a handful of arithmetic operations. The succinct dictionary that forms the basis of the wavelet tree already requires three or more memory ops, and that's just to locate the easily compressible run including that index (if you have faithfully implemented the paper). My personal threshold for encoding slowdown in the "optimized" build is, by comparison, probably 1.5--2x (about 100MB/s on my fastest box). [2]
[2] Of course, as I've indicated above, the encoder and decoder using those functions are not super-optimized, so this figure may change.
@lifthrasiir I didn't test the performance yet, but there won't be a big slowdown.
@lygstate If you think there won't be a big slowdown, please file a PR showing that.
@lifthrasiir I am working on that.
It probably makes sense for everyone to wait and see if I actually get some code written according to my design.
Now that I've written the code and it has been deployed, I feel I should follow up here.
I'm seeking to have converter internals that write to a caller-provided buffer by treating it as contiguous memory
I think this was the right design decision considering the Gecko use case. It allowed for a clean C++ wrapper and it allowed for SIMD acceleration of the ASCII range, which is very common in HTML/CSS/JS inputs.
There exists a use case (Validator.nu!) for decoders that indicate which bytes were in error and are, nonetheless, streaming. Having both correct identification of the entire malformed byte sequence and having a streaming API (appropriate for Gecko) is a combination that's just asking for a bad balance between complexity and benefit.
During the development of encoding_rs I came up with a solution that imposes no complication on callers that don't care about how errors are attributed to input bytes but still allows for precise attribution for callers that do care (that are then responsible for remembering up to 6 bytes of past input).
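As a sketch of the general shape of that solution (hypothetical names, not encoding_rs's literal API): the decoder reports only the length of the malformed sequence and how many bytes it already consumed after it, and a caller that wants precise attribution combines that with its own bookkeeping:

```rust
/// A hypothetical streaming-decode result. The decoder never hands back the
/// erroneous bytes themselves; it only describes where they were relative to
/// what it has consumed so far.
enum DecodeResult {
    /// Ran out of input; call again with more.
    InputEmpty,
    /// Ran out of output space; call again with a fresh buffer.
    OutputFull,
    /// A malformed sequence `len` bytes long was seen, and `consumed_after`
    /// good bytes were already consumed after it.
    Malformed { len: u8, consumed_after: u8 },
}

/// Only a caller that cares about attribution (e.g. Validator.nu-style error
/// reporting) does this extra bookkeeping, using its running count of bytes
/// fed to the decoder (and, if needed, the last few bytes it kept around).
fn malformed_range(total_consumed: usize, len: u8, consumed_after: u8) -> (usize, usize) {
    let end = total_consumed - consumed_after as usize;
    let start = end - len as usize;
    (start, end) // byte offsets of the malformed sequence in the input stream
}
```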
The main reason why I care about size in Gecko is that engineering is blocked from advancing non-legacy i18n correctness (e.g. collation) on Android due to total app size considerations, so making legacy stuff that's not perf-sensitive take as little space as possible seems reasonable
encoding_rs as configured in Firefox ended up shipping without any encode-specific data tables. That is, the legacy encoders work by doing linear search over decode-oriented data tables (except for EUC-KR Hangul, which is nicely enough sorted that binary search works).
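A minimal sketch of that binary-search special case, with placeholder data standing in for the decode-oriented table:

```rust
/// Hypothetical slice of a decode-oriented table for EUC-KR Hangul pointers:
/// the code point values happen to be in ascending order (placeholder data).
static HANGUL_DECODE: [u16; 5] = [0xAC00, 0xAC01, 0xAC02, 0xAC03, 0xAC04];

/// Encode by binary-searching the decode table; no encode-specific table.
fn hangul_pointer(c: char) -> Option<usize> {
    let cp = c as u32;
    if cp > 0xFFFF {
        return None; // BMP-only sketch
    }
    HANGUL_DECODE.binary_search(&(cp as u16)).ok()
}
```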
I tested that the wall-clock time for encoding to legacy encodings the amount of Japanese or Chinese text that one might expect a user to submit as one form submission was within acceptable wall-clock time on a Raspberry Pi 3, which stood in for a slow phone.
The result was that Gecko's binary size went down despite encoding_rs adding decode to UTF-8 and encode from UTF-8--i.e. the exposed feature set grew.
The above regressed Japanese and Chinese legacy encode speed compared to uconv, so as a backup plan (had the above turned out to be user-visibly slow despite the RPi measurement), I implemented an option (the default for encoding_rs) to have binary search-based encode acceleration tables for encoding ideograph into the legacy Japanese and Chinese encodings. (I figured that ideographs have so little use in the modern Korean context that I didn't do this for EUC-KR.)
Obviously, this is still slower than the approach rust-encoding takes. I don't know if anyone actually wants the half-way approach, but I haven't had a good enough reason to implement yet another approach in order to win encode benchmarks.
(encoding_rs is faster than rust-encoding when decoding HTML content. When encoding plain-text content, encoding_rs is faster for "encode" to UTF-8, which just returns a borrowed Cow, and on the legacy side for encoding Latin-script languages with infrequent non-ASCII, such as English, German, French and Portuguese. For everything else, including Latin-script languages with frequent non-ASCII, such as Turkish and Czech, rust-encoding is faster at encoding to legacy encodings. This result is expected considering the design priorities and could be changed with relative ease at the cost of binary size.)