jmoenig / Snap

a visual programming language inspired by Scratch
http://snap.berkeley.edu
GNU Affero General Public License v3.0
1.49k stars, 744 forks

Unable to report length of very long text #3183

Open ToonTalk opened 1 year ago

ToonTalk commented 1 year ago

untitled script pic

"too long" was originally created as the encoding of 56 costumes, but here I just dragged a 224MB file into 8.1.6.

DarDoro commented 1 year ago
  • Current (with Arrays) implementation accepts up to 1×10⁷ and gets an error with 2×10⁷.
  • String implementation accepts up to 4×10⁷ (taking quite a while) and 5×10⁷ crashes my browser (without a preferable error)

"Numbers from..." already stressed out the browser. I've built the test data: [image: long text script pic]

And got a string of max length 530×10⁶. "length of text" breaks at 125×10⁶. Chrome@Win10.64, i7-6700HQ, 2.6GHz

Benchmarks for the long text: [image: long text script pic (2)] [image: long text script pic (1)]

The Intl.Segmenter() version runs forever and should be considered MIA ;)

cycomachead commented 1 year ago

I think perf is important, but it's also important to consider when and how much -- the majority case for these operations is small amounts of data and frequent comparisons/calls. Locally it seems like normalize() and the like aren't horribly different for small text.

I also definitely consider the lack of support in "letter of" to be a bug right now.

I tend to agree with BH's solution, though, that we should do the simple but clear thing first. But we can/should improve the code to not allocate a whole array too. That seems fairly straightforward.
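A minimal sketch of what "not allocating a whole array" could look like. The helper names `countCodePoints` and `letterAt` are made up for illustration and are not Snap!'s internal functions; the sketch only assumes that "letters" are Unicode code points.

```javascript
// Count Unicode code points by iterating the string instead of building
// an array of all letters first. String iteration yields one code point
// per step, so a surrogate pair is counted once.
function countCodePoints(text) {
    let count = 0;
    for (const _ of text) {
        count += 1;
    }
    return count;
}

// Return the n-th code point (1-based, like a "letter of" block),
// stopping as soon as it is reached. Reports '' when n is out of range
// (an assumed convention for this sketch).
function letterAt(text, n) {
    let i = 0;
    for (const ch of text) {
        i += 1;
        if (i === n) {
            return ch;
        }
    }
    return '';
}

console.log(countCodePoints('a\u{1F60A}b')); // 3
console.log(letterAt('a\u{1F60A}b', 3));     // 'b'
```

Iteration still creates one short string per step, so this trades memory for time rather than making the count free.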

Dealing with large data as a general thing is probably a separate task, since multi-hundred-MB text files can just crash the browser tab. In Safari, I didn't run into the same errors, but a hang...so that's not great either.

brianharvey commented 1 year ago

Can we build an efficient stream-based solution for huge data?
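One possible shape for a stream-based count, sketched under assumptions: the data arrives as UTF-8 byte chunks (`Uint8Array`), and the function name `countCodePointsInChunks` is hypothetical. It relies only on the standard `TextDecoder` and `TextEncoder` APIs.

```javascript
// Count code points in UTF-8 data that arrives in chunks, without ever
// holding the whole text in memory as a single string.
function countCodePointsInChunks(chunks) {
    const decoder = new TextDecoder('utf-8');
    let count = 0;
    for (const chunk of chunks) {
        // stream: true makes the decoder hold back an incomplete byte
        // sequence at the end of a chunk until the next chunk arrives.
        const piece = decoder.decode(chunk, { stream: true });
        for (const _ of piece) {
            count += 1;
        }
    }
    // Flush whatever is still buffered after the last chunk.
    for (const _ of decoder.decode()) {
        count += 1;
    }
    return count;
}

// Usage: cut the bytes of a short text in the middle of the emoji.
const bytes = new TextEncoder().encode('a\u{1F60A}b'); // 1 + 4 + 1 bytes
const parts = [bytes.slice(0, 3), bytes.slice(3)];
console.log(countCodePointsInChunks(parts)); // 3
```

In a browser the chunks could come from reading a dropped file piece by piece; whether that fits Snap!'s architecture is an open question in this thread.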

cycomachead commented 1 year ago

Theoretically yes, though this is the place where putting effort first into an offline app will likely get us much further. We will still have battles, since browsers are so far much more optimized towards smaller amounts of data, but we would gain lots of control. First order is definitely not to crash or error. Then we should work on building efficient tools. Practically speaking, though, we also need to invest in cloud stuff to make this work well enough.

-- Michael Ball

On Feb 24, 2023, at 5:26 PM, Brian Harvey wrote: Can we build an efficient stream-based solution for huge data?


ToonTalk commented 1 year ago

@brianharvey

What I meant by "always do the wrong thing" is to report the byte count of strings rather than the character count. This allows projects to handle longer strings without blowing up, but always gives the wrong answer. (Even more always than I thought, if JS represents all characters in 16-bit chunks!)

If I understand what you are saying, I think there are misconceptions about what JavaScript's string length does. First, UTF-16 uses one or two 16-bit code units (2 or 4 bytes) per character. And string length does not report the number of bytes but the number of code units, which is exactly the number of Unicode characters if they are all among the first 64k characters. I guess this is what you mean by sometimes the wrong answer while the byte count would be consistent. But the number of code units is also consistent.
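A small sketch to make the different counts concrete. The helper name `textSizes` is made up for illustration; everything it calls is standard JavaScript.

```javascript
// Report the different "sizes" of a JavaScript string side by side.
function textSizes(text) {
    return {
        // What String.prototype.length reports: UTF-16 code units.
        codeUnits: text.length,
        // Unicode code points; a surrogate pair counts once.
        codePoints: Array.from(text).length,
        // Every UTF-16 code unit occupies 2 bytes.
        utf16Bytes: text.length * 2,
        // Size of the same text when encoded as UTF-8.
        utf8Bytes: new TextEncoder().encode(text).length
    };
}

// U+00E9 (é) is inside the first 64k characters: 1 code unit, 1 code point.
console.log(textSizes('\u00e9'));
// U+1F60A (😊) is outside them: 2 code units, but still 1 code point.
console.log(textSizes('\u{1F60A}'));
```

For text that stays within the first 64k characters, `codeUnits` and `codePoints` agree; they diverge only once a character needs a surrogate pair.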

ToonTalk commented 1 year ago

Another thing to fix is that emojis sometimes need more than 2 units. E.g. Untitled script pic (95)

brianharvey commented 1 year ago

What I was saying is that we shouldn't report character count until that overflows and then switch to reporting code unit count. Although now that I think it through, that wouldn't be non-monotonic; there'd just be a big gap, like the one in the calendar.

jguille2 commented 1 year ago

Oh my god! I wanted to see this issue with flags...

At the beginning I thought the problem was only with flags... because there is a lot of controversy there (political issues) http://blog.unicode.org/2022/03/the-past-and-future-of-flag-emoji.html

But this is not the real problem! Unicode creates new characters (views) just by adding other (real) characters! And I think this is more than a "data construction" (about adding bytes). Really, it's adding characters to create a "character" that really is more than one.

Example: brownman

Two characters? Yes, because: man

and brown

And so, we can do: manplusbrown

Then, Snap! is right. That rainbow flag is 4 characters, because it is 4 characters together creating a "merging" visualization

ToonTalk commented 1 year ago

Yes - 4 characters. But isn't this very confusing: untitled script pic (96)

jguille2 commented 1 year ago

Yes! It's confusing...

But I can't do anything about this. Because pictographs (these single glyphs) are not characters. Another example to show this:

blackcat

That black cat is a single glyph, but it is 3 characters!
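The counts in these examples can be checked with a short sketch. The helper name `codePointsOf` is hypothetical, and the skin-tone modifier used for the "brown man" (U+1F3FE) is an assumption, since the original images are not available here.

```javascript
// List the code points that make up a piece of text, in U+XXXX notation.
function codePointsOf(text) {
    return Array.from(text, ch =>
        'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// man + skin-tone modifier (assumed to be medium-dark, U+1F3FE)
const brownMan = '\u{1F468}\u{1F3FE}';
// white flag + variation selector + zero width joiner + rainbow
const rainbowFlag = '\u{1F3F3}\uFE0F\u200D\u{1F308}';
// cat + zero width joiner + black large square
const blackCat = '\u{1F408}\u200D\u2B1B';

console.log(codePointsOf(brownMan));    // [ 'U+1F468', 'U+1F3FE' ]
console.log(codePointsOf(rainbowFlag)); // [ 'U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308' ]
console.log(codePointsOf(blackCat));    // [ 'U+1F408', 'U+200D', 'U+2B1B' ]
```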

Ok. Maybe it's not clear enough... but since I think our current behaviour (with Arrays) is quite good, I'll send a PR to fix the current "letter of" and "position of" so they are aligned with our current "length" and "unicode" blocks.

Joan

jguille2 commented 1 year ago

Hi! Trying to fix "letterOf" and "positionOf"... If anyone wants to test it... https://jguille2.github.io/snapTestingEmojis/snap.html

DarDoro commented 1 year ago

Yes! It's confusing... but it seems that the hyperised "unicode of" does a quite meaningful job. For Hindi ligatures this time: [image: long text script pic (3)] [image: long text script pic (4)] and consistent: [image: long text script pic (5)]


BTW: direct play with the unicode composition may be worth a separate unit of the curriculum

long text script pic (6)

ToonTalk commented 1 year ago

As I was doing a (cursory) test I noticed that when you select an emoji like 😊 it looks like this: image

Didn't find anything wrong...

ToonTalk commented 1 year ago

Perhaps [image: Untitled script pic (97)] isn't the best algorithm for "position of", since it breaks the large text into any number of pieces when it only cares about the first one. It creates potentially lots of temporary data to be garbage collected, and it doesn't stop when a different algorithm could after finding the position.

jguille2 commented 1 year ago

Thanks Ken. Yes. "position of" can't be directly fixed by replacing strings with arrays... because the "needle" is not always an array item; it can be a subarray. Then I thought of using indexOfString "normally" and, after that, finding the "real" cutting point (not the byte, nor the string position... just the "letter" position). But using "split" I don't introduce more primitives... and it's really "coherent" with our internal behavior. I guess Jens and Brian will consider this and choose the best solution. If performance (to better support large data) is chosen, we can make alternatives for that "position" block. But I love using current primitives in custom blocks (as at the beginning) because it's a very good option to learn and to move between the low floor and the no ceiling we want to offer...
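The "search first, then find the real letter position" idea could be sketched as follows. This is not the code from the PR; `positionOf` is a hypothetical helper, "letters" are assumed to be code points, reporting 0 for "not found" is an assumed convention, and both strings are assumed to be well-formed UTF-16.

```javascript
// Find the 1-based letter (code point) position of needle in haystack
// without splitting the haystack into pieces.
function positionOf(needle, haystack) {
    // Native search; the result is an index in UTF-16 code units.
    const unitIndex = haystack.indexOf(needle);
    if (unitIndex < 0) {
        return 0;
    }
    // Convert the code-unit index into a letter position by counting the
    // code points before it. The second half of a surrogate pair (a low
    // surrogate) does not start a new code point, so it is skipped.
    let position = 1;
    for (let i = 0; i < unitIndex; i += 1) {
        const unit = haystack.charCodeAt(i);
        const isLowSurrogate = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!isLowSurrogate) {
            position += 1;
        }
    }
    return position;
}

console.log(positionOf('b', 'a\u{1F60A}b')); // 3
console.log(positionOf('x', 'abc'));         // 0
```

The loop stops at the match and allocates nothing, which addresses the garbage-collection concern, at the cost of one more primitive-level helper.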

And yes Dariusz. It was a pity, all this mess of emojis... but after all... (with a coherent set of lengthOf, letterOf, unicode...) Snap! will be a very nice tool to explore this emoji world, just showing clearly that "letters" (our word for characters in different blocks) are not the same as glyphs. And seeing that a single glyph can be made with more than one letter, and with the tools to explore, test, add, store... we can play a lot: creating family pictures (like your example) or merging the world of words and pictures ("a black cat"...)

brianharvey commented 1 year ago

I am only an egg...

How does anyone know whether to display a string of atomic emojis or a single molecular emoji? And where to split a very long string of atomic emojis into the particular grouping the user intends? Are there "parenthesis" codes?

As a naive user, when I look at a string of text-things, I expect one text-thing per visible character. If some visible characters are encoded with multiple text-numbers ("codes," but I'm a naive user, so I don't know that technical term), then that string of codes should be represented as a sublist, a list of codes as a single element of the text string. Sadly, text strings aren't lists so it can't be done exactly like that, but I propose that we use lists anyway, under the hood. And the UNICODE OF block could report a plain old list for molecular characters. The UNICODE AS LETTER block could accept a list of codes as input.
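A straw-man sketch of the proposed block behavior, in plain JavaScript rather than Snap! blocks. The function names `unicodeOf` and `unicodeAsLetter` are invented for this illustration and do not refer to existing Snap! primitives.

```javascript
// UNICODE OF: report a single number for a simple character and a plain
// list of numbers for a "molecular" character made of several codes.
function unicodeOf(letter) {
    const codes = Array.from(letter, ch => ch.codePointAt(0));
    return codes.length === 1 ? codes[0] : codes;
}

// UNICODE AS LETTER: accept either a single code or a list of codes.
function unicodeAsLetter(codes) {
    return Array.isArray(codes)
        ? String.fromCodePoint(...codes)
        : String.fromCodePoint(codes);
}

const blackCat = '\u{1F408}\u200D\u2B1B';
console.log(unicodeOf('A'));      // 65
console.log(unicodeOf(blackCat)); // a list of three codes
console.log(unicodeAsLetter(unicodeOf(blackCat)) === blackCat); // true
```

The two functions are inverses of each other, so tearing a character apart and putting it back together is lossless.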

We should take this as the first use case for the idea we've been kicking around about how to have (primitive and user-created) abstract data types consisting of a list with the underlying atomic values and a type-tag that includes at least the name of the type and a procedure that generates the printform for that type. So the prototypical type tag would be (rational, "%1 ∕ %2"). (That isn't the ASCII slash, but the Unicode division slash btw.) But the one under discussion would be (compound character, UNICODE %0 AS LETTER) or something, supposing %0 means "a list of all the pieces." (I'm not at all insisting on this particular representation; it's just a straw-man version to illustrate the general idea.)
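To illustrate the general idea only, here is one hypothetical way such a tagged value could look in JavaScript. The names `makeTyped` and `display`, and the object layout, are assumptions of this sketch, not a proposed Snap! design.

```javascript
// A tagged value: the underlying atomic parts plus a type tag made of a
// type name and a procedure that generates the print form.
function makeTyped(typeName, printForm, parts) {
    return { typeName, printForm, parts };
}

// Show a tagged value by running its own print-form procedure.
function display(value) {
    return value.printForm(value.parts);
}

// (rational, "%1 ∕ %2"), using U+2215 DIVISION SLASH, not the ASCII slash.
const half = makeTyped('rational', ([n, d]) => `${n} \u2215 ${d}`, [1, 2]);

// (compound character, UNICODE %0 AS LETTER), where %0 is the whole list.
const blackCat = makeTyped(
    'compound character',
    codes => String.fromCodePoint(...codes),
    [0x1F408, 0x200D, 0x2B1B]
);

console.log(display(half));          // 1 ∕ 2
console.log(blackCat.parts.length);  // 3 underlying codes, shown as one glyph
```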

I guess I am slightly disagreeing with Joan about how we should use the word "letter." He wants it to mean Unicode-code, and I want it to mean glyph, so that Unicode-naive users will see what they expect: LENGTH OF TEXT will give them the same answer they would get by counting the visible string, even if they don't know to call what they want a "glyph." And we could have a Unicode library with blocks for tearing characters apart into codes, or even further apart into bytes maybe.

Don't hit me please.

cycomachead commented 1 year ago

ugh...y'all these are all understandable problems. None of them are new to Snap!. They just take time to build.

Multi-glyph emojis are joined by a 'Zero Width Joiner' character, which means their constituent parts are each valid emoji, which actually makes this a neat lesson for students. Unlike some of the other 2 and 3 byte characters, which can't be split/combined, emoji have some fun properties, especially when you talk about skin tones.
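A minimal sketch of that lesson: splitting a joined emoji at the Zero Width Joiner (U+200D) and recombining the parts. The helper names `splitEmoji` and `joinEmoji` are made up for illustration.

```javascript
const ZWJ = '\u200D'; // Zero Width Joiner

// Break a joined emoji sequence into its constituent emoji.
function splitEmoji(text) {
    return text.split(ZWJ);
}

// Glue emoji together with the joiner. Whether the result is drawn as
// one glyph depends on the font and platform.
function joinEmoji(parts) {
    return parts.join(ZWJ);
}

// man + ZWJ + woman + ZWJ + girl
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(splitEmoji(family));                       // three separate emoji
console.log(joinEmoji(splitEmoji(family)) === family); // true
```

Skin tones work differently: the modifier follows the base emoji directly, without a joiner, so splitting on U+200D leaves a skin-toned emoji in one piece.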

The fact that editing can split and break characters is a function of the fact that the String interface in JS is not aware of Unicode, and indeed that can happen on some math symbols and the like as well.

cycomachead commented 1 year ago

BTW: direct play with the unicode composition may be worth a separate unit of the curriculum

We do this somewhat in our middle school curriculum! Being able to split and recombine emoji is useful, though I would argue this is a case where the defaults should do the right thing and splitting at the code point-level should be a separate function.

brianharvey commented 1 year ago

the defaults should do the right thing and splitting at the code point-level should be a separate function.

+1. What he said.

cycomachead commented 1 year ago

Actually let me just summarize the approaches here:

Each of these is strictly more accurate, but has more trade offs. Most notably the Segmenter API doesn't work in Firefox, and doesn't provide anything more than an iterator to access the results, so it's more annoying to work with.
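A hedged sketch of how the Segmenter API might be wrapped, with a code-point fallback for browsers that lack it. `lettersOf` is a hypothetical helper; `Intl.Segmenter` with `granularity: 'grapheme'` is the standard API, and its result is indeed only iterable, hence the `Array.from`.

```javascript
// Split text into user-perceived characters (grapheme clusters) when the
// browser supports Intl.Segmenter, and into code points otherwise.
function lettersOf(text) {
    if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
        const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
        // segment() returns an iterable of { segment, index, input } records.
        return Array.from(segmenter.segment(text), part => part.segment);
    }
    // Fallback: one entry per code point.
    return Array.from(text);
}

const rainbowFlag = '\u{1F3F3}\uFE0F\u200D\u{1F308}';
console.log(rainbowFlag.length);            // 6 code units
console.log(Array.from(rainbowFlag).length); // 4 code points
console.log(lettersOf(rainbowFlag).length);  // 1 with Intl.Segmenter, 4 with the fallback
```

The fallback means the same project could report different lengths in different browsers, which is one of the trade-offs under discussion.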

We have a bunch of text operations:

A separate issue is the lower-level text handling of the morphic UI itself.

Each of these is different when we're talking about text for humans to use than large data files and binary types of data. At least in practice. The primary difference I see is the acceptability of performance vs accuracy changes. And in some cases, such as the actually fun and interesting exercise of learning how emoji work, it's necessary to be able to break a string down.

The good thing today is that we don't really need to invent tools to handle this. Browsers finally have APIs, and in the case of split they even allow us to do things like split by word in a language aware fashion. But, given things like inconsistent browser support, we also have to decide how to handle those trade offs.
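As an illustration of language-aware splitting, here is a sketch that uses `Intl.Segmenter` with `granularity: 'word'` and falls back to whitespace splitting where the API is missing. The helper name `wordsOf` and the fallback behavior are assumptions of this sketch.

```javascript
// Split text into words. With Intl.Segmenter the split follows the word
// rules of the given locale; without it, fall back to whitespace.
function wordsOf(text, locale) {
    if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
        return text.split(/\s+/).filter(word => word.length > 0);
    }
    const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
    return Array.from(segmenter.segment(text))
        // isWordLike is false for spaces and punctuation segments.
        .filter(part => part.isWordLike)
        .map(part => part.segment);
}

console.log(wordsOf('one two', 'en')); // [ 'one', 'two' ]
```

With the Segmenter, punctuation is dropped and languages written without spaces can still be split into words; the whitespace fallback does neither, so results can differ between browsers.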

jguille2 commented 1 year ago

Phew! We opened a lot of topics (we say "a lot of melons"... but I don't know if this makes sense in English) Leaving aside the complicated issues (limits of capacity and operability, JS APIs and browser compatibility, the possibility of creating our own layer to deal with strings...) I point out the conclusions that I see:

And I try to answer some things about that emoji-problem, although they are more reflections and I think they should not disturb the discussion about implementation.

Joan

brianharvey commented 1 year ago

Yes, our job in making everything work will be easier if we expose users to Unicode. But that's true about almost everything. I mean, making graphics work will be easier if we expose users to pixels. (And they can see the pixels if they ask for that specifically!) But instead we give them turtle graphics, and color pickers, and if you're on a Retina display then one turtle step equals two pixels, invisibly. And I say we should think the same way about text: what you see is what you get. Letters. Which may or may not be glyphs; see next paragraph.

An even worse case for us to consider is ligatures. If you're a professional printer, then when you look at "ﬂag" you see (not counting the quotation marks) three characters, the first of which is U+FB02, ligature fl, but if you're a kid, you see four characters, starting with "f" and "l". (To complicate things further, the software displaying this text is supposed to render the two-character sequence "f"+"l" using the ligature, if the font you're using includes that ligature.) In my view, the kid-friendly way to deal with that is always to show users the two-character sequence even if what's in the byte stream is the ligature. So if a Snap! user asks for LETTER 1 OF that three-code-point "ﬂag" we should say "f". (Unicode ligatures are deprecated, so maybe this case will come up only rarely. OTOH they're still on the Macintosh keyboard.)

(What about weird variant characters, such as ſ (long s, the one that's supposed to look like an integral sign but doesn't in this stupid font) or ß (the German "ss")? I lean toward leaving them alone, but you could definitely make a case for replacing them with "s" and "ss" respectively, at least in = testing.)

A separate set of blocks, in a library, should refer to, and operate on, Unicode code points.

Another separate set of blocks, maybe in another library or maybe real primitives, should refer to, and operate on, bytes. Maybe we use two hex digits in a box, or something, to mean "byte."

ToonTalk commented 1 year ago

I just noticed that String localeCompare addresses many of the issues that Brian has brought up. For example

a = "ﬂag"
'ﬂag'
b = "flag"
'flag'
a.length
3
b.length
4
a.localeCompare(b, 'en', { sensitivity: 'base' })
0 // equal

a = 'réservé'; // With accents, lowercase
'réservé'
b = 'RESERVE'; // No accents, uppercase
'RESERVE'
a.localeCompare(b, 'en', { sensitivity: 'base' })
0

Is this what Brian is asking for?

"base": Only strings that differ in base letters compare as unequal. Examples: a ≠ b, a = á, a = A. Other options available.

And also very important is that this takes the locale into account.

I think we should aim towards a solution that takes into account the locale. And regarding Intl.Segmenter(): isn't this polyfill the solution for Firefox (they have been working on Intl.Segmenter() for over 5 years)? https://www.npmjs.com/package/intl-segmenter-polyfill

I wonder if someone who is an expert in Chinese or Japanese characters should also be advising us since I think there are additional issues here.

ToonTalk commented 1 year ago
  • Answering the original issue, we are not making changes (for now) to support longer strings. It seems that there is no good quick and global solution, nor a definitive one (it just widens the range and we collide again with the browsers' limits), and it seems better to take care of the "normal" behavior (with reasonable lengths).

OK. But please put some effort in error handling.
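One hypothetical shape such error handling could take, assuming the failure surfaces as a JavaScript `RangeError` (for example "Invalid array length") rather than as a crashed tab, which a script cannot catch. The name `safeLength` and the returned object are inventions of this sketch.

```javascript
// Try the accurate (but memory-hungry) way of counting letters and fall
// back to the cheap code-unit count if the engine refuses the allocation.
function safeLength(text, countLetters) {
    try {
        return { length: countLetters(text), exact: true };
    } catch (err) {
        if (err instanceof RangeError) {
            // Assumed failure mode: the intermediate array was too large.
            return { length: text.length, exact: false };
        }
        throw err; // anything else is a real bug and should stay visible
    }
}

// Usage with the array-based count discussed in this thread.
const viaArray = text => Array.from(text).length;
console.log(safeLength('a\u{1F60A}b', viaArray)); // { length: 3, exact: true }
```

The `exact` flag lets the caller show the user that the reported number is an approximation instead of failing silently.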

brianharvey commented 1 year ago

Yes, Ken, that looks great for comparison. It doesn't answer my need for a LENGTH OF TEXT in which "flag" with and without ligature both have length 4, and every emoji has length 1. But it's a big step; is there a canonical base version of a string whose length we could measure?
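One candidate for such a base version is Unicode compatibility normalization (NFKC), available as `String.prototype.normalize`. The sketch below is only a partial answer: the helper name `baseLength` is made up, and NFKC handles ligatures but does not collapse emoji sequences into one letter.

```javascript
// Measure the length of the NFKC-normalized form of a text, counted in
// code points. NFKC replaces compatibility characters such as the
// ligature U+FB02 with their plain equivalents.
function baseLength(text) {
    return Array.from(text.normalize('NFKC')).length;
}

console.log(baseLength('\uFB02ag'));      // 4, same as plain "flag"
console.log(baseLength('flag'));          // 4
console.log('\u017F'.normalize('NFKC'));  // 's' (long s becomes s)
console.log('\u00DF'.normalize('NFKC'));  // 'ß' (left alone)
```

NFKC also rewrites other compatibility characters (superscript digits, full-width letters and so on), so applying it when strings are created would change more than ligatures.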

ToonTalk commented 1 year ago

It isn't just length - letter of and split too and probably others. untitled script pic (99) untitled script pic (98)

I did a quick search and all I could find was this solution using regular expressions for a couple dozen ligatures.

Solution at the end of https://codegolf.stackexchange.com/questions/66543/squish-unsquish-ligatures

If, when strings are created, this (together with normalize) is applied it could fix many of these problems.

jmoenig commented 1 year ago

In the interest of a reliable, predictable and efficient fix for Ken's original issue I'll be pushing a "fix" that basically reverts to JS length and then reopen this issue, so we can all play with it and keep discussing the benefits and downsides.

jmoenig commented 1 year ago

now live at dev...