jmoenig / Snap

a visual programming language inspired by Scratch
http://snap.berkeley.edu
GNU Affero General Public License v3.0

Unable to report length of very long text #3183

Open ToonTalk opened 1 year ago

ToonTalk commented 1 year ago

untitled script pic

The too-long text was originally created as the encoding of 56 costumes, but here I just dragged a 224MB file into 8.1.6

cymplecy commented 1 year ago

What would you like it to do instead if the length is too long?

ToonTalk commented 1 year ago

How can a string that is the value of a variable be "too long"?

This works: [script pic]

jmoenig commented 1 year ago

because you're probably not giving it a string, Ken, but a list of strings, in which case Snap! is trying to hyperize it. Is that possible?

ToonTalk commented 1 year ago

I just imported a TXT file

ToonTalk commented 1 year ago

untitled script pic (84)

ToonTalk commented 1 year ago

I can click on the variable and it displays some of its value:

untitled script pic (85)

ToonTalk commented 1 year ago

untitled script pic (86)

jmoenig commented 1 year ago

I don't have such a long text file lying around on my laptop :) I've just tried reproducing this with a bunch of my biggest files, i.e. the sources of Snap!, but they all work fine. Lemme try to find a BIIIIIG file...

jmoenig commented 1 year ago

Ah, now I think I remember: we're not using the length property of the string. Instead we're creating an Array from the string and then taking its length. The reason for this, if memory serves, is to better support non-Western languages with multi-byte characters. That change was introduced by @cycomachead a while ago. Of course, this can bite you when working with "big" data, because it essentially copies the whole big string temporarily, and - lamentably - because modern browsers, especially Chrome, totally lack any reasonable memory management and instead decide to simply crash or fail in strange ways (man, I want my Squeak back!). As a general remedy, I guess working with "big" data requires you to use a database web server at some point.
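For reference, here's the difference in plain JavaScript (a minimal illustration, not Snap!'s actual code):

```js
const text = '😀😀😀';       // three emoji, each stored as a surrogate pair

text.length;                 // 6 -- UTF-16 code units
Array.from(text).length;     // 3 -- Unicode code points, but this copies the
                             //      whole string into a temporary array
```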

jmoenig commented 1 year ago

I really don't know how important that change from @cycomachead is, and whether anybody in Asia actually depends on it. I'd be perfectly fine to go back to just using the String.length property, if it were just for me. Any thoughts?

ToonTalk commented 1 year ago

String.length seems to do the right thing:

```js
"折断".length            // 2
new Blob(["折断"]).size  // 6
```

It takes 6 bytes but has a length of 2.

jmoenig commented 1 year ago

yeah, that's what I thought it would do. Let's wait for @cycomachead to weigh in on this; it was his change, and I'm sure Michael was doing it for some reason...

jguille2 commented 1 year ago

Yes, let's think about more cases to choose the best solution.

One difference I see (maybe it's not important; I only note it): the length of an emoji. Now it is 1; with the direct "toString().length" it would be 2.

I don't think there will be many more cases, because we are "toStringing" before making the Array. So differences will only be on the "strings" side, and not in other types of data.

Joan

jmoenig commented 1 year ago

Ah, you're right, Joan! I think it was this particular use case of making Snap! better for interacting with emojis that prompted Michael to introduce that change! Good point, thank you for reminding us.

ToonTalk commented 1 year ago

The current scheme has another bug: [script pic]

Split works fine

jmoenig commented 1 year ago

right, but wasn't that the idea? I think it was...

ToonTalk commented 1 year ago

I thought the idea was to treat an emoji as a single letter; hence "split by letter" and "length of text" treat it as a single thing. But "letter ... of ..." reveals the underlying implementation of emojis (see the JavaScript sketch below).

And what is going on here? [script pic]

And in the following the text is the same - I just double clicked on the emoji
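(For reference, plain JavaScript string indexing behaves the same way: it returns a single UTF-16 code unit, so indexing into an emoji yields half of a surrogate pair. A minimal illustration:)

```js
const s = '😀';       // U+1F600, stored as a surrogate pair

s.length;             // 2
s[0];                 // '\ud83d' -- the high surrogate, not a displayable letter
Array.from(s)[0];     // '😀'     -- indexing by code point works as expected
```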

cycomachead commented 1 year ago

The length of a string in characters is NOT the number of bytes. JS gets this wrong; C gets this wrong. I mean, I get why. But we do not live in an ASCII-only world.

The use case here is really any UTF-8+ string with multibyte characters. There's definitely a lot of emoji, and I think we do have a number of users who would like to type in their native languages. It's not just the CJK languages but also diacritics and accented characters, even in Western languages.

For emoji this gets quite complicated, since modern emoji are actually sets of multibyte characters joined by a special Unicode "zero width joiner" char that the OS then knows how to display (a sketch below illustrates this).

Some of the issues that you revealed are present deep in morphic, which naturally uses the length and indexing of strings when editing text fields.

In any event, there seems to be a clear limit we are hitting - we should catch this case and report something useful like the raw length.

Ken - what browser and OS are you using?
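(A minimal JavaScript illustration of the ZWJ point - the family emoji below is assumed to render as one glyph:)

```js
const family = '👨\u200D👩\u200D👧';   // man + ZWJ + woman + ZWJ + girl

family.length;                         // 8 -- UTF-16 code units
Array.from(family).length;             // 5 -- code points (incl. the two ZWJs)
[...new Intl.Segmenter().segment(family)].length;  // 1 -- grapheme cluster
```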

ToonTalk commented 1 year ago

I'm using Chrome Version 110.0.5481.104 (Official Build) (64-bit) on Windows 10

According to MDN: "For common scripts like Latin, Cyrillic, well-known CJK characters, etc., this should not be an issue, but if you are working with certain scripts, such as emojis, mathematical symbols, or obscure Chinese characters, you may need to account for the difference between code units and characters."

brianharvey commented 1 year ago

I think we should consistently treat strings as strings of characters, not strings of bytes, even if some of the characters are emoji. We could have a (hidden behind Relabel?) BYTE LENGTH OF block to solve Ken's original problem (a sketch follows below). This will, I guess, hair up LETTER _ OF, but we should do the right thing. Perhaps JavaScript offers the right thing? If not, we have to internally SPLIT the string.

By the way, in the case of error the message "Range error" is wrong; it should be "domain error." (The error is that the domain of UNICODE _ AS LETTER is small integers, not characters.)
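(A byte length is easy to get from the browser. A sketch of what a BYTE LENGTH OF block could report, assuming UTF-8 as the byte encoding:)

```js
// UTF-8 byte count vs. character count
new TextEncoder().encode('café').length;  // 5 -- bytes (é is 2 bytes in UTF-8)
Array.from('café').length;                // 4 -- characters
```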

cymplecy commented 1 year ago

If I could add my two cents worth.

I've run into issues with binary data representation within Snap! - MQTT extension and binary data (media files) using URL reporter

MQTT was sorted by adding an option to treat payload as binary bytes or UTF-16 text strings, so OK with that

But once binary data ends up being "stringified" into UTF-16 unicode - then AFAIK, there's no way of "un-stringifying" back to bytes.

But if it's decoded into bytes, then there's always an option to treat it as UTF-16 (or whatever)

brianharvey commented 1 year ago

I don't think that's right. Given a string of Unicode characters, there's no out-of-channel signalling involved in determining where a character starts or ends; that information is in the bytes themselves. So a string of Unicode characters is just an array of bytes, which the programmer decided to interpret as Unicode rather than as something else. So "un-stringifying" just means squinting your eyes a little at the same byte array.
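(In JavaScript specifically, strings are sequences of 16-bit code units rather than bytes, but the round trip cymplecy wants is still possible whenever each byte was mapped to one code unit at stringify time. A hedged sketch:)

```js
// bytes -> string: one code unit per byte
const bytes = new Uint8Array([0xff, 0x00, 0x41]);
const str = String.fromCharCode(...bytes);

// string -> bytes: recover each code unit
const back = Uint8Array.from(str, ch => ch.charCodeAt(0));
// back is again [255, 0, 65]
```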

ToonTalk commented 1 year ago

> By the way, in the case of error the message "Range error" is wrong; it should be "domain error." (The error is that the domain of UNICODE _ AS LETTER is small integers, not characters.)

Glad that I caught an improper error message - I had meant to use unicode of (which works fine). I guess I was just getting tired...

DarDoro commented 1 year ago

Snap! uses Array.from( text).length. The function below gives exactly the same result without allocating the intermediate array, but is still based on the same @@iterator interface:

function GCLen( text){ // raw length: counts code points, same as Array.from
    let n = 0;
    for( let ch of text){
        n++;
    } 
    return n;
}

A locale-aware alternative is Intl.Segmenter - per MDN, "The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string." (It can/should be instructed to use the given locale.)

function IntlGCLen( text){//locale specific grapheme cluster count, not supported in FF
    let n = 0;
    for( let ch of new Intl.Segmenter().segment( text)){
        n++;
    } 
    return n;
}

Test scripts (JS): [three script pics]
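(A hedged usage comparison of the two helpers above on a decomposed string - the combining accent makes them disagree:)

```js
const s = 'cafe\u0301';   // 'café' written with a combining acute accent

s.length;                 // 5 -- UTF-16 code units
GCLen(s);                 // 5 -- code points (the accent is its own code point)
IntlGCLen(s);             // 4 -- grapheme clusters ('é' counts as one)
```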

jguille2 commented 1 year ago

Hi! Returning to the original issue, here's a way to settle it.

If we finally keep "length of text" supporting emojis using Array.from(str.toString()).length (and so keep that "too long text" issue), then we only have to apply the same code to "letter () of ()".

So we can fix this by changing it to:

Process.prototype.reportBasicLetter = function (idx, string) {
    var str, i;

    // split into code points so emojis etc. count as single letters,
    // matching the current "length of text" behavior
    str = isNil(string) ? '' : Array.from(string.toString());
    if (this.inputOption(idx) === 'any') {
        idx = this.reportBasicRandom(1, str.length);
    }
    if (this.inputOption(idx) === 'last') {
        idx = str.length;
    }
    i = +(idx || 0);
    return str[i - 1] || '';
};

ToonTalk commented 1 year ago

I would encourage folks to run some benchmarks to see what different solutions cost. Treating emojis as single letters is desirable but we should consider the costs.

brianharvey commented 1 year ago

But it isn't just emoji, right? It's basically any non-ASCII character. I think handling Unicode properly is a sine qua non for software these days.

ToonTalk commented 1 year ago

@brianharvey No. JavaScript's length etc. work fine for the first 64K Unicode characters. As MDN states: "For common scripts like Latin, Cyrillic, well-known CJK characters, etc., this should not be an issue, but if you are working with certain scripts, such as emojis, mathematical symbols, or obscure Chinese characters, you may need to account for the difference between code units and characters."

cycomachead commented 1 year ago

It's a shame the Segmenter interface isn't in Firefox. It would actually be the perfect function for split as well.

I think the first thing to do is protect against the actual error, which should just be str.length >= 0xFFFFFFFF (2^32 - 1). It probably then makes sense to just conditionally implement the most correct interface (a sketch follows below).

Tbh, I don't think (better) solving the "count chars as chars" problem fully addresses Ken's issue - when you're presumably just dealing with a chunk of data as data, iterating, and even the current path of creating an array, aren't really ideal for large data. (It actually surprises me a bit that 220MB of text is hitting this limit... I'll need to produce a similar file.)
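(A sketch of that two-step idea - guard first, then pick the most correct available interface. The function name and error wording are hypothetical:)

```js
function safeTextLength(str) {
    // guard: Array.from would fail past the max array length (2^32 - 1)
    if (str.length >= 0xFFFFFFFF) {
        throw new Error('text is too long to measure precisely');
    }
    // prefer locale-aware grapheme counting where it exists (not in Firefox)
    if (typeof Intl !== 'undefined' && Intl.Segmenter) {
        let n = 0;
        for (const _seg of new Intl.Segmenter().segment(str)) { n += 1; }
        return n;
    }
    // fallback: count code points without allocating an intermediate array
    let n = 0;
    for (const _ch of str) { n += 1; }
    return n;
}
```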

ToonTalk commented 1 year ago

As DarDoro posted, you can reproduce this: [script pic]

cycomachead commented 1 year ago

We (especially Jens) spend a ton of time on internationalization. It's important.

Even if it is "just" emoji -- we're building a tool for kids and teenagers. Being able to use emoji is something that I think is very engaging for students. (It also leads some very cool culturally relevant discussions about text, computing, data, and representation.)

I know how to trigger the error in general. It's the real-world use case...

brianharvey commented 1 year ago

@ToonTalk So for Cyrillic etc. the number of characters is already different from the number of bytes? How isn't that already problematic? (And anyway math characters are important!)

ToonTalk commented 1 year ago

length doesn't return the number of bytes. Maybe I should have also quoted this from MDN: "This property [length of a string] returns the number of code units in the string. JavaScript uses UTF-16 encoding, where each Unicode character may be encoded as one or two code units, so it's possible for the value returned by length to not match the actual number of Unicode characters in the string."

cycomachead commented 1 year ago

There are plenty of Latin chars that also cause problems, especially if diacritics get split in two but are displayed as one.

> const drink = 'cafe\u0301';
undefined
> drink
'café'
> drink.length
5

cf. https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/#33-string-length

Also, FWIW, this error message seems, annoyingly, to be a red herring of sorts. The MDN spec makes it clear that the length of the text file I assume Ken is using should fit in an array. But it's slow, and in node/V8 it can actually just overflow the heap with 300MB of text.

I do think there's a difference here in tasks and intent, in practice (though, in reality, proper character counting should work regardless of size...). Students should be able to write and use text that suits them, including math and emoji.

cycomachead commented 1 year ago

Oddly, it seems to be at about the 100,000,000-character mark that the current technique fails. (This doesn't seem super clearly documented, but the stack trace I can pull out of node suggests a deeper issue... oh well.)

I think we can decide if we want to give up precision with, say, > 100MB text, or if we should just implement the GCLen-style method dardoro proposed. (I assume it uses less memory than the current approaches...)

ToonTalk commented 1 year ago

@brianharvey

> (And anyway math characters are important!)

I think nearly all mathematical symbols fit in a single UTF-16 code unit.

E.g.:

```js
"⪊".length   // 1 ("greater-than and not approximate")
```

See https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode

I believe these are the only mathematical symbols that are a problem:

https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Arabic_Mathematical_Alphabetic_Symbols_block

ToonTalk commented 1 year ago

@cycomachead

> const drink = 'cafe\u0301';
undefined
> drink
'café'
> drink.length
5

But this reveals a bigger problem: [script pic]

```js
"café".length         // 4
"cafe\u0301".length   // 5
```

ToonTalk commented 1 year ago

And string normalize should fix the problem with diacritics -- Snap! should probably normalize all strings.
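(A minimal illustration of String.prototype.normalize on the café example:)

```js
'cafe\u0301' === 'café'                   // false -- different code points
'cafe\u0301'.normalize('NFC') === 'café'  // true  -- composed (NFC) form
'cafe\u0301'.normalize('NFC').length      // 4
```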

ToonTalk commented 1 year ago

I see that normalize doesn't cover all cases, but does it cover almost all the cases that occur normally?

brianharvey commented 1 year ago

I don't think we should invisibly change string length methods when one fails, because then string length will be non-monotonic, and users won't know what "length of text" is telling them. Always doing the wrong thing would be better than sometimes doing the wrong thing, although of course always doing the right thing is best of all. Having a byte length operator would let users control what they see, as well as being the right thing for binary data.

I admit, I personally don't use Arabic math characters, and maybe we could get away with not handling them properly, although perhaps some of our users are Arabs.

As for equality testing, I think = should be very very forgiving, e.g., é should be equal to e and to 𝑒. (And therefore café should equal cafe.) Whereas IDENTICAL TO should mean that the bits are exactly the same. Does Unicode help us do the = that I want? E.g., in the code tables is there a way for é to say "I am a variant form of e"?
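(Unicode's compatibility decompositions get most of the way there. A hedged sketch of the forgiving = Brian describes - the function name is hypothetical, and real-world equivalence is messier than this:)

```js
function forgivingEqual(a, b) {
    // NFKD maps compatibility variants (e.g. the math alphabets) to base
    // forms; stripping combining marks then equates 'é' with 'e'
    const strip = s =>
        s.normalize('NFKD').replace(/\p{M}/gu, '').toLowerCase();
    return strip(a) === strip(b);
}

forgivingEqual('café', 'cafe');   // true
forgivingEqual('𝑒', 'e');         // true (math italic e -> e)
```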

ToonTalk commented 1 year ago

Turns out there are problematic mathematical symbols other than the Arabic ones, e.g. MATHEMATICAL BOLD CAPITAL OMEGA:

```js
'\u{1D6C0}'.length   // 2
'\u{1D6C0}'          // '𝛀'
```

Still, all the ones listed on the Wikipedia page except the Arabic ones should be fine.

ToonTalk commented 1 year ago

Here is another place where diacritics cause problems: [two script pics]

ToonTalk commented 1 year ago

And: [two script pics]

ToonTalk commented 1 year ago

And emojis have this problem too: [script pic]

ToonTalk commented 1 year ago

Sorry, I couldn't resist: [script pic]

https://en.m.wikipedia.org/wiki/Basmala

cycomachead commented 1 year ago

You're missing the point by finding cases that do work correctly. Obviously most cases here work correctly. It's why "always doing the wrong thing" is not really a thing. But I don't think the argument here should be that because some of this doesn't work super well today, we should just nuke the idea.

By that logic, 100MB of data ALSO does not work super well with Snap! It's trivial to crash a browser process and lose work, and you can't even use the cloud. The process with even moderate amounts of data is, quite honestly, poor. Very few students and courses do it. But I'm not going to argue we should just block giant data files.

Unicode "look-alikes" are a huge problem in so many places. I do think it's fair for the equal block to normalize strings. We already lowercase strings, so using normalize shouldn't be a huge problem, but we also need to understand some of the cases where it might fail in worse ways, though what I read long ago suggests this should be safe.

I'm with you that this is a case we should support. But reducing support for higher-order Unicode seems like the most wrong path to take.

jguille2 commented 1 year ago

Hi! Sorry... but I think we don't have to make it more complicated.

We have a first discussion with three options: keep the current behavior (counting code points via Arrays), revert to plain JS String.length, or adopt a locale-aware method like Intl.Segmenter.

But let's not think about all the other examples, since whatever the decision is, it is clear that we can fix all the other cases. For example, if we keep the current behavior (with Arrays), then we will fix all the other blocks ("letter of", "position of"...) that are not using arrays now. But these cases are not extra examples for the original discussion (they are all the same problem!)

Joan

ToonTalk commented 1 year ago

If we leave the issue with large data as is (and fix "letter of", "position of", etc.), then perhaps we need better error handling of large data. I only lost an hour of work (and a few dollars of API calls) before I understood what the problem was and changed my project so that the 200MB string is now in over 50 pieces (without making the code harder to understand). Maybe an error should be thrown with a nice explanation that Snap! can't handle strings longer than X.

Though I do feel like the efforts with hyperblocks and the speedy processing of images and sounds give the impression that Snap! performance is important, and I do wonder what the cost is for projects that process lots of text if they have to use implementations of string primitives that are X times slower than just using JavaScript strings. My view is that it depends upon what X is. If the cost is low enough and large data triggers nice error messages, I'll be quiet.

jguille2 commented 1 year ago

Thanks Ken. I don't want to minimize the problem... just to avoid piling up examples that take us away from the real problem.

I only want to point out something that makes me vote to keep the current behavior with Arrays (and then fix the other cases we've discussed): changing to JS string.length... does not fix the "large data" problem. It just expands the tolerance, and how much depends on the browser. I tested this example on Chromium: [script pics]

brianharvey commented 1 year ago

What I meant by "always do the wrong thing" is to report the byte count of strings rather than the character count. This allows projects to handle longer strings without blowing up, but always gives the wrong answer. (Even more always than I thought, if JS represents all characters in 16-bit chunks!) I wasn't making any claims about other issues such as equality testing, just that always giving the wrong answer is better than only sometimes doing so.

I'm not sure I really believe that, in retrospect. Users write code like FOR I=1 TO (LENGTH OF TEXT (FOO))... and think they're counting characters, and that should and currently does work fine for non-enormous texts.

So I would just give a meaningful error message for LENGTH OF TEXT of huge texts, and also provide a BYTE LENGTH OF that might even work for some non-text data types such as costumes (reporting the byte length of the bitmap) but I wouldn't insist on that. I repeat that that's way better than having LENGTH OF TEXT be non-monotonic.

About equality testing, Unicode normalization is imho the tip of the iceberg. For example, it doesn't make a non-breaking space equal to a space. All the weird spaces (en space, thin space, etc.) should "normalize" to space, imho. The various-font math alphabets should equal regular letters. The various cases of large-something and small-something glyphs should be equal. I don't know how hard I want to push on that, since Ken's quite right that a proper equality tester would be super-slow and/or require a table of 2^16 equivalence classes of characters. OTOH people doing big-data things can and probably should use IS IDENTICAL TO anyway, regardless of speed.