tc39 / proposal-reversible-string-split

String.prototype.splitn for ECMAScript
MIT License
82 stars 4 forks source link

will having two split methods cause confusion? #6

Open michaelficarra opened 2 years ago

michaelficarra commented 2 years ago

I fear we're just going to create a "splice" scenario where most people are going to just have to look at the documentation every time to figure out what it does now that there's more than one option.

lucacasonato commented 2 years ago

Most uses of split I have found are for unbounded splits. For those, only String.prototype.split should be used as is being done now. The proposed String.prototype.splitn does not support skipping the n parameter, which thus enforces the use of split for unbounded splits.

As for confusion: maybe, but I don't anticipate much as the names are different. String.prototype.splitn should become the primary way to do n-count splitting.

michaelficarra commented 2 years ago

As for confusion: maybe, but I don't anticipate much as the names are different.

I don't think the names being different really matters here. There will be 2 string splitting methods with pretty similar semantics. I fear people (including myself) won't be able to remember which method has which semantics.

String.prototype.splitn should become the primary way to do n-count splitting.

Why is that? I used the second parameter to String.prototype.split just a few days ago, and it worked fine. I was doing some simple parsing of a Unicode data file that used # as a line comment like this:

  propertyAliasesFile
    .split('\n')
    .map(line => line.split('#', 1)[0].trim())
    .filter(line => line.length > 0)

Was this bad for some reason?

lucacasonato commented 2 years ago

That seems reasonable! I haven't included that in the explainer, but I did a Sourcegraph search for people using the n argument in String.prototype.split the other day, and I found that many people call x.split(sep).slice(0, n) rather than x.split(sep, n). It seems many people are not aware of this feature currently.

For that example specifically, now if you want to instead extract all of the comments rather than the non comments, using splitn would make it very easy:

  propertyAliasesFile
    .split('\n')
    .map(line => line.splitn('#', 2)[1])
    .filter(line => line && line.length > 0)

Doing this with current split is still possible, but is much less efficient. An extra intermediate array is created, the original string needs to be iterated over for longer to find split points, and many more small string allocations are needed for the intermediate array.

  propertyAliasesFile
    .split('\n')
    .map(line => line.split('#').slice(1).join('#'))
    .filter(line => line.length > 0)
michaelficarra commented 2 years ago

Ah, that's actually helpful for me to better understand the motivation for this proposal. Still, I'm unsure whether the usefulness outweighs the risk of added confusion.

tabatkins commented 2 years ago

I think going for the {remainder: true} option would bypass this possible confusion; .split(..., 2, ...) would always work the same (return the first two split fragments), but might return a third item with the remainder if the flag is passed (effectively giving us Python's semantics; if we're going to still diverge from other langs, sticking with Python is better than most other options).

ljharb commented 2 years ago

(fwiw, i don't like splitn being all lowercase, and would prefer splitN if that's the direction it's heading, a la reduceRight and forEach and many others)

tabatkins commented 2 years ago

I think going for the {remainder: true} option would bypass this possible confusion

Another benefit of this is that it bypasses the questions about regex-based splitting. The behavior remains "do exactly as you normally would", the flag just controls whether the remainder gets added as a final item to the array or thrown out.

leobalter commented 2 years ago

At the matrix channel I suggested adopting a different name.

String.prototype.part or parts could be more helpful to differentiate them.

Another name I'm adding just for the bikeshed sake would be explode, being slightly connected to precedents from other languages.

I prefer a new name over expanding the split method.

bathos commented 2 years ago

I opened an issue about interpretation of the count arg and though I think the Python behavior is best regardless, a substantially different name like explode would mitigate some of the concerns there.

hax commented 2 years ago

I fear we're just going to create a "splice" scenario where most people are going to just have to look at the documentation every time to figure out what it does now that there's more than one option.

@michaelficarra I don't think it's "splice" case.

The "splice" is bad because:

  1. the name "splice" is very confused

Not sure about native speakers, but as who have bad English skill, I really don't know what "splice" meaning in English, I never use this word except writing code, so the name "splice" is just opaque to me.

  1. the things "splice" actually do is really complex

I guess it's hard to have a good name for what "splice" really do, because the it do a complex work and unfortunately it doesn't follow slice, and also mix push.

Off-topic

Instead of a.splice(start, count, ...items) , it would be much better if we have

a.replaceItems(start, end, items) which start/end follow slice, and items could just an another array.

Or even have slice notation syntax for that: a[start:end] = items which will easy to understand and use.

/Off-topic


splitN(n, sep) never have the similar issues of splice:

  1. the name is very clear what it tend to do, the only thing is what n mean. But as the things it tend to do, programmers always need to learn what n mean, and as the author of this proposal already tell us, almost all programming languages have the consensus of what n mean. So we just follow it and it could be the programmer's cross-language knowledge (just like people have to learn what NaN mean, and it's a cross-language knowledge).
  2. What splitN() really do is not complex, I believe after use it one or two time, people could get it. Some may learn it fast, some may slow, but I am confident that people will never struggle like splice case.
hax commented 2 years ago

I don't think the names being different really matters here. There will be 2 string splitting methods with pretty similar semantics. I fear people (including myself) won't be able to remember which method has which semantics.

@michaelficarra

Actually I see it in another way. Similar names should have similar semantics.

So after we have splitN, it's clear to programmers, there are two methods:

I know it could always bring confusion if someone dig it and see split(sep) is actually split(sep,n=Number.MAX_SAFE_INTEGER)

But it's more or less a documentation issue. The old broken split(sep, n) could be deprecated , I mean we can't deprecated it in spec, but it could be mark as deprecated in MDN, manual, books and tools.

hax commented 2 years ago

It seems many people are not aware of this feature currently.

@lucacasonato Yeah, as I said in the meeting , last week we in JSCIG meeting, we discussed this proposal, and we surprisingly found half of the attendances don't know split have a second arg. It is surprisingly because the most attendances are not average programmers but experienced js programmers. And there is one who is a famous chinese js book writer and he read the different versions of spec many times when writing the book but still forget it. And the rest, even aware it, most never use it and no one know the exact behavior of it except I (because I have been caught by it several times). Even I, don't fully understand the weird behavior when capture group involved.

So I don't worry confusion of two methods too much, actually people won't confused with something they might never know. 😂

hax commented 2 years ago

Another benefit of this is that it bypasses the questions about regex-based splitting.

@tabatkins What "bypass" mean?

You know at first I also feel split(sep, n, {reminder}) might a good option, but after I found capture group weirdness in it, I think it just worse.

let s = 'axbxcxd'
let re = /(x)/ 
s.split(re) // [ 'a', 'x', 'b', 'x', 'c', 'x', 'd' ]
s.split(re, 3) // [ 'a', 'x', 'b' ]  // oh !
s.split(re, 3, {reminder:true}) // what it should returns?

The "consistent" but weird behavior might be return [ 'a', 'x', 'b', 'xcxd'].

I'm curious how engine could easily return that result without post process on the result returned by underlying regexp engine.

Note python re.split("(x)", "axbxcxd", 3) give u ['a', 'x', 'b', 'x', 'c', 'x', 'd'] . So we are still diverge to any other programming languages, include python, and in a much subtle way, which I really dislike.

hax commented 2 years ago

At the matrix channel I suggested adopting a different name. I prefer a new name over expanding the split method.

@leobalter Though at first glance I also felt new name could avoid confusion , but after second thought, I have to say it might not very helpful like we thought.

The point is, if it actually do very similar thing , people eventually would ask why there are two? Which one should I use?

Essentially the confusion is not about the name, but about the usage of two similar apis. The name similarity is just make it easy to expose.

Make the new named api have much different semantic like only support string might helpful (so people are ok with two api because they have different ability), might not (because people may expect the new api also accept regexp, so in the end we need to add similar type check to throw on regexp-like arg --- just like startsWith/endsWith).

There is also another point, api discoverability. As my previous comment, people actually unaware split(sep, n) , so api name like explode won't be better. I also wonder how parts/part could be better (is there any precedent in other languages?).

At the end, for such fundamental api (it's not WeakRef, Atomics, etc.!), I don't believe we should decrease the confusion by decrease the discoverability.

bathos commented 2 years ago
(Very off topic aside for @hax cause I thought the mention of splice was interesting) > Not sure about native speakers, but as who have bad English skill, I really don't know what "splice" meaning in English, I never use this word except writing code, so the name "splice" is just opaque to me. I hadn’t considered this before, but I have a feeling this has to do less with one’s English and more to do with how certain physical media technologies have drifted out of the popular consciousness over time. It’s likely that this word has become less familiar among native speakers, too. The word describes (with an uncanny degree of precision) the exact things that the method does: splicing is cutting, inserting, removing, and rejoining segments of something that together form a line. Which seems like a weirdly specific word for most people to know to begin with, but in the 20th century, splicing was very important in at least two physical media technologies. You _spliced tape_ and _spliced film_ when creating and editing music and movies. (The word is older than that, but before film and tape, you’d probably have heard it mainly just in the context of _splicing rope._) (If you were splicing tape, you’d make cuts in the tape in order to extract, insert, or replace a sub-portion. The harder part was rejoining the segments, i.e. without the part you removed and/or with the part you inserted, whichever are applicable. In older music, if the splicing job was sloppy, artifacts of the physical splicing would be audible in the music). Given both tape and film have rapidly became much less common and less visible in the last few decades, the word has probably become much more obscure, too. It’s not quite as frozen in time as “radio button” because splice is still a “generic” word, but the programming use of “splice” seems at least partly informed by an analogy with how the word was used in the context of splicing physical media. So at one time, it would have seemed like a very natural/obvious term, but you’re probably right that this is no longer true.
nicolo-ribaudo commented 2 years ago

I like using a { reminder: true } options bag, because it allows potentially introducing more options in the future. For example, an option could make n mean "don't count strings produced by capturing groups in the limit", or another option could be "don't include capturing groups in the result array".

leobalter commented 2 years ago

I believe having this options bag adds too much complexity for this one method that was supposed to do this one thing, limited by another parameter.

The option will open an object that might have a property that says: do this other thing, or that original thing but different, and now I'm slightly confused what that number parameter does. This fails the logic of having index and indexRight.

I also find splitn not amazing as splitN feels even worse and I want to have a more meaning differentiation. Comparing split to splitn feels like substr vs substring. I'd rather have split and part so I can compare them such as slice and splice.

I think it's good to bikeshed some synonyms such as cut, divide, break, splinter, lop. Maybe splinter even gets the advantage of a closer code completion.

ljharb commented 2 years ago

I wouldn't expect a third parameter - i'd expect that the signature would be:

where options would subsume the current integer argument, with a clear name, and offer an alternative/mutually exclusive option name for the desired behavior.

tabatkins commented 2 years ago

Another benefit of this is that it bypasses the questions about regex-based splitting.

@tabatkins What "bypass" mean?

If we reuse the existing function exactly as it is, with the flag just including the remainder, my intention is that this allows us to ignore the regex questions, because the answer would be "just act as normal".

However, I have since learned that the interaction between a regex splitter and the N argument is worthless, so it doesn't really help us:

"a|b|c|d".split(/\|/, 2);
// returns ["a", "b"]
"a|b|c|d".split(/(\|)/, 2);
// returns ["a", "|"]

That is, the N argument does not mean "perform N splits", it is instead literally identical to just calling split(sep).slice(0, N). So the idea of "just run the function as normal, and append the remainder if the flag is set" doesn't work very well. :/

So yeah, going instead with either a new function, or taking the options arg as the second arg like @ljharb suggests, would be better, so we can give better behavior immediately.

leobalter commented 2 years ago

I wouldn't expect a third parameter - i'd expect that the signature would be:

  • split(strOrRegex)
  • split(strOrRegex, integer)
  • split(strOrRegex, options)

where options would subsume the current integer argument, with a clear name, and offer an alternative/mutually exclusive option name for the desired behavior.

Today we can have things such as split(strOrRegex, new Number(x)) and I'm afraid this prevents us from having an options object in that position.

ljharb commented 2 years ago

@leobalter that's true only if there's web code that relies on that, which i doubt.

But also, we could still do a Get for the option names, and if no option names are found, we could ToNumber the object, which would still support that usage.

tabatkins commented 2 years ago

Yeah I highly doubt we actually have live code depending on toValue() of the second arg, but Jordan's suggestion would let us still apply it if it proves necessary.

leobalter commented 2 years ago

Can you help me understand how the options would be like? The way I see it feels confusing.

"a|b|c|d".split("|", 2); // ["a", "b"]
"a|b|c|d".split("|", { parts: 2, remainder: true }); // ["a", "b|c|d"]
"a|b|c|d".split("|", { parts: 2 }); // ["a", "b"] or ["a", "b|c|d"]

I want the remainder option to be optional if anything, but the parts/limit option remains ambiguous. I'd like the options object to have actual options for any end.

Requiring one to write the parts/limit option and the remainder: true feels a lot. I still favor a new method over that. Of course, that's now a preference.

ljharb commented 2 years ago

@leobalter i hadn't really thought about it much. Perhaps something like this:

leobalter commented 2 years ago

I still have a strong preference for two methods with different responsibilities.

Rather tan:

I prefer the single responsibility for each method.

hax commented 2 years ago

split(sep, options) might better than split(sep, n, options), because it won't have the issue of changing the semantic of second param by third param.

But it's still more or less confused that split(sep, {count: n}) would do different thing with split(sep, n).

And split(sep, { mode, count }) also have the same problem that use one option change the semantic of the other. 😂


It seems we should have an option bag which includes the old non-option bag behavior, this is what addEventListener(type, fn, true) -> addEventListener(type, fn, {capture:true}) behave.

If we include it, then split(sep, { count }) will be expected to use old behavior. So that means u need to write lengthy split(sep, {count: n, reminder: true, mode: 'number of splits'}) to get the desired behavior.

We could make reminder:true as default, and we could make mode: 'number of splits' as default.

Actually it seems much simpler to only have a one boolean option "legacyMode"

So we could use split(sep, {count: n}) to get the desired behavior. and split(sep, n) is just shorthand of split(sep, {count: n, legacyMode:true}).

The only concern is, we normally use false as the default value , not sure how true as default will confuse people.


But we still have the problem:

  1. ergonomics

split(sep, {count: 3}) is not as ergonomics as splitN(3, sep), u need extra 8 characters.

We could make the name shorten to single letter n, though it seems the language normally don't encourage single letter option name, no precedent I can recall.

Even the shortest form split(sep, {n: 3}) still need extra 3 chars, consider {:} all need press shift, it actually 6 key stroke.

  1. API discoverability

Previously I thought the reason people not use s.split(sep, n)is just because its behavior are broken/useless, but now I tend to believe by nature it's hard to discover the api.

Though it's not the original motivation of this proposal, I think it should be the goal of the proposal.


With these two reasons, I still prefer splitN solution.

leobalter commented 2 years ago

For better disclosure, I strongly oppose to sentence-based/composed option values. 'number of splits' works well only on English centric environments and even societies.

Coming from an ESL background, I know people see the language words as not other than keywords understanding the logic behind but without the actual meaningful dictionary information. I'm happy to tell more about this to anyone interested.

hax commented 2 years ago

'number of splits' works well only on English centric environments and even societies.

I suppose 'number of splits' is just a placeholder for a good option name. Though in this specific case, I don't know how we could have a better name. But as my previous comment, maybe we don't need that, legacyMode: boolean may be a better choice, or in a new method, we don't need options at all.

As a non-native speaker, I should agree that "sentence-based/composed" are bad, but I want to say another (might be more) important point is we should use simple english words.

Offtopic

Here is a repo https://github.com/mahavivo/english-wordlists which contains different level english word lists in Chinese education system, I use it to check how hard a word is for non-native speakers.

For example, "splice" is not in word list of CET4 or CET6 or even TEM8 (which is for graduate student major in English, and contains 13000 words). And as word usage frequency from COCA (Corpus of Contemporary American English) , "splice" is at 20171 😂 . So it's reflect the @bathos comment, even native speakers do not use that word much.

/Offtopic

ljharb commented 2 years ago

Yes, i intentionally used a sentence as a placeholder, because obviously that wouldn’t fly.