Open michaelficarra opened 2 years ago
Most uses of split
I have found are for unbounded splits. For those, only String.prototype.split
should be used as is being done now. The proposed String.prototype.splitn
does not support skipping the n
parameter, which thus enforces the use of split
for unbounded splits.
As for confusion: maybe, but I don't anticipate much as the names are different. String.prototype.splitn
should become the primary way to do n-count splitting.
As for confusion: maybe, but I don't anticipate much as the names are different.
I don't think the names being different really matters here. There will be 2 string splitting methods with pretty similar semantics. I fear people (including myself) won't be able to remember which method has which semantics.
String.prototype.splitn
should become the primary way to do n-count splitting.
Why is that? I used the second parameter to String.prototype.split
just a few days ago, and it worked fine. I was doing some simple parsing of a Unicode data file that used #
as a line comment like this:
propertyAliasesFile
.split('\n')
.map(line => line.split('#', 1)[0].trim())
.filter(line => line.length > 0)
Was this bad for some reason?
That seems reasonable! I haven't included that in the explainer, but I did a Sourcegraph search for people using the n
argument in String.prototype.split
the other day, and I found that many people call x.split(sep).slice(0, n)
rather than x.split(sep, n)
. It seems many people are not aware of this feature currently.
For that example specifically, now if you want to instead extract all of the comments rather than the non comments, using splitn
would make it very easy:
propertyAliasesFile
.split('\n')
.map(line => line.splitn('#', 2)[1])
.filter(line => line && line.length > 0)
Doing this with current split is still possible, but is much less efficient. An extra intermediate array is created, the original string needs to be iterated over for longer to find split points, and many more small string allocations are needed for the intermediate array.
propertyAliasesFile
.split('\n')
.map(line => line.split('#').slice(1).join('#'))
.filter(line => line.length > 0)
Ah, that's actually helpful for me to better understand the motivation for this proposal. Still, I'm unsure whether the usefulness outweighs the risk of added confusion.
I think going for the {remainder: true}
option would bypass this possible confusion; .split(..., 2, ...)
would always work the same (return the first two split fragments), but might return a third item with the remainder if the flag is passed (effectively giving us Python's semantics; if we're going to still diverge from other langs, sticking with Python is better than most other options).
(fwiw, i don't like splitn
being all lowercase, and would prefer splitN
if that's the direction it's heading, a la reduceRight and forEach and many others)
I think going for the {remainder: true} option would bypass this possible confusion
Another benefit of this is that it bypasses the questions about regex-based splitting. The behavior remains "do exactly as you normally would", the flag just controls whether the remainder gets added as a final item to the array or thrown out.
At the matrix channel I suggested adopting a different name.
String.prototype.part
or parts
could be more helpful to differentiate them.
Another name I'm adding just for the bikeshed sake would be explode
, being slightly connected to precedents from other languages.
I prefer a new name over expanding the split
method.
I opened an issue about interpretation of the count arg and though I think the Python behavior is best regardless, a substantially different name like explode
would mitigate some of the concerns there.
I fear we're just going to create a "splice" scenario where most people are going to just have to look at the documentation every time to figure out what it does now that there's more than one option.
@michaelficarra I don't think it's "splice" case.
The "splice" is bad because:
Not sure about native speakers, but as who have bad English skill, I really don't know what "splice" meaning in English, I never use this word except writing code, so the name "splice" is just opaque to me.
I guess it's hard to have a good name for what "splice" really do, because the it do a complex work and unfortunately it doesn't follow slice
, and also mix push
.
Off-topic
Instead of a.splice(start, count, ...items)
, it would be much better if we have
a.replaceItems(start, end, items)
which start/end
follow slice, and items
could just an another array.
Or even have slice notation syntax for that: a[start:end] = items
which will easy to understand and use.
/Off-topic
splitN(n, sep)
never have the similar issues of splice:
n
mean. But as the things it tend to do, programmers always need to learn what n
mean, and as the author of this proposal already tell us, almost all programming languages have the consensus of what n
mean. So we just follow it and it could be the programmer's cross-language knowledge (just like people have to learn what NaN mean, and it's a cross-language knowledge).splitN()
really do is not complex, I believe after use it one or two time, people could get it. Some may learn it fast, some may slow, but I am confident that people will never struggle like splice case.I don't think the names being different really matters here. There will be 2 string splitting methods with pretty similar semantics. I fear people (including myself) won't be able to remember which method has which semantics.
@michaelficarra
Actually I see it in another way. Similar names should have similar semantics.
So after we have splitN
, it's clear to programmers, there are two methods:
split(sep)
simple split splitN(n, sep)
split into n piecesI know it could always bring confusion if someone dig it and see split(sep)
is actually split(sep,n=Number.MAX_SAFE_INTEGER)
split(sep, n=Number.MAX_SAFE_INTEGER)
splitN(n, sep)
But it's more or less a documentation issue. The old broken split(sep, n)
could be deprecated , I mean we can't deprecated it in spec, but it could be mark as deprecated in MDN, manual, books and tools.
It seems many people are not aware of this feature currently.
@lucacasonato Yeah, as I said in the meeting , last week we in JSCIG meeting, we discussed this proposal, and we surprisingly found half of the attendances don't know split
have a second arg. It is surprisingly because the most attendances are not average programmers but experienced js programmers. And there is one who is a famous chinese js book writer and he read the different versions of spec many times when writing the book but still forget it. And the rest, even aware it, most never use it and no one know the exact behavior of it except I (because I have been caught by it several times). Even I, don't fully understand the weird behavior when capture group involved.
So I don't worry confusion of two methods too much, actually people won't confused with something they might never know. đ
Another benefit of this is that it bypasses the questions about regex-based splitting.
@tabatkins What "bypass" mean?
You know at first I also feel split(sep, n, {reminder})
might a good option, but after I found capture group weirdness in it, I think it just worse.
let s = 'axbxcxd'
let re = /(x)/
s.split(re) // [ 'a', 'x', 'b', 'x', 'c', 'x', 'd' ]
s.split(re, 3) // [ 'a', 'x', 'b' ] // oh !
s.split(re, 3, {reminder:true}) // what it should returns?
The "consistent" but weird behavior might be return [ 'a', 'x', 'b', 'xcxd']
.
I'm curious how engine could easily return that result without post process on the result returned by underlying regexp engine.
Note python re.split("(x)", "axbxcxd", 3)
give u ['a', 'x', 'b', 'x', 'c', 'x', 'd']
. So we are still diverge to any other programming languages, include python, and in a much subtle way, which I really dislike.
At the matrix channel I suggested adopting a different name. I prefer a new name over expanding the split method.
@leobalter Though at first glance I also felt new name could avoid confusion , but after second thought, I have to say it might not very helpful like we thought.
The point is, if it actually do very similar thing , people eventually would ask why there are two? Which one should I use?
Essentially the confusion is not about the name, but about the usage of two similar apis. The name similarity is just make it easy to expose.
Make the new named api have much different semantic like only support string might helpful (so people are ok with two api because they have different ability), might not (because people may expect the new api also accept regexp, so in the end we need to add similar type check to throw on regexp-like arg --- just like startsWith/endsWith).
There is also another point, api discoverability. As my previous comment, people actually unaware split(sep, n)
, so api name like explode
won't be better. I also wonder how parts/part
could be better (is there any precedent in other languages?).
At the end, for such fundamental api (it's not WeakRef, Atomics, etc.!), I don't believe we should decrease the confusion by decrease the discoverability.
I like using a { reminder: true }
options bag, because it allows potentially introducing more options in the future. For example, an option could make n
mean "don't count strings produced by capturing groups in the limit", or another option could be "don't include capturing groups in the result array".
I believe having this options bag adds too much complexity for this one method that was supposed to do this one thing, limited by another parameter.
The option will open an object that might have a property that says: do this other thing, or that original thing but different, and now I'm slightly confused what that number parameter does. This fails the logic of having index
and indexRight
.
I also find splitn
not amazing as splitN
feels even worse and I want to have a more meaning differentiation. Comparing split
to splitn
feels like substr
vs substring
. I'd rather have split
and part
so I can compare them such as slice
and splice
.
I think it's good to bikeshed some synonyms such as cut
, divide
, break
, splinter
, lop
. Maybe splinter
even gets the advantage of a closer code completion.
I wouldn't expect a third parameter - i'd expect that the signature would be:
where options
would subsume the current integer argument, with a clear name, and offer an alternative/mutually exclusive option name for the desired behavior.
Another benefit of this is that it bypasses the questions about regex-based splitting.
@tabatkins What "bypass" mean?
If we reuse the existing function exactly as it is, with the flag just including the remainder, my intention is that this allows us to ignore the regex questions, because the answer would be "just act as normal".
However, I have since learned that the interaction between a regex splitter and the N argument is worthless, so it doesn't really help us:
"a|b|c|d".split(/\|/, 2);
// returns ["a", "b"]
"a|b|c|d".split(/(\|)/, 2);
// returns ["a", "|"]
That is, the N argument does not mean "perform N splits", it is instead literally identical to just calling split(sep).slice(0, N)
. So the idea of "just run the function as normal, and append the remainder if the flag is set" doesn't work very well. :/
So yeah, going instead with either a new function, or taking the options arg as the second arg like @ljharb suggests, would be better, so we can give better behavior immediately.
I wouldn't expect a third parameter - i'd expect that the signature would be:
- split(strOrRegex)
- split(strOrRegex, integer)
- split(strOrRegex, options)
where
options
would subsume the current integer argument, with a clear name, and offer an alternative/mutually exclusive option name for the desired behavior.
Today we can have things such as split(strOrRegex, new Number(x))
and I'm afraid this prevents us from having an options object in that position.
@leobalter that's true only if there's web code that relies on that, which i doubt.
But also, we could still do a Get for the option names, and if no option names are found, we could ToNumber the object, which would still support that usage.
Yeah I highly doubt we actually have live code depending on toValue() of the second arg, but Jordan's suggestion would let us still apply it if it proves necessary.
Can you help me understand how the options would be like? The way I see it feels confusing.
"a|b|c|d".split("|", 2); // ["a", "b"]
"a|b|c|d".split("|", { parts: 2, remainder: true }); // ["a", "b|c|d"]
"a|b|c|d".split("|", { parts: 2 }); // ["a", "b"] or ["a", "b|c|d"]
I want the remainder option to be optional if anything, but the parts/limit option remains ambiguous. I'd like the options object to have actual options for any end.
Requiring one to write the parts/limit option and the remainder: true
feels a lot. I still favor a new method over that. Of course, that's now a preference.
@leobalter i hadn't really thought about it much. Perhaps something like this:
split(sep, n)
equivalent to split(sep, { mode: 'returned parts', count: n })
split(sep, { mode: 'number of splits', count: n })
equivalent to splitn(sep, n)
etc?
I still have a strong preference for two methods with different responsibilities.
split(x, y)
takes a string value and split it by all occurrences matching x
and returns all the parts limited to Number(y
), which defaults to 2**32-1
, in a new Array.cut|explode|break|divide(x, y)
takes a string value and split it at a limit to y
occurrences of x
, returning all the parts in a new Array.Rather tan:
split(x , y)
takes a string value and
y
is an object with the matching properties mode
and/or count
then,mode
is "something", then ...,x
in a new array limited by Number(y
).I prefer the single responsibility for each method.
split(sep, options)
might better than split(sep, n, options)
, because it won't have the issue of changing the semantic of second param by third param.
But it's still more or less confused that split(sep, {count: n})
would do different thing with split(sep, n)
.
And split(sep, { mode, count })
also have the same problem that use one option change the semantic of the other. đ
It seems we should have an option bag which includes the old non-option bag behavior, this is what addEventListener(type, fn, true)
-> addEventListener(type, fn, {capture:true})
behave.
If we include it, then split(sep, { count })
will be expected to use old behavior. So that means u need to write lengthy split(sep, {count: n, reminder: true, mode: 'number of splits'})
to get the desired behavior.
We could make reminder:true
as default, and we could make mode: 'number of splits'
as default.
Actually it seems much simpler to only have a one boolean option "legacyMode"
So we could use split(sep, {count: n})
to get the desired behavior. and split(sep, n)
is just shorthand of split(sep, {count: n, legacyMode:true})
.
The only concern is, we normally use false
as the default value , not sure how true
as default will confuse people.
But we still have the problem:
split(sep, {count: 3})
is not as ergonomics as splitN(3, sep)
, u need extra 8 characters.
We could make the name shorten to single letter n
, though it seems the language normally don't encourage single letter option name, no precedent I can recall.
Even the shortest form split(sep, {n: 3})
still need extra 3 chars, consider {:}
all need press shift, it actually 6 key stroke.
Previously I thought the reason people not use s.split(sep, n)
is just because its behavior are broken/useless, but now I tend to believe by nature it's hard to discover the api.
Though it's not the original motivation of this proposal, I think it should be the goal of the proposal.
With these two reasons, I still prefer splitN
solution.
For better disclosure, I strongly oppose to sentence-based/composed option values. 'number of splits'
works well only on English centric environments and even societies.
Coming from an ESL background, I know people see the language words as not other than keywords understanding the logic behind but without the actual meaningful dictionary information. I'm happy to tell more about this to anyone interested.
'number of splits' works well only on English centric environments and even societies.
I suppose 'number of splits'
is just a placeholder for a good option name. Though in this specific case, I don't know how we could have a better name. But as my previous comment, maybe we don't need that, legacyMode: boolean
may be a better choice, or in a new method, we don't need options at all.
As a non-native speaker, I should agree that "sentence-based/composed" are bad, but I want to say another (might be more) important point is we should use simple english words.
Offtopic
Here is a repo https://github.com/mahavivo/english-wordlists which contains different level english word lists in Chinese education system, I use it to check how hard a word is for non-native speakers.
For example, "splice" is not in word list of CET4 or CET6 or even TEM8 (which is for graduate student major in English, and contains 13000 words). And as word usage frequency from COCA (Corpus of Contemporary American English) , "splice" is at 20171 đ . So it's reflect the @bathos comment, even native speakers do not use that word much.
/Offtopic
Yes, i intentionally used a sentence as a placeholder, because obviously that wouldnât fly.
I fear we're just going to create a "splice" scenario where most people are going to just have to look at the documentation every time to figure out what it does now that there's more than one option.