jly36963 commented 9 months ago

Words and Apostophes

Words and related casing utilities have weird behavior for words containing '. I updated the logic to be more like lodash's (removing apostrophes before splitting words)

Relates to, but is not a fix for: https://github.com/gustavoguichard/string-ts/issues/122

Words

// Old behavior
type test = Expect<Equal<Words<"don'tGo">, ["don", "'", "t", "Go"]>>;
// Updated behavior
type test = Expect<Equal<Words<"don'tGo">, ["dont", "Go"]>>;

Casing

// Old behavior
type test = Expect<Equal<SnakeCase<"don'tGo">, "don_'_t_go">>;
// Updated behavior
type test = Expect<Equal<SnakeCase<"don'tGo">, "dont_go">>;

p9f commented 9 months ago

I updated the logic to be more like lodash's

I would like to challenge this. Lodash approach is more complex than just "removing apostrophes before splitting words" because lodash behaviour will be different for where's (treated as one word) and ma'am (treated as two words) (you can see the code here, there is more variations of this logic.).

So with the test sentence "Where's the leak ma'am?", we have:

lodash: ["Where's", "the", "leak", "ma", "am"]
string-ts (current): ["Where", "'", "s", "the", "leak", "ma", "'", "am"]
string-ts (pr): ["Where's", "the", "leak", "ma'am"] as you can see these are three different values.

I think I would be in favour of removing all special characters, but still count them as a word separator - i.e. go for: ["Where", "s", "the", "leak", "ma", "am"].

also - do we agree this is a breaking change that should wait for a v2? if yes, we might want to think about this very carefully, because we don't a new major versions to often, so we might want to think about other breaking changes (like lodash remove all special characters like %, &, ... and we currently don't).

jly36963 commented 9 months ago

I would like to challenge this. Lodash approach is more complex than just "removing apostrophes before splitting words".

_.camelCase("Where's the leak ma'am")
// "wheresTheLeakMaam" 

_.snakeCase("Where's the leak ma'am") 
// "wheres_the_leak_maam"

I guess they do the apostrophe removal at the casing level, before it is passed into words.

We could move the removal of apostrophes to the casing function like they do here. This might be nice for a couple of reasons:

Words and the casing functions would align with lodash
We could choose whether or not to do the apostrophe removal for different casing functions
- Machine readable (snake/camel/kebab): Remove apostrophe
- Human readable (title): Keep apostrophe

Can you think of scenarios where words should be separated based on an apostrophe?

ma'am should definitely be one word
Common contractions should be one word
Not sure about other languages
- ballon d'or -- "golden" should be one word, right?

Either way, SnakeCase<"ma'am"> // ma_'_am is definitely the wrong answer

gustavoguichard commented 9 months ago

Yes, I agree with @p9f that we should think carefully about (any) change in the words method. I like @jly36963 proposal from his latest comment.

If we just "fixed" machine readable casing functions, would that be a major version?

I'm also a little afraid of going down the "perfect grammar" route bc it can lead to a much more complex code. Wdyt?

jly36963 commented 9 months ago

Based on this issue, it might be worth:

keeping ' at the words level
- ["Where's", "the", "leak", "ma'am"]
- Note that this is still a change, as ' isn't treated as a new word
removing ' at the (machine-readable) casing level, before the input is passed into Words
- wheres_the_leak_maam

gustavoguichard commented 9 months ago

It does makes sense. I tend to adopt the same positioning of J Dalton there: limiting the scope to avoid creating a lil monster, but with that said if no one can come up with a con for that we can add that behavior. AFAICT in both English and Portuguese it will work.. not sure about other languages

p9f commented 9 months ago

@jly36963 ballon d'or -- "golden" should be one word, right?

fwiw in french I think "ballon", "d", "or" would be better since d is the contraction of de, a different word from or (kinda ball [made] o[f] gold). I don't think that really matter though - surely French is out of scope.

I don't see any downside in your latest solution right now, even though I think we should take our time to consider all the implications. It seems the best of both world.

What do you think of, in the same release, changing the logic for special character so that they don't create artificial word? i.e. converting hello, world into hello, world instead of the current hello, ,, world (and this for all character except letters, digits and single quote.

jly36963 commented 9 months ago

I'm not too sure what to do with the different special characters scenarios. It's probably beyond the scope of this PR.

Scenarios:

abc,def -> abc def
- Treat comma like separator?
best-f$%^ing-birria -> best f$%^ing birria
- ignore non-separator special chars (don't remove or split word)
cão -> cão
- ignore diacritic/foreign letters
abc123%^& -> (same)
- Don't separate when switching between types, only split on separators and case changes.
- vscode does this with built-in snake/camel case utilities.

I've been thinking of a future solution, maybe something like:

Finite list of "separator characters"
Finite list of removable "special characters" (!@#$%^&...)
Words only splits on "separator characters" and case changes.
Casing functions remove "special characters"
- Maybe add boolean flag to each casing function, default to true?
diacritic/foreign letters and emojis are left alone (not removed or split on)

I could put this in a github discussion

gustavoguichard commented 9 months ago

I'm not too sure what to do with the different special characters scenarios. It's probably beyond the scope of this PR. Keep in mind that supporting international characters is out of the scope of the library at all.

jly36963 commented 9 months ago

@gustavoguichard I'm okay either way.

I think a majority of people would expect the updated behavior (eg: wheres over where_'_s), but there's always a chance that someone is relying on the current behavior.

gustavoguichard commented 9 months ago

@jly36963 I'd do a major change ;)

gustavoguichard / string-ts

fix: words should drop apostrophes #123

Words and Apostophes

Words

Casing