simov / slugify

Slugifies a string
MIT License
1.47k stars 126 forks source link

correctly handle empty strings in charmaps #174

Closed iliazeus closed 1 year ago

iliazeus commented 1 year ago

Resolves https://github.com/simov/slugify/issues/172

Resolves https://github.com/simov/slugify/issues/95

Trott commented 1 year ago

This introduces a new way to remove characters, but there is already the remove option.

slugify('cat', {remove: /a/ig}) // 'ct'

Not necessarily saying we shouldn't do this, but we might want to pause before introducing multiple ways to do the same thing. If this is a more ergonomic way, consider deprecating remove? If it's a less ergonomic way, don't implement it?

iliazeus commented 1 year ago

@Trott they're not the same semantically.

The "mapping into empty string" is related to the transliteration rules. It is a way to specify in the locale and/or charmap that we don't want a specific letter to be translated into anything. This is already in the charmap for ь and Ь, for example; this should also be in the Russian locale for ъ and Ъ, and there are probably more cases.

The remove argument is a way for the user to specify that they don't want specific characters in the final result. It does not depend on locale and charmap, and is usually used for things like punctuation. The user will probably not be aware of all the transliteration rules, and they won't put things like ь in there when specifying a custom regex.

Trott commented 1 year ago

@Trott they're not the same semantically.

Thanks for the explanation. That makes sense and I understand now.

Trott commented 1 year ago

This is only useful for the twenty six letters of the Latin alphabet, right? For everything else, it already works as expected? Or am I wrong about that? (EDIT: Maybe it's useful for certain common symbols too?)

iliazeus commented 1 year ago

@Trott my own usecase was actually similar to the one described in #172: the Cyrillic characters ь and Ь are (correctly) mapped into empty strings in charmap.json. It should also be done for ъ and Ъ when the Russian locale is added (right now they use Bulgarian transliterations in charmap.json). Thing is, those are proper letters, but since (in Russian, at least) they don't have their own proper sound, they are usually just omitted when transliterating (but in Bulgarian, Ъ isn't!). This feature is useful for these kinds of letters. I don't know enough non-latin and non-cyrillic scripts to offer other examples, though :)

The test uses latin characters for the sake of simplicity.

Trott commented 1 year ago

@Trott my own usecase was actually similar to the one described in #172: the Cyrillic characters ь and Ь are (correctly) mapped into empty strings in charmap.json. It should also be done for ъ and Ъ when the Russian locale is added (right now they use Bulgarian transliterations in charmap.json). Thing is, those are proper letters, but since (in Russian, at least) they don't have their own proper sound, they are usually just omitted when transliterating (but in Bulgarian, Ъ isn't!). This feature is useful for these kinds of letters. I don't know enough non-latin and non-cyrillic scripts to offer other examples, though :)

The test uses latin characters for the sake of simplicity.

I'm sorry if I'm being foolish here and missing something obvious, but doesn't that already work without the change in this pull request?

const slugify = require('slugify');

const str = 'ъaъ';

console.log(slugify(str)); // 'uau'

slugify.extend({ 'ъ': ''})
console.log(slugify(str)) // 'a'

Maybe all that needs to happen is to add a Russian locale to config/locales.json?

iliazeus commented 1 year ago

@Trott sorry for being so unclear :)

The tricky part is: in your case, ъ is removed not by the extended charmap, but by the default remove regexp, which matches [^\w] (among others). Here is a quick test with a dummy remove:

> var slugify = require('slugify')
undefined
> slugify('ъяъ', { remove: /[]/ })
'uyau'
> slugify.extend({ 'ъ': '' })
undefined
> slugify('ъяъ', { remove: /[]/ })
'ъyaъ'

The order of steps for the current behavior is:

If someone that is not aware of language-specific rules specifies a custom remove, then mapping to empty string will stop working - it only did because of the default remove. This PR fixes that.

iliazeus commented 1 year ago

@Trott we could instead just document that the remove regexp must contain the [^\w] class, but this would still be counter-intuitive, since it looks like it's mostly meant for punctuation and other non-letter, non-locale-dependent characters. The documented example does actually not match [^\w] :)

iliazeus commented 1 year ago

I've updated the test to be a more relevant example of bugged behavior.

Trott commented 1 year ago

I think I see now. FWIW, the slug module already works this way.

const slug = require("slug");
const slugify = require("slugify");

const str = "ъяъ";

console.log(slug(str)); // 'uyau'
console.log(slugify(str)); // 'uyau'

slug.extend({ ъ: "" });
slugify.extend({ ъ: "" });

console.log(slug(str)); // 'ya'
console.log(slugify(str)); // 'ya'

console.log(slug("ъяъ", { remove: /[]/g })); // 'ya'
console.log(slugify("ъяъ", { remove: /[]/g })); // 'ъyaъ'
simov commented 1 year ago

Published in v1.6.6

@Trott I also added this https://github.com/simov/slugify/commit/3f0b3f594a06d7b51e774e25d5dc288f7247e8bb