simov / slugify

Slugifies a string
MIT License
1.5k stars 129 forks source link

Cyrillic character in url causing encoding issues #172

Closed emoutisheva closed 1 year ago

emoutisheva commented 1 year ago

Hi! 👋

Firstly, thanks for your work on this project! 🙂

Today I used patch-package to patch slugify@1.6.5 for the project I'm working on.

We use it to convert translations from other languages to english for url slugs for example : /anglijskaya-premьer-liga/ The character ь is mapped to an empty string in your charmap.json and it does not get converted or removed: var appendChar = locale[ch] || charMap[ch] || ch; resulted in var appendChar = undefined || "" || ь. This however was causing some encoding issues resulting in a redirect loop and not being able to open the urls. Changing the || to ?? will remove the character.

Here is the diff that solved my problem:

diff --git a/node_modules/slugify/slugify.js b/node_modules/slugify/slugify.js
index ca37320..2540ca8 100644
--- a/node_modules/slugify/slugify.js
+++ b/node_modules/slugify/slugify.js
@@ -33,7 +33,7 @@
     var slug = string.normalize().split('')
       // replace characters based on charMap
       .reduce(function (result, ch) {
-        var appendChar = locale[ch] || charMap[ch] || ch;
+        var appendChar = locale[ch] ?? charMap[ch] ?? ch;
         if (appendChar === replacement) {
           appendChar = ' ';
         }

This issue body was partially generated by patch-package.

simov commented 1 year ago

I think another way to solve this is by extending slugify with your own translation of the ь character that is something other than an empty string ''.

Or maybe add it as a new locale in this module. As for the reason why ь was set to empty string initially I'm not sure. I think I set it initially in reference to the Bulgarian language, but now thinking about it that doesn't make much sense either because in that case it has to be set to i - ьо -> io maybe.

emoutisheva commented 1 year ago

Yes, if it's set to something else likeiowould work as it will be replaced. So there are different ways of solving this issue, on our end we've chosen the above, although we could have added a new location for 'ru' or simply replaced the "" in the charmap.json. Anyway just wanted to make you aware of this as it could cause url encoding errors and you can best decide how to/if to change it.

Trott commented 1 year ago

I'm a little confused. I'm unable to repro the issue here, unless I'm misunderstanding. Maybe you're using an option that isn't the default somewhere? When I use the latest version of slugify, it removes the character.

const slug = require('slug')
const slugify = require('slugify')

console.log(`SLUG "${slug('ь')}"`)
console.log(`SLUGIFY "${slugify('ь')}"`)

console.log(`SLUG "${slug('aь')}"`)
console.log(`SLUGIFY "${slugify('aь')}"`)

Output:

SLUG "0yw"
SLUGIFY ""
SLUG "a"
SLUGIFY "a"

slugify removes the character in both cases.

slug does too but takes a different approach if the resulting slug is an empty string in its entirety. It generates a base64 encoding of the string as a slug. This way, you at least get something useable (if not ideal) if you use many strings consisting entirely of characters that are discarded.

Trott commented 1 year ago

Also note that using ?? is only available in Node.js 14 and newer. Since slugify purports to support Node.js version back to 8, this would be a breaking change. (However, Node.js earlier than 14 is EOL, so I would support/encourage such a breaking change.)

simov commented 1 year ago

Actually you are correct, @Trott, I was not able to reproduce that either. I thought I know about the issue because I remember someone reporting it in the past. @emoutisheva can you provide a sample that reproduces the issue? I tried with anglijskaya-premьer-liga and the ь character was removed.

emoutisheva commented 1 year ago

To clarify it a bit, we are passing translation "Английская Премьер-лига" . We are using: return slugify(name, { locale:'ru', lower: true, remove: /#|@|\?/ });. which results in "anglijskaya-premьer-liga". But if I test it your way, without the remove regex, it will remove the character and result it "anglijskaya-premer-liga". So digging a bit deeper it looks like the default remove regex you use in slugify.js line 42: .replace(options.remove || /[^\w\s$*_+~.()'"!\-:@]+/g, '') removes that characterь. Screenshot 2023-02-20 at 12 03 20

simov commented 1 year ago

Yes, that is because ь falls outside of A-Za-z range of the \w class. Since you are already using a locale for the Russian language then why not define the translation for that character there?

emoutisheva commented 1 year ago

Decided to add the locale. Thank you for your suggestions and time.