jprichardson / string.js

Extra JavaScript string methods.
http://stringjs.com

slugify russian (non-english) words #43

Open kimptoc opened 11 years ago

kimptoc commented 11 years ago

Hi,

In the browser console on stringjs.com, I tried the following:

S("ревущий фьорд").slugify().s

which returns "" (empty string)

I was hoping it would return the same as the World of Warcraft API slug, ie "ревущии-фьорд" ( http://eu.battle.net/api/wow/realm/status?locale=ru_RU )

Which seems to remove accents (etc.) from characters and converts spaces to hyphens.

I can dream, can't I :)

Thanks, Chris

jprichardson commented 11 years ago

I haven't decided on the multi-lingual position of string.js yet. At first, I allowed PRs with changes that suit other languages. I'm starting to believe that there should be separate string.js libraries for each language. It almost makes too much sense. I'll leave this open to mull this over for a bit.

kimptoc commented 11 years ago

no probs :)

yumitsu commented 11 years ago

The main problem is that JS's built-in regexp engine does not support a 'u' flag for matching Unicode letters with the \w group, so \w only matches ASCII word characters:

"хаха хихи".replace(/[^\w\s-]/g, '') // => ' '

The workaround is to use Unicode character ranges for the target language:

"хаха хихи $%@ 007".replace(/[^\w\s\u0400-\u04FF-]/g, '') // => 'хаха хихи  007' (both spaces around the removed "$%@" remain)
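For comparison, here is a minimal sketch of the same idea that does not hard-code one script. It assumes an engine with Unicode property escapes (ES2018), which did not exist when this comment was written. Both function names are hypothetical and are not part of string.js.

```javascript
// Hypothetical helpers for illustration only; not part of string.js.

// Range-based variant from the comment above: ASCII word characters,
// whitespace, hyphen, plus the Cyrillic block U+0400-U+04FF.
function stripNonCyrillicWordChars(input) {
  return input.replace(/[^\w\s\u0400-\u04FF-]/g, '');
}

// Script-agnostic variant (assumes ES2018 Unicode property escapes):
// keeps letters, combining marks and digits of any script, plus
// underscore, whitespace and hyphen.
function stripNonWordChars(input) {
  return input.replace(/[^\p{L}\p{M}\p{N}_\s\-]/gu, '');
}

console.log(stripNonCyrillicWordChars('хаха хихи $%@ 007'));
console.log(stripNonWordChars('Champs-Élysées!'));
```

The range-based version has to be extended for every new alphabet, while the property-escape version trades that for a dependency on newer engines.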

I have already added support for Cyrillic letters (see #46) as a temporary solution, and I am now thinking about something more elaborate, like an l10n plugin.

brubrant commented 11 years ago

+1.

In my case I want Portuguese characters mapped: é è → e, â ã á à → a, ç → c, etc.

I know the WordPress slugify function (used to create permalinks) works pretty well; maybe it could serve as a starting point. Or this: http://www.elasticsearch.org/guide/reference/index-modules/analysis/asciifolding-tokenfilter/

jprichardson commented 11 years ago

I'm still undecided on this. My inclination is for string.js version 2.0 to have language-specific plugins so as to keep the library footprint small when deploying client-side. The Node.js version would probably have it all packaged up into one library.

jumoel commented 10 years ago

I don't think it should be required to include all languages. What if an English speaking "user" wanted to slugify "My trip to Champs-Élysées"? At the moment that would become my-trip-to-champs-lyses, which isn't quite right, even in English.

Perhaps something like this could be used to handle all cases? http://stackoverflow.com/a/5912746
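One possible way to get the base characters, sketched under the assumption that `String.prototype.normalize` (ES2015) is available: decompose the string to NFD and drop the combining marks. This is not necessarily the approach in the linked answer, and `removeDiacritics` and `slugifyLatin` are hypothetical names, not string.js APIs.

```javascript
// Hypothetical helpers for illustration only; not part of string.js.

function removeDiacritics(input) {
  // NFD splits "É" into "E" + U+0301; the regex then drops the
  // combining marks in the U+0300-U+036F block.
  return input.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function slugifyLatin(input) {
  return removeDiacritics(input)
    .toLowerCase()
    .replace(/[^\w\s-]/g, '')   // drop punctuation and any remaining non-ASCII characters
    .trim()
    .replace(/[\s_-]+/g, '-');  // collapse whitespace, underscores and hyphens
}

console.log(slugifyLatin('My trip to Champs-Élysées'));
```

Letters that have no canonical decomposition, such as "ø" or "ß", are not folded by this approach and would still need an explicit mapping table.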

jprichardson commented 10 years ago

@jmoeller Agreed. I think including all languages would really bloat the library. I also think that it's going to make sense to break this library up into smaller pieces and then have one library tie everything together. Related to #10. I've created a GitHub org [https://github.com/stringjs] for this endeavor.

hickford commented 10 years ago

The current behaviour of slugify is anglocentric because it deletes non-Latin characters. As kimptoc demonstrates, this is useless to international users. A solution might be to define a second function, an international-friendly version of slugify, that preserves non-Latin characters. It would still remove punctuation and replace spaces with dashes.
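A rough sketch of what such a second function could look like, assuming an engine with `String.prototype.normalize` and Unicode property escapes (ES2018). The name `slugifyInternational` is hypothetical; nothing like it is confirmed to exist in string.js.

```javascript
// Hypothetical function for illustration only; not part of string.js.
function slugifyInternational(input) {
  return input
    .normalize('NFD')
    .replace(/\p{M}/gu, '')               // drop combining marks: "й" -> "и", "é" -> "e"
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s\-]/gu, '')   // drop punctuation and symbols, keep letters and digits of any script
    .trim()
    .replace(/[\s\-]+/g, '-');            // collapse whitespace and hyphens into a single dash
}

console.log(slugifyInternational('ревущий фьорд'));
console.log(slugifyInternational('My trip to Champs-Élysées'));
```

Dropping combining marks is a design choice made here only to reproduce the "ревущии-фьорд" form from the original post. It turns "й" into "и", which are distinct letters in Russian, and it would damage scripts where marks carry vowels, so a real implementation might make that step optional.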

jesjos commented 10 years ago

I'd like to support what @jumoel said. slugify should at least be capable of removing diacritics from latin characters, preserving the base char. This should be regarded as something separate from multi-language support, as it's more about alphabets than languages. Perhaps the lib should also have plugin support for different alphabets. In that way slugify could have separate implementations for cyrillic and latin characters.

hickford commented 10 years ago

@jumoel's example convinced me too. Opened another issue, because it's different from the OP's request about Russian: https://github.com/jprichardson/string.js/issues/109