hermitdave / FrequencyWords

Repository for Frequency Word List Generator and processed files
MIT License

Punctuation marks are not ignored in Urdu #18

Open victorbnl opened 3 years ago

victorbnl commented 3 years ago

Disclaimer

⚠️ I don't speak Urdu at all, so please don't take what I say at face value; check with an actual Urdu speaker. ⚠️

Issue

There seem to be some characters in the Urdu word list that should be ignored:
https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/ur/ur_full.txt

See lines 3 and 5

As someone who doesn't speak Urdu, what makes me think these are punctuation rather than words is the characters' Unicode names:
ARABIC COMMA and ARABIC FULL STOP
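
For anyone who wants to check this without reading Urdu, here is a minimal sketch using Python's standard `unicodedata` module to look up a character's name and general category. The helper name `describe` is my own and is not part of the repository.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The two characters reported in this issue, written as escapes for clarity.
for ch in ("\u060C", "\u06D4"):
    print(describe(ch))
```

A general category starting with `P` means the Unicode standard classifies the character as punctuation, which supports the guess above but is still not a substitute for confirmation from an Urdu speaker.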

Potential fix

If you can confirm that this is a real issue, and not just my lack of knowledge of the language, my fix would be to add

،

and

۔

to the ignored characters list.
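
To illustrate the effect of the proposed fix, here is a rough sketch that drops entries made up only of ignored characters from a list of `token count` lines. I have not looked at how the generator actually stores its ignored characters, so `IGNORED_CHARS`, `filter_entries`, and the sample counts are all hypothetical.

```python
# Hypothetical ignore set; the project's real list is not shown in this issue.
IGNORED_CHARS = {"\u060C", "\u06D4"}  # ARABIC COMMA, ARABIC FULL STOP

def filter_entries(lines, ignored=IGNORED_CHARS):
    """Drop 'token count' lines whose token consists only of ignored characters."""
    kept = []
    for line in lines:
        entry = line.strip()
        token = entry.partition(" ")[0]
        if not token:
            continue  # skip blank lines
        if all(ch in ignored for ch in token):
            continue  # the token is pure punctuation, not a word
        kept.append(entry)
    return kept

# Made-up sample in the same "token count" shape as the word list.
sample = ["abc 100", "\u060C 90", "def 80", "\u06D4 70"]
print(filter_entries(sample))
```

This only filters an already generated list; adding the two characters to the generator's own ignore list, as suggested above, would be the proper fix.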