inidun / text_analytics

Text analytic tools

Diff in token counts #31

Open aibakeneko opened 2 years ago

aibakeneko commented 2 years ago

A token is not counted when it is attached to a special character.

Example: "culture-" is not counted as "culture" in notebooks/word_trends/word_trends.ipynb

https://github.com/inidun/text_analytics/blob/4d0fad3b9dbed04b4d875fa4a9efe287e1c08ce0/resources/SSI.yml#L6
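A minimal sketch of the mismatch, assuming plain whitespace tokenization (the notebook's actual tokenizer may differ). `HYPHEN_LIKE` and `count_tokens` are hypothetical names used only for illustration; they are not part of text_analytics or penelope.

```python
from collections import Counter

# Hypothetical set of hyphen-like characters (hyphen-minus plus U+2010..U+2015
# and U+2043); an illustration only, not the constant defined in penelope.
HYPHEN_LIKE = "-\u2010\u2011\u2043\u2012\u2013\u2014\u2015"


def count_tokens(text, strip_chars=""):
    """Count lowercased whitespace tokens, optionally stripping edge characters."""
    tokens = (tok.strip(strip_chars) for tok in text.lower().split())
    # Drop tokens that become empty after stripping (e.g. a lone dash).
    return Counter(tok for tok in tokens if tok)


text = "Culture- and heritage policy shapes culture"

naive = count_tokens(text)                  # "culture-" and "culture" counted separately
cleaned = count_tokens(text, HYPHEN_LIKE)   # trailing hyphen stripped before counting

print(naive["culture"], cleaned["culture"])
```

Stripping only at token edges leaves in-word hyphens such as "socio-economic" untouched; whether that is the desired behaviour here is an open design choice.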

Possible fix: use SPECIAL_CHARS from https://github.com/humlab/penelope/blob/9f1c7e90cc965ac86d20ec5df8adad04371310d1/penelope/corpus/transforms.py#L28-L30

SPECIAL_CHARS = {
    'hyphens': '-‐‑⁃‒–—―',
    'minuses': '-−-⁻',