mustafa-tariqk / mindscape

Experience the truth of the trip
https://research.cs.queensu.ca/home/cisc498/index.html
MIT License

Handle short-form and hyphenated words #40

Closed: BasicallyOk closed this issue 7 months ago

BasicallyOk commented 8 months ago

Is your feature request related to a problem? Please describe.

Word Cloud currently removes hyphens (-) and single quotes ('). As a result, contractions like "can't" or "I've" lose their apostrophe and are counted as new words.
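To make the symptom concrete, here is a minimal sketch that reproduces it. It is not the project's code: `naive_counts` is a hypothetical helper, and the assumption is simply that apostrophes and hyphens are deleted before words are counted.

```python
import re
from collections import Counter


def naive_counts(text):
    """Count words after deleting apostrophes and hyphens outright.

    Hypothetical stand-in for the behaviour described in this issue,
    not the actual Word Cloud implementation.
    """
    stripped = text.lower().replace("'", "").replace("-", "")
    return Counter(re.findall(r"[a-z]+", stripped))


counts = naive_counts("I couldn't tell. I couldn't make sense of my self-doubt.")
print(counts["couldnt"], counts["selfdoubt"])  # 2 1
```

The two halves of each word get glued together, which matches the "couldnt" and "selfdoubt" keys in the output reported in this issue.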

Describe the solution you'd like

Possibly support custom rules per language, e.g. a language parser that converts every short form to its full form. Hyphenated words are a little more complicated, so each one should be treated as its own word for now.
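One way the proposal could look, as a rough sketch only: a small rule table expands short forms, and a regex keeps hyphenated words intact as single tokens. `CONTRACTIONS`, `WORD_RE`, and `tokenize` are hypothetical names, the table is deliberately tiny, and only the straight ASCII apostrophe is handled.

```python
import re

# Hypothetical English rule table; a real one would be far larger and
# could be swapped out per language.
CONTRACTIONS = {
    "can't": "cannot",
    "couldn't": "could not",
    "wasn't": "was not",
    "don't": "do not",
    "i've": "i have",
    "i'm": "i am",
}

# A word is a run of letters, optionally joined by internal
# apostrophes or hyphens ("couldn't", "self-doubt").
WORD_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(text, rules=CONTRACTIONS):
    """Lowercase, split into words, and expand known short forms."""
    tokens = []
    for word in WORD_RE.findall(text.lower()):
        # Unknown words, including hyphenated ones, pass through unchanged.
        tokens.extend(rules.get(word, word).split())
    return tokens


tokens = tokenize("I've always struggled with self-doubt, and I couldn't stop.")
print(tokens)
```

With this approach "couldn't" contributes "could" and "not" (both likely to be dropped later as stop words), while "self-doubt" survives as one word, which is the "treat it as its own word for now" behaviour asked for above.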

Describe alternatives you've considered

Record possible short forms in the database. This would be a little too inefficient imo. Custom rules could also be beneficial for more features.

Additional context

The text

Hi. I've been. struggling. I had a really bad experience with psychedelics. It was terrifying. I thought I was losing my mind. Everything was distorted, and I couldn't tell what was real and what wasn't. I felt trapped in a nightmare. I've been having nightmares. and I can't shake this feeling of dread. I'm scared that I've damaged my mind permanently. I'm not sure everything just felt so chaotic and out of control I couldn't make sense of anything, and it felt like I was spiraling into darkness. I guess. I just don't want to feel that way ever again. It was the worst thing I've ever experienced. Yeah, I think that could help. I just don't want to feel so alone and helpless anymore. I suppose. I've always struggled with anxiety and self-doubt, but this experience just amplified everything. Thank you. I really needed to hear that. I've been feeling so lost and hopeless, but talking to you has given me a glimmer of hope.

Returns

{
    "selfdoubt": 1,
    "psychedelics": 1,
    "spiraling": 1,
    "glimmer": 1,
    "terrifying": 1,
    "couldnt": 2,
    "wasnt": 1,
    "hopeless": 1,
    "nightmares": 1,
    "dread": 1
}

as its 10 most significant words

BasicallyOk commented 8 months ago

924d8f5 uses NLTK's stop words and tokenizer to support pruning. Language support now depends on what NLTK itself supports (most popular languages).

mustafa-tariqk commented 8 months ago

close issue if done @BasicallyOk

BasicallyOk commented 8 months ago

I completely forgot, this issue was solved as part of #32 with the NLTK tokenizer.

BasicallyOk commented 8 months ago

Issue persists, will fix in #66