Setting analysis widget

navigating-stories / orange-story-navigator

Add-on to the Orange3 data mining toolkit with text processing widgets from the project Navigating Stories

https://research-software-directory.org/projects/navigating-stories


Setting analysis widget #14

Open kodymoodley opened 7 months ago

kodymoodley commented 7 months ago

Implement one feature for analysing the setting of a story:

One approach could be to obtain a list of keywords (words that uniquely identify the story), say 'kw'.
Thereafter, we could find the 'closest' N words to each word in 'kw' within a pretrained embedding space for Dutch.
The cluster(s) of these words could then be projected to two dimensions (e.g. with t-SNE) and rendered to the screen to inform the setting.
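
To make the second step concrete, here is a minimal sketch of a nearest-neighbour lookup by cosine similarity. It is not the project's implementation: the function names (`cosine`, `closest_words`) and the three-dimensional toy vectors are invented for illustration, and a real version would load a pretrained Dutch embedding model, which has not been chosen yet. The t-SNE projection is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(keyword, embeddings, n):
    """Return the n words most similar to `keyword` (hypothetical helper).

    `embeddings` maps each word to its vector. Keywords missing from the
    vocabulary yield an empty list instead of raising.
    """
    if keyword not in embeddings:
        return []
    target = embeddings[keyword]
    scored = [
        (cosine(target, vec), word)
        for word, vec in embeddings.items()
        if word != keyword
    ]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:n]]


# Toy vectors, made up for this example only (not from a real model).
toy_embeddings = {
    "strand": [0.9, 0.1, 0.0],
    "zee": [0.8, 0.2, 0.1],
    "duin": [0.7, 0.3, 0.0],
    "school": [0.0, 0.9, 0.2],
    "klas": [0.1, 0.8, 0.3],
}

kw = ["strand", "school"]
neighbours = {word: closest_words(word, toy_embeddings, 2) for word in kw}
```

With a real model the brute-force loop would probably be replaced by the model's own similarity query, but the input (a keyword list) and output (N neighbours per keyword) would look the same.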

f-hafner commented 7 months ago

We defined the following subtasks:

[x] start from corpus of stories
[x] remove stopwords
[x] lemmatize
[x] put into a dataframe together with storyid and segment id
[ ] prepare embeddings: @kodymoodley finds out which model to use
[ ] extract similar words in embedding space
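
The completed preprocessing subtasks could look roughly like the sketch below. It is an assumption, not the code in the repository: `is_content_token` and `segments_to_rows` are hypothetical names, and the tokens are stand-in objects that only mimic the spaCy `Token` attributes `is_stop`, `is_punct`, `is_space` and `lemma_`, so that the example runs without a Dutch model installed.

```python
from types import SimpleNamespace


def is_content_token(token):
    """Hypothetical filter: drop stopwords, punctuation and whitespace."""
    return not (token.is_stop or token.is_punct or token.is_space)


def segments_to_rows(segments):
    """Flatten (story id, segment id, tokens) triples into one row per lemma."""
    rows = []
    for story_id, segment_id, tokens in segments:
        for token in tokens:
            if is_content_token(token):
                rows.append(
                    {
                        "storyid": story_id,
                        "segment_id": segment_id,
                        "lemma": token.lemma_.lower(),
                    }
                )
    return rows


def fake_token(text, lemma, is_stop=False, is_punct=False):
    """Stand-in for a spaCy Token, used only in this example."""
    return SimpleNamespace(
        text=text, lemma_=lemma, is_stop=is_stop, is_punct=is_punct, is_space=False
    )


# One story with one segment: "De kinderen speelden."
segments = [
    (
        "story1",
        0,
        [
            fake_token("De", "de", is_stop=True),
            fake_token("kinderen", "kind"),
            fake_token("speelden", "spelen"),
            fake_token(".", ".", is_punct=True),
        ],
    ),
]

rows = segments_to_rows(segments)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` to obtain the dataframe with story id and segment id mentioned above.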

f-hafner commented 7 months ago

Questions to discuss

I am reusing the spaCy model loaded for other tasks. Is this OK here?
- For instance, "merge_noun_chunks" is added to the nlp model. Then, "Mijn eerste vriendje" becomes ["mijn een vriendje"]; if it is not added, we have ["mijn", "een", "vriendje"].
Refactoring
- The structures of the tagger and the setting analyzer are now quite similar; maybe we can think of combining them?
- Test the function util.is_valid_token(); reuse it in tagging.py.

kodymoodley commented 6 months ago

> We defined the following subtasks:
>
> [x] start from corpus of stories
> [x] remove stopwords
> [x] lemmatize
> [x] put into a dataframe together with storyid and segment id
> [ ] prepare embeddings: @kodymoodley finds out which model to use
> [ ] extract similar words in embedding space

Thanks very much, @f-hafner! Having the preprocessing completed is already super helpful. The lead applicants have recently informed me that they would like to pause work on the Setting widget until after the workshop, so this feature is no longer required for the workshop in April. I (or we) could resume where you left off after the workshop.

kodymoodley commented 6 months ago

> Questions to discuss
>
> I am reusing the spaCy model loaded for other tasks. Is this OK here?
> - For instance, "merge_noun_chunks" is added to the nlp model. Then, "Mijn eerste vriendje" becomes ["mijn een vriendje"]; if it is not added, we have ["mijn", "een", "vriendje"].
>
> Refactoring
> - The structures of the tagger and the setting analyzer are now quite similar; maybe we can think of combining them?
> - Test the function util.is_valid_token(); reuse it in tagging.py.

@f-hafner, I will revisit this comment in April / May. Right now, I suspect that merging the noun chunks would not be necessary for what we want to do.