How to use termWeights option

MattJBritton commented 6 years ago

Hi! My name is Matt Britton, and I'm a student at Georgia Tech. My advisor is Alex Endert, a member of John Stasko's department.

I am working on a project to use Sententrees in a visualization of threaded replies in a forum (e.g. Reddit). My objective is to make it easier to navigate and summarize a large conversation.

I have a working prototype built on a Sententree, but the algorithm tends to choose irrelevant words with low content value, such as I, would, think, not, and like. My guess is that these words predominate because forum text, unlike tweets, has a lot more structure and includes more prepositions, articles, and conjunctions than the corpus used in your examples.

I'm looking at ways to address this, and before I do my own text preprocessing, I'd like to investigate the "termWeights" object that can be passed to SententreeModel() as part of the "options" parameter. It looks like this value is parsed and passed to SententreeModel.growSeq(), but from what I can tell, it is not actually implemented there yet.

Can you confirm that my understanding of this code is correct? If so, I may choose to implement weighting myself - can you give me a sense of what you envisioned for this feature and how you intended it to function?

Best,

Matt

MattJBritton commented 6 years ago

[Image: example Sententree generated from my data]

mengdieh commented 6 years ago

For the SentenTree demo we just remove stopwords such as I and the (see filter.js) after parsing the text. Down-weighting these words through termWeights will certainly work as well. termWeights is supposed to be customizable. One use case I have in mind is to add a tf-idf score through termWeights for multi-topic datasets.
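As a rough illustration of the tf-idf idea, the sketch below derives a per-term weight from inverse document frequency (the idf half of tf-idf), so that words appearing in every post get a weight of zero. The function name computeIdfWeights and the { term: number } shape of the result are assumptions made for this example; the thread does not spell out the exact format termWeights expects.

```javascript
// Hypothetical helper, not part of SentenTree.
// Input: an array of posts, each already tokenized into an array of words.
// Output: a plain object mapping each lower-cased term to its idf weight.
// ASSUMPTION: termWeights is a { term: number } mapping.
function computeIdfWeights(tokenizedPosts) {
  const postCount = tokenizedPosts.length;
  const docFreq = new Map(); // term -> number of posts containing it

  for (const tokens of tokenizedPosts) {
    // Use a Set so each term is counted at most once per post.
    const uniqueTerms = new Set(tokens.map(t => t.toLowerCase()));
    for (const term of uniqueTerms) {
      docFreq.set(term, (docFreq.get(term) || 0) + 1);
    }
  }

  const weights = Object.create(null); // no inherited keys such as "constructor"
  for (const [term, df] of docFreq) {
    // idf = log(N / df): 0 for terms in every post, larger for rarer terms.
    weights[term] = Math.log(postCount / df);
  }
  return weights;
}

const exampleWeights = computeIdfWeights([
  ['I', 'like', 'cats'],
  ['I', 'like', 'dogs'],
  ['i', 'think', 'cats'],
]);
```

If the model does accept a term-to-number mapping, an object like exampleWeights could be supplied as the termWeights entry of the options parameter. Multiplying in a term-frequency factor would turn this into a fuller tf-idf score.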

MattJBritton commented 6 years ago

Thanks @mengdieh! Appreciate your response, and sorry to take so long to get back to you. It took me an embarrassingly long time to realize that WordFilter.js wasn't converting to lower case before it checked against the stopwords list, and that was the main source of my problem!
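For anyone hitting the same issue, here is a minimal sketch of a case-insensitive stopword check. It is not the code from WordFilter.js; the stopword list and the removeStopwords helper are made up for illustration.

```javascript
// Illustrative stopword list only; not the list shipped with SentenTree.
const STOPWORDS = new Set(['i', 'the', 'would', 'not', 'a', 'an', 'and']);

// Hypothetical helper: drops stopwords, lower-casing each token
// before the lookup so that "I" and "The" are caught as well.
function removeStopwords(tokens, stopwords = STOPWORDS) {
  return tokens.filter(token => !stopwords.has(token.toLowerCase()));
}

const filtered = removeStopwords(['I', 'would', 'Not', 'visit', 'The', 'park']);
```

The original casing of the surviving tokens is preserved; only the comparison is done in lower case.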

Unfortunately I'm still getting unsatisfactory results using my Sententree to summarize Reddit threads (see image below). I believe that my corpus is significantly more heterogeneous than the tweets you guys used, and so there is relatively little overlap between the posts aside from rather uninteresting words. I'm going to keep working on this, but would appreciate any insights you have from developing the tool. My text source is here.

Also, am I correct that the termWeights option is not currently fully implemented? It doesn't seem to do anything if I pass in a non-empty termWeights as a parameter. I may take a stab at implementing it but just wanted to confirm that my understanding was correct before continuing.
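To make the question concrete, one plausible reading of the feature is that each word's raw count is multiplied by its weight when candidate words are ranked. The sketch below shows that idea in isolation. It is only an assumption about the intended behavior, not the logic in SententreeModel.growSeq(), and pickTopWeightedTerm is a hypothetical name.

```javascript
// Hypothetical sketch, not SentenTree code.
// Ranks terms by (number of posts containing the term) * weight.
// ASSUMPTION: termWeights is a { term: number } mapping; missing terms weigh 1.
function pickTopWeightedTerm(tokenizedPosts, termWeights = {}) {
  const counts = new Map(); // term -> number of posts containing it
  for (const tokens of tokenizedPosts) {
    for (const term of new Set(tokens)) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }
  }

  let best = null;
  let bestScore = -Infinity;
  for (const [term, count] of counts) {
    const hasWeight = Object.prototype.hasOwnProperty.call(termWeights, term);
    const score = count * (hasWeight ? termWeights[term] : 1);
    if (score > bestScore) {
      best = term;
      bestScore = score;
    }
  }
  return best; // null when there are no posts
}

const posts = [
  ['i', 'like', 'cats'],
  ['i', 'like', 'dogs'],
  ['i', 'think', 'cats'],
];
const unweightedPick = pickTopWeightedTerm(posts);
const weightedPick = pickTopWeightedTerm(posts, { i: 0, like: 0.5 });
```

Without weights the most frequent word wins even if it carries little content; with a zero weight on "i" and a reduced weight on "like", a more informative word is chosen instead.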

twitter / SentenTree

How to use termWeights option #5