sileod / DiscSense

Automated Semantic Analysis of Discourse Markers

Confidence and Prior? What do you mean by those? #1

Open brucewlee opened 2 years ago

brucewlee commented 2 years ago

Hi :) Thanks for the great work. DiscSense deserves more recognition. It reveals so much potential for discourse analysis, especially regarding the role of discourse markers in semantics.

As a peer researcher in a similar field, I see a great use case of DiscSense in understanding text semantics through simple token matching. If the semantic labels presented in DiscSense are meaningful enough, such a semantic analysis system would be possible without involving sophisticated BERT-like encoders at all.

However, you mention Confidence (Prior) calculations. I read your paper, but it is difficult to conceptually grasp what you mean by "Confidence" and "Prior".

How exactly are these values computed (I find it unclear in your LREC paper)? And what do you qualitatively mean by "Confidence" and "Prior"? I'd appreciate any help here.

sileod commented 2 years ago

Hi, thank you for these kind words! I also think that it can reveal dataset biases and connotations of markers.

I heavily relied on association-rules terminology, which is a bit old-fashioned now. I mine marker=>label rules in specific datasets. But labels are unbalanced, and one label can be dominant. If a label y is dominant, almost any marker=>y rule will look accurate. The prior is the probability of getting the label regardless of whether the discourse marker is present.

The confidence is the probability that the rule marker=>label holds in a dataset. In the CR dataset, if you encounter "sadly," the review has a 95.2% chance of being negative; that is the confidence of the sadly=>negative association in CR. In the CR dataset, a review has a 21.8% chance of being negative in general, which is the prior for negative in CR. See Table 2 of the paper.
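In association-rule terms, prior = P(label) and confidence = P(label | marker). A minimal sketch of both computations on a toy, made-up review dataset (the data and function names are illustrative, not from DiscSense or CR):

```python
# Toy (text, label) pairs -- invented for illustration, not real CR data.
reviews = [
    ("sadly, it broke after a week", "negative"),
    ("sadly, the battery died fast", "negative"),
    ("sadly, it was too small", "negative"),
    ("sadly, not worth the money", "negative"),
    ("works great, highly recommend", "positive"),
    ("decent value for the price", "positive"),
    ("excellent build quality", "positive"),
    ("arrived on time, very happy", "positive"),
]

def prior(dataset, label):
    """P(label): probability of the label regardless of any marker."""
    labels = [y for _, y in dataset]
    return labels.count(label) / len(labels)

def confidence(dataset, marker, label):
    """P(label | marker): probability of the label among the examples
    that contain the discourse marker (simple substring matching here)."""
    with_marker = [y for text, y in dataset if marker in text]
    if not with_marker:
        return 0.0
    return with_marker.count(label) / len(with_marker)

print(prior(reviews, "negative"))                 # 4/8 = 0.5
print(confidence(reviews, "sadly,", "negative"))  # 4/4 = 1.0
```

A rule like sadly,=>negative is interesting precisely when its confidence is well above the prior, i.e. when the marker shifts the label probability relative to the dataset's base rate.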

brucewlee commented 2 years ago

@sileod Thanks for the explanation!