soberbichler / Using-LDA-and-Jensen-Shannon-distance-to-separate-relevant-from-non-relevant-articles


Question #1

Open amnpawar opened 1 year ago

amnpawar commented 1 year ago

@soberbichler Can you please shed some light on most_similar_df, where you use 17 as the threshold value for filtering? If I run the notebook on a different dataset, how can I determine which value to use? See line [34] in the notebook: sum(most_similar_df['relevancy']) > 17:

Also, if we raise the value that sum(most_similar_df['relevancy']) is compared against, it also affects the non_rev_0 values: they increase, which increases result_right. So how can I be sure that the values in non_rev_0 are the ones that do have relevancy for me?

Also, if I am not wrong, line [36] in the notebook, all_ = len(non_rev_3) + len(rev_0) + len(non_rev_0) + len(rev_3), denotes the confusion matrix, right?

soberbichler commented 1 year ago

Dear amnpawar,

For the first question: The value was mainly determined by close reading of the results. Also, the confusion matrix helped (which answers your last question :-). For the second question: I am not sure if I understand correctly, but for me, close reading of the results is crucial in order to understand the outcome and to adapt the code.

Best, Sarah
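
To make the confusion-matrix reading concrete, here is a minimal sketch. It assumes (this is not stated in the thread) that the 0 and 3 suffixes are the manual labels for non-relevant and relevant articles, and that the rev / non_rev prefixes are the classifier's decision. Under that assumption rev_3 and non_rev_0 are the correct cells, and rev_0 and non_rev_3 are the two kinds of error. The summarize function and the counts passed to it are hypothetical, not taken from the notebook.

```python
def summarize(rev_3, rev_0, non_rev_3, non_rev_0):
    """Summarize the four confusion-matrix cells, given as counts.

    Assumed meaning of the names (not confirmed by the notebook):
      rev_3     classified relevant,     labelled relevant (3)
      rev_0     classified relevant,     labelled non-relevant (0)
      non_rev_3 classified non-relevant, labelled relevant (3)
      non_rev_0 classified non-relevant, labelled non-relevant (0)
    """
    all_ = non_rev_3 + rev_0 + non_rev_0 + rev_3   # total number of articles
    result_right = non_rev_0 + rev_3               # correctly classified articles
    predicted_relevant = rev_3 + rev_0
    actually_relevant = rev_3 + non_rev_3
    return {
        "all_": all_,
        "accuracy": result_right / all_,
        # Share of articles classified as relevant that really are relevant.
        "precision": rev_3 / predicted_relevant if predicted_relevant else 0.0,
        # Share of truly relevant articles that the classifier found.
        "recall": rev_3 / actually_relevant if actually_relevant else 0.0,
    }


# Made-up counts, for illustration only.
summary = summarize(rev_3=40, rev_0=10, non_rev_3=5, non_rev_0=45)
print(summary)
```

With lists instead of counts, the call would be summarize(len(rev_3), len(rev_0), len(non_rev_3), len(non_rev_0)). Note that all_ on its own is only the total number of articles; the confusion matrix is the four counts taken separately.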

amnpawar commented 1 year ago

Hey @soberbichler, thank you for your response, but I still don't understand what close reading of the results means and how it helped you determine the threshold. Can you shed some light on it with a small example?

Actually, the second question is related to the first: changing the threshold changes the results. The chunk below, sum(most_similar_df['relevancy']) > 17: decides which articles are moved to relevant and which to non-relevant. Putting any other value in place of 17 changes the complete result, which you compute via result_right = len(non_rev_0) + len(rev_3)
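
One way to see how the threshold moves articles between the four groups is to sweep several candidate values over a manually labelled sample and compare the resulting counts. The sketch below is only an illustration under assumptions: each article has a score standing in for sum(most_similar_df['relevancy']), labels are 3 for relevant and 0 for non-relevant, and the data and the helper names confusion_counts and sweep_thresholds are made up. It does not replace close reading of the articles that land in each group.

```python
def confusion_counts(scores, labels, threshold, relevant_label=3):
    """Count the four confusion-matrix cells for one threshold.

    An article is classified as relevant when its score is strictly
    greater than the threshold, mirroring the `> 17` check.
    """
    counts = {"rev_3": 0, "rev_0": 0, "non_rev_3": 0, "non_rev_0": 0}
    for score, label in zip(scores, labels):
        predicted_relevant = score > threshold
        actually_relevant = label == relevant_label
        if predicted_relevant and actually_relevant:
            counts["rev_3"] += 1        # correct: relevant
        elif predicted_relevant:
            counts["rev_0"] += 1        # error: non-relevant kept as relevant
        elif actually_relevant:
            counts["non_rev_3"] += 1    # error: relevant article discarded
        else:
            counts["non_rev_0"] += 1    # correct: non-relevant
    return counts


def sweep_thresholds(scores, labels, thresholds):
    """Return (threshold, counts, accuracy) for every candidate threshold."""
    rows = []
    for threshold in thresholds:
        counts = confusion_counts(scores, labels, threshold)
        all_ = sum(counts.values())
        result_right = counts["non_rev_0"] + counts["rev_3"]
        rows.append((threshold, counts, result_right / all_))
    return rows


# Made-up scores and manual labels, for illustration only.
scores = [30, 24, 18, 15, 9, 3, 21, 12]
labels = [3, 3, 3, 0, 0, 0, 0, 3]

for threshold, counts, accuracy in sweep_thresholds(scores, labels, [17, 20, 23]):
    print(threshold, counts, round(accuracy, 3))
```

On this toy data, thresholds 17 and 23 give the same result_right, but at 23 one more relevant article ends up in non_rev_3. So a growing non_rev_0 alone does not show that the threshold is better; the error cells rev_0 and non_rev_3 have to be checked as well.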