Closed — snadi closed this issue 5 years ago
@ctreude So after handling issues #11, #8, #7 , the final results for json threads since 2015 are here
These are all the stats for threads with score > 0 and answers > 1. Highlighted rows are the threads for which all techniques produced something.
If we want to remove the restriction you talk about, should we simply randomly select 10 threads for the evaluation? Should the median num of sentences affect our decision in any form?
@snadi Doesn't look like I have access to the spreadsheet :(
Without looking at the data, I think using a random sample of threads with score > 0 and answers > 1 would make sense, if there's a good enough chance that the majority of these threads have sentences identified by all approaches.
For median number of sentences, are you referring to the sentences picked by different approaches? If so, it could inform how we configure LexRank (i.e., how many sentences we expect in return).
Hi @snadi Thanks for sharing the data. Given that threads with sentences selected by all approaches appear to be a minority, I wonder if we should use threads with sentences selected by at least 2 approaches? This way, we wouldn't bias it too much towards threads that contain sentences identified by our approach. It would make the sampling a bit more tricky though to make sure everything's balanced...
@ctreude Ok, I created a sheet `threads_with_at_least_2_techs` which contains the threads satisfying that criterion.
In order to proceed with sampling, balancing etc., we need to decide how many threads in total we want to be evaluated.
If we say 20 threads and we want 5 ratings per thread/sentence (I think 10 would mean we need lots of participants, and I'm worried we won't get that many), then we need a total of 100 thread ratings. Given that each participant rates 3 threads, we need ~34 participants in total to cover those threads (hopefully I did the math right).
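The participant math above can be sanity-checked with a few lines (the variable names are just illustrative, not from the study materials):

```python
import math

threads = 20                 # threads to evaluate
ratings_per_thread = 5       # independent ratings wanted per thread
threads_per_participant = 3  # each participant rates 3 threads

total_ratings = threads * ratings_per_thread                       # 20 * 5 = 100
participants = math.ceil(total_ratings / threads_per_participant)  # ceil(100/3) = 34
print(total_ratings, participants)  # 100 34
```

So ~34 participants is indeed the minimum to cover 100 thread ratings at 3 threads each.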
Now, the question is how to sample from that refined sheet. One option is to make sure no technique is disadvantaged. Since the LexRank approach returns however many sentences we ask for anyway, this means balancing between "just if sentences", "conditional sentences", and "word patterns". So, could we say that we randomly select threads from that pool while ensuring that the total number of sentences for each approach (across all threads) is comparable?
Not sure if we need to account for balancing within threads as well? In that case, I'm not sure how we can systematically do this.
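One way to realize the proposed across-thread balancing would be simple rejection sampling: redraw random samples until the per-approach sentence totals are close. A minimal sketch, assuming hypothetical per-thread count fields (none of these names come from the actual sheet):

```python
import random

# Hypothetical keys for per-thread sentence counts of each technique
APPROACHES = ["if_sentences", "conditional_sentences", "word_patterns"]

def totals_comparable(sample, tolerance=5):
    """True if the per-approach sentence totals across the sampled
    threads differ by at most `tolerance` sentences."""
    totals = [sum(t[a] for t in sample) for a in APPROACHES]
    return max(totals) - min(totals) <= tolerance

def balanced_sample(threads, k, max_tries=10_000, rng=random):
    """Rejection sampling: keep drawing random samples of k threads
    until the per-approach totals are comparable; None if no draw
    within max_tries satisfies the constraint."""
    for _ in range(max_tries):
        sample = rng.sample(threads, k)
        if totals_comparable(sample):
            return sample
    return None
```

Whether this is worth the added complexity (and the biases it might introduce) is exactly the open question here.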
@snadi Good question. I wonder whether we could get away without balancing, by just randomly sampling from this sheet? All techniques produce sentences for the majority of threads in there.
If we try to balance, I wonder whether we'd introduce weird biases, such as comparing the usefulness of if sentences to the usefulness of conditional insight sentences in the context of threads which contain WordPattern Sentences.
@ctreude I think randomly sampling while ignoring outlier threads that have lots of identified sentences (e.g., we cannot really ask participants to rate 13 sentences in a thread) is the safest option. I tried to think about more sophisticated balanced sampling and also felt we would introduce weird biases/heuristics.
@ctreude I re-ran the analysis and fixed a few issues with the SO API (I hadn't realized it was returning only a subset of the questions due to the API limit). Given the huge number of questions, I limited the data to the last year, March 29, 2018 - March 29, 2019, which already meant processing close to 30k threads (we keep only those with score > 0 and NumAnswers >= 2). In the end, there are 79 threads that have sentences identified by all three techniques. The median total number of sentences that would get highlighted in each of these threads is 5. Accordingly, what I plan to do is:
1. Randomly select 20 threads from those 79 threads.
2. If the total number of sentences that would get highlighted for any selected thread is > 5, replace it with another random selection. My reasoning is that asking a participant to evaluate more than 5 sentences per thread is pretty cumbersome.
While the above may mean we are only looking at threads with particular characteristics, it also means we are fair to all techniques and that these threads likely have meaningful data to compare in the first place. It also means we no longer have to do weird balancing things.
The final set of 20 selected threads is here: https://docs.google.com/spreadsheets/d/1jNHdltPfyafY7FNkY2g3dfaT5VSI0wFVHw39wLj2vL0/edit#gid=734723480 -- in total, across the threads, there are 17 if sentences, 12 conditional insight sentences, and 13 word pattern sentences. I am concerned that the total number of sentences is not that high, but at the same time, it enables us to get a higher response rate per sentence.
Do you think that's ok, or should I also add a criterion of, say, at least 20 sentences per technique to be evaluated?
Sarah
Hi @snadi, thanks for the update. I think requiring at least 20 sentences per technique would be good -- 12 and 13 sound pretty low. How difficult would it be to change this? I assume this could be achieved by increasing the number of sentences per thread (maybe 6 or 7 instead of 5) or by not sampling the 20 out of the 79 randomly.
Thinking about this, it doesn't really make sense that there are only 12 or 13 sentences. If we have 20 threads, there should be at least 20 of each. I'll double check tomorrow. It may be a typo in one of the cells while calculating the totals; if not, I'll resample. Redoing the sampling isn't hard since I don't need to re-run the whole analysis for that.
OK, I'll hold off on LexRank until then.
So it was just summing up the wrong columns, as I suspected. We have 29 regular if sentences, 20 conditional insight sentences, and 21 word pattern sentences. I think these numbers are good. I'll create the data you need to run LexRank tomorrow (same format as last time).
Perfect, thanks!
You will find the data for the new set of 20 selected threads here: https://github.com/ualberta-smr/Benyamin-Conditional-Insights-Extraction/tree/master/lexrank. Could you please let me know once you've run LexRank on them? Please use the same format you used last time (see Issue #10). Thanks!!
Christoph: "I'm wondering if we should only focus on threads which contain sentences identified by all approaches though -- this might bias our sample towards threads with particular characteristics. Would it make sense to simply use all threads with score > 0 and answers > 1?"