rifflearning / zenhub

This is the master repository for the Riff Projects in our ZenHub Workspace

Spike: Explore changing the stitching threshold in Riff-server Utterances. #209

Open jaedoucette opened 5 years ago

jaedoucette commented 5 years ago

Describe the ideal solution or feature request

Riff-server currently "stitches" together any two utterances made by the same user that occur within 5 seconds of each other. This is probably too wide to allow the accurate detection of interruption attempts unless the attempts are successful. We would like to reduce this threshold.

A possible reason for this behavior is that the computation of influence events uses a quadratic-time algorithm, and that algorithm is run every time the user visits the dashboard. This means that conversations with more than a few thousand utterances are likely to cause the dashboard to hang. By stitching the utterances together this aggressively, the total number of utterances in a meeting is sharply reduced.

To fix this, we would need to change the threshold (easy) and improve the quadratic-time algorithm used to compute the data that is displayed in the influence charts (harder).
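
For reference, a minimal sketch of what the stitching step might look like with a configurable threshold. The `Utterance` shape, its field names, and `stitchUtterances` are illustrative assumptions, not the actual riff-server code:

```typescript
// Hypothetical shape of a stored utterance; field names are illustrative.
interface Utterance {
  participant: string;
  startTime: number; // ms since epoch
  endTime: number;   // ms since epoch
}

// Merge consecutive utterances by the same participant whose silent gap is
// below the threshold. Assumes the input is sorted by startTime.
function stitchUtterances(utterances: Utterance[], thresholdMs = 5000): Utterance[] {
  const lastByParticipant = new Map<string, Utterance>();
  const stitched: Utterance[] = [];

  for (const u of utterances) {
    const prev = lastByParticipant.get(u.participant);
    if (prev !== undefined && u.startTime - prev.endTime <= thresholdMs) {
      // Extend the previous stitched utterance instead of emitting a new one.
      prev.endTime = Math.max(prev.endTime, u.endTime);
    } else {
      const copy = { ...u };
      stitched.push(copy);
      lastByParticipant.set(u.participant, copy);
    }
  }
  return stitched;
}
```

Changing the threshold would then just be a matter of passing a different `thresholdMs` value.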

Acceptance Criteria

The algorithm can be improved to parametric linearithmic time (which should let it support tens of millions of utterances without hanging), as shown below. The algorithm I have picked is intended to be easy to implement and requires only common data structures (specifically, a min-heap). We could achieve strictly linearithmic time by using a 2d range tree instead, but I am less confident that one would be implemented well if we just pull a random node package, because range trees are a little more exotic.

This algorithm has an asymptotic complexity of O(n*k*log n), where k is the maximum number of utterances that occur within a 3-second interval and n is the total number of utterances. It is reasonable to expect that k << n for any input where runtime is important, so this is parametric linearithmic time, and it should scale well.
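
As a rough illustration of the kind of min-heap sweep described here, the sketch below assumes an influence event is one speaker starting within 3 seconds of a different speaker finishing, and reuses the illustrative `Utterance` shape from the stitching sketch above; none of the names are from the real codebase:

```typescript
// Minimal binary min-heap keyed by a number; just enough for this sketch.
class MinHeap<T> {
  private items: { key: number; value: T }[] = [];
  get size(): number { return this.items.length; }
  peekKey(): number { return this.items[0].key; }
  push(key: number, value: T): void {
    this.items.push({ key, value });
    let i = this.items.length - 1;
    while (i > 0) {
      const p = (i - 1) >> 1;
      if (this.items[p].key <= this.items[i].key) break;
      [this.items[p], this.items[i]] = [this.items[i], this.items[p]];
      i = p;
    }
  }
  pop(): T {
    const top = this.items[0].value;
    const last = this.items.pop()!;
    if (this.items.length > 0) {
      this.items[0] = last;
      let i = 0;
      for (;;) {
        const l = 2 * i + 1;
        const r = 2 * i + 2;
        let s = i;
        if (l < this.items.length && this.items[l].key < this.items[s].key) s = l;
        if (r < this.items.length && this.items[r].key < this.items[s].key) s = r;
        if (s === i) break;
        [this.items[s], this.items[i]] = [this.items[i], this.items[s]];
        i = s;
      }
    }
    return top;
  }
}

// An influence event here: one speaker starts within `windowMs` of a
// different speaker finishing (an assumption for illustration).
interface InfluenceEvent {
  influencer: string;
  influencee: string;
  gapMs: number;
}

function findInfluenceEvents(utterances: Utterance[], windowMs = 3000): InfluenceEvent[] {
  const byStart = new MinHeap<Utterance>(); // every utterance, keyed by startTime
  const byEnd = new MinHeap<Utterance>();   // every utterance, keyed by endTime
  for (const u of utterances) {
    byStart.push(u.startTime, u);
    byEnd.push(u.endTime, u);
  }

  const events: InfluenceEvent[] = [];
  while (byStart.size > 0) {
    const u = byStart.pop();

    // Utterances that ended more than windowMs before u started can never
    // matter again (later utterances start even later), so drop them for good.
    while (byEnd.size > 0 && byEnd.peekKey() < u.startTime - windowMs) {
      byEnd.pop();
    }

    // Temporarily pop the (at most ~k) utterances that ended in the window
    // before u started, compare them against u, then push them back so they
    // stay available for later utterances. Roughly O(k log n) per utterance.
    const stash: Utterance[] = [];
    while (byEnd.size > 0 && byEnd.peekKey() <= u.startTime) {
      const prev = byEnd.pop();
      if (prev.participant !== u.participant) {
        events.push({
          influencer: prev.participant,
          influencee: u.participant,
          gapMs: u.startTime - prev.endTime,
        });
      }
      stash.push(prev);
    }
    for (const prev of stash) {
      byEnd.push(prev.endTime, prev);
    }
  }
  return events;
}
```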

mlippert commented 5 years ago

What is the ballpark number of utterances in a 4 person one hour meeting? (not stitched utterances, the actual utterances we store in the DB). And then maybe the number of stitched utterances in that same meeting.

This is interesting near term in that we can easily do the stitching on the client before calculating the influence/interruption/timeline data, and we can then stitch differently for those 3 different analyses.

The 2 downsides to doing this that I can think of are memory usage on the client for the larger number of utterances, and longer time to transfer the larger number of utterances over the network.

Long term, I expect calculating this analytic data will be done on the server and stored in a DB, and the raw utterances will not be getting sent to the client. But that's a ways down the road.

jaedoucette commented 5 years ago

@mlippert This is a good thing to check. I took a few minutes for a quick look.

From the Fullbridge dataset, it looks like without stitching, a typical number of utterances is about 150-200 per person for a 60 minute meeting, though some people speak much less than this, and people with something in the background (I think) can have thousands. With 5s stitching, it looks like this is roughly cut in half. So the raw set should be (roughly) twice as large as the stitched set, except for the weird cases where someone has thousands of utterances because there's a fan behind them or something.

This also makes me think this is probably a lower priority concern: Unless you get 10 people together, or hold a multi-hour meeting, it doesn't seem likely that the quadratic runtime will be noticeable to the user. Maybe we don't care about those edge cases?

mlippert commented 5 years ago

@jaedoucette You're probably right that improving the algorithm used to calculate the influence analytic data is of lower priority, but having read through the algorithm you described above, I was/am looking forward to implementing it! It's something I'd particularly enjoy doing. Maybe not tomorrow though.

Those numbers are good to know though. I thought there could be thousands per person per hour, and perhaps there will be more when we've got improved speech detection, but probably not an order of magnitude more, so that's good (although something we should look at).

jaedoucette commented 5 years ago

@mlippert I thought some more about this, and we can actually be a bit more clever than the algorithm in my earlier post. We don't save a ton with this, but it's a little more satisfying.

The key update is that we don't need to add all the utterances to the endDate priority queue at the start, because an utterance can only possibly be relevant once it has already been removed from the startDate queue. The algorithm works the same as before, except that you:

- Only add an utterance to the endDate queue at the point where it is removed from the startDate queue, instead of adding them all up front.

Everything else is the same. This costs O(n*log n) for the initial sort (but with a small constant, and we might reuse this sorted utterance data for several applications, especially if it's sorted by the server before it's sent to us), and then O(n*k*log k) for the processing work, so the O(log n) factor during processing becomes an O(log k) factor. My guess is that this could be an extra 5x faster, and it's actually a little better than the 2d range tree algorithm (which would be O(n*(k + log n)), but with larger constants than a simple sort).
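
A sketch of that change, reusing the illustrative `MinHeap`, `Utterance`, and `InfluenceEvent` from the earlier sketches; it assumes the utterances arrive already sorted by start time, so the startDate heap becomes plain iteration and the endDate heap only ever holds recently active utterances:

```typescript
// Reuses the illustrative MinHeap, Utterance, and InfluenceEvent from above.
// Assumes `utterances` is already sorted by startTime.
function findInfluenceEventsLazy(utterances: Utterance[], windowMs = 3000): InfluenceEvent[] {
  const events: InfluenceEvent[] = [];
  const byEnd = new MinHeap<Utterance>(); // only holds utterances already processed

  for (const u of utterances) { // plain iteration replaces the startDate heap
    // Drop utterances that ended too long ago to matter from here on.
    while (byEnd.size > 0 && byEnd.peekKey() < u.startTime - windowMs) {
      byEnd.pop();
    }

    // Pop the (at most ~k) candidates that ended before u started, compare
    // them against u, then push them back. Roughly O(k log k) per utterance,
    // since the heap never holds more than the recently active utterances.
    const stash: Utterance[] = [];
    while (byEnd.size > 0 && byEnd.peekKey() <= u.startTime) {
      const prev = byEnd.pop();
      if (prev.participant !== u.participant) {
        events.push({
          influencer: prev.participant,
          influencee: u.participant,
          gapMs: u.startTime - prev.endTime,
        });
      }
      stash.push(prev);
    }
    for (const prev of stash) {
      byEnd.push(prev.endTime, prev);
    }

    // u only enters the endDate queue once it has been taken off the start queue.
    byEnd.push(u.endTime, u);
  }
  return events;
}
```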

mlippert commented 5 years ago

I like it. That was a nice realization that definitely streamlines the algorithm and makes it more efficient.

The initial utterances are already sorted; the code that retrieves them sorts them using the built-in sort for arrays.

It turns out that requesting that the server return them sorted doesn't work for stitched utterances (I discovered that during my 1st refactor), so the client sorts them after they've all been retrieved, since every other use of them on the client needs them sorted.

So they have to be added to the endDate queue in a way that keeps that queue sorted by end date, right (i.e. an insertion sort)?

jaedoucette commented 5 years ago

@mlippert I see now that this is not clear from what I wrote, but I intended for the endDate queue to remain a min-priority queue, even though the startDate queue can just be a regular queue now. I think a heap is the right structure to use for the endDate queue.

Actually thinking about this more, maybe we can get O(n*k) time by using two regular queues in place of the priority queue.

The idea is:

- As things are removed from the startDate queue, insert them in sorted order into the endDate queue. This costs O(k) per element, or O(n * k) in total.
- As you remove items from the endDate queue, put them in a second queue.
- When you're done removing items from the endDate queue, make this second queue the endDate queue, and make the (now empty) endDate queue take on the role of the second queue during the next iteration of the algorithm's main loop.

That gets us O(n * k) total runtime if the list is already sorted. Hard to see how we could do better than that!
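
A sketch of that two-queue idea using plain arrays, again with the illustrative `Utterance` and `InfluenceEvent` shapes from above, and assuming the input is already sorted by start time:

```typescript
// Reuses the illustrative Utterance and InfluenceEvent shapes from above.
// Assumes `utterances` is already sorted by startTime (the client sorts them anyway).
function findInfluenceEventsTwoQueues(utterances: Utterance[], windowMs = 3000): InfluenceEvent[] {
  const events: InfluenceEvent[] = [];
  let endQueue: Utterance[] = []; // recent utterances, ordered by endTime
  let spare: Utterance[] = [];    // second queue, reused every iteration

  for (const u of utterances) {
    spare.length = 0;
    // Drain the endDate queue: drop expired entries, compare the rest to u,
    // and move everything still relevant into the second queue (O(k)).
    for (const prev of endQueue) {
      if (prev.endTime < u.startTime - windowMs) continue; // expired for good
      if (prev.participant !== u.participant && prev.endTime <= u.startTime) {
        events.push({
          influencer: prev.participant,
          influencee: u.participant,
          gapMs: u.startTime - prev.endTime,
        });
      }
      spare.push(prev);
    }

    // Insert u in end-time order with one insertion-sort pass (O(k)); no heap needed.
    let i = spare.length;
    spare.push(u);
    while (i > 0 && spare[i - 1].endTime > u.endTime) {
      spare[i] = spare[i - 1];
      i--;
    }
    spare[i] = u;

    // Swap roles: the second queue becomes the endDate queue for the next iteration.
    [endQueue, spare] = [spare, endQueue];
  }
  return events;
}
```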

adonahue commented 5 years ago

@jaedoucette - what priority (for the next sprint and in general) would you give this work, and why?

jaedoucette commented 5 years ago

@adonahue Brec and I did the "hard" part of this during our work on interruptions, so this is probably now a '1'. Priority is low-ish. We might do this incidentally as part of other feature work anyway.

brecriffs commented 4 years ago

I don't really understand the part about influence being a quadratic time algorithm. Doesn't it just see if another utterance ended within 3 seconds before? Is the point that, if we reduce stitching, and there are more utterances, computing the pairwise events will be more costly, since it will have to parse through more utterances? Does tagging utterances on the back-end solve this?

jaedoucette commented 4 years ago

@brecriffs Influence was quadratic time, because the way it was being computed was by comparing every utterance to every other utterance (for n(n-1)/2 = O(n^2) comparisons in total). You and I rewrote it to do a linear number of comparisons instead, so it's no longer slow.

So previously, if we quadrupled the number of utterances (which I feared we might), we'd increase the runtime by a factor of 16. Now we increase it only by a factor of 4.

jaedoucette commented 4 years ago

I sampled 630,000 utterances, consisting of every utterance made during our most recent Next Canada course.

The plot below shows the cumulative distribution function for the time elapsed between every two utterances made by the same speaker (clipped at 5 seconds).

[Figure_1.png: cumulative distribution of the time between consecutive utterances by the same speaker, clipped at 5 seconds]

The way to read this plot is that the x-axis shows the time a person was quiet between two utterances, in hundredths of a second. The y-axis shows the percentage of all utterance pairs that would be stitched together if we used a given x-axis value as our cutoff for stitching.

Roughly 70% of the probability mass is concentrated in pauses between 0.5 seconds and 2 seconds. After the 1.75 second mark, there is a very long tail of pauses that seem to me more likely to represent someone making a separate utterance rather than pausing in the middle of an utterance. An additional 10% of the mass lies between 2 and 3 seconds, and this might be a sort of "gray area".

As a short-term fix, I recommend that we set the stitching threshold to 1.75 seconds, or to 3 seconds. We may want to test both values for a while, and see how they feel to us.

mlippert commented 4 years ago

That is an interesting graph, but I'm lost as to the units of the x-axis. Based on your text I'm sort of guessing it's in units of 1/100ths of a second? But I'm not really sure.

jaedoucette commented 4 years ago

Some revisions after poking around more. Here's an updated graph with clearer units, which goes out to 60 seconds.

[50SecondHistogram.png: updated gap distribution with clearer units over a longer time range]

You can see that almost all the probability mass is concentrated in the first few seconds.

Specific cutoffs that are of interest to us are:

- 1.8s -> 70% stitching
- 2.3s -> 75% stitching
- 3.5s -> 80% stitching
- 5.0s -> 85% stitching

This means that if we moved to a stitching threshold of 1.8s, we should see roughly twice as many utterances as before (each stitched gap removes one utterance, so the number of stitched utterances scales roughly with 100% minus the stitching percentage). A threshold of 3.5s would give us about 30% more, and 2.3s would give us about 60% more.

Looking at this, it seems to me like 2.3 is a pretty reasonable cutoff for stitching. It's still a little on the long side, but it seems easier to justify that value given the distribution of gaps.
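
For reference, the kind of computation behind those percentages could look roughly like this; an illustrative sketch using the `Utterance` shape from above, which just measures what fraction of same-speaker gaps would fall under each candidate threshold:

```typescript
// Illustrative sketch of the gap analysis: for each consecutive same-speaker
// pair of utterances, record the silent gap, then report what fraction of
// gaps would be stitched at each candidate threshold.
function stitchingFractionByThreshold(
  utterances: Utterance[], // assumed sorted by startTime
  thresholdsMs: number[],
): Map<number, number> {
  const gaps: number[] = [];
  const lastEndByParticipant = new Map<string, number>();

  for (const u of utterances) {
    const lastEnd = lastEndByParticipant.get(u.participant);
    if (lastEnd !== undefined && u.startTime >= lastEnd) {
      gaps.push(u.startTime - lastEnd);
    }
    lastEndByParticipant.set(u.participant, u.endTime);
  }

  const fractions = new Map<number, number>();
  for (const t of thresholdsMs) {
    const stitched = gaps.filter((g) => g <= t).length;
    fractions.set(t, stitched / gaps.length);
  }
  return fractions;
}

// e.g. stitchingFractionByThreshold(allUtterances, [1800, 2300, 3500, 5000])
```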

adonahue commented 4 years ago

@jaedoucette - do you expect any possible problems with the accuracy of the metrics if we implement the 2.3s change? Do we have a way to test / measure if the only things that change are the ones we expected?

jaedoucette commented 4 years ago

@adonahue I think this should improve the accuracy of our metrics somewhat. At this point, the only good way we have to assess whether that is true is to try it out and see whether (subjectively) this feels better to us.

I would like to have a better system than that, but we would really need a corpus of human-annotated conversations to be sure. @ebporter maybe making such a corpus could be an item for our second grant application. It would not be hugely expensive, just time-consuming. We'd probably want to hire some professional annotators (graduate students in sociology or anthropology are usually good sources for something like this, and they're cheap-ish). We could also use MTurk perhaps.

adonahue commented 4 years ago

Accepted. The recommended work is in story #259.