fedor57 opened this issue 6 years ago
Just to let you know: I was once involved, at one of the search giants, in computing a kind of freshness PageRank over a constantly changing web graph. The algorithm accumulated weight diffs and distributed them to peers once the accumulated weight exceeded some threshold. There were also heuristics to intensify processing near new nodes with big weights.
Regarding convergence in the incremental scenario: perhaps we can back up values from previous steps and, whenever a value changes significantly, flag the peers of that worker / task as "include in the next partial iteration". Then run a few partial iterations, with a full one every 5 partial ones. I believe such a technique could produce a VERY fast Dawid-Skene implementation ;) especially for the incremental scenario.
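Roughly, the scheduling part could look like this (a sketch only, not this project's code: `posteriors` maps question ids to their current class-posterior vectors, and `update_questions` is a stand-in for one EM step of the aggregator restricted to a subset):

```python
import numpy as np
from collections import defaultdict

def schedule_passes(posteriors, responses, update_questions,
                    n_iters=20, threshold=0.01, full_every=5):
    """posteriors: dict question_id -> np.array of class probabilities.
    responses:  dict question_id -> list of (worker_id, label).
    update_questions(active, posteriors): one EM step over `active` (stand-in)."""
    worker_to_questions = defaultdict(set)
    for q, answers in responses.items():
        for w, _ in answers:
            worker_to_questions[w].add(q)

    active = set(responses)                      # the first pass is a full one
    for it in range(n_iters):
        if it % full_every == 0:
            active = set(responses)              # periodic full pass
        backup = {q: posteriors[q].copy() for q in active}
        update_questions(active, posteriors)     # EM step on the active subset

        # flag questions whose posterior moved a lot, plus the other
        # questions answered by the same workers (the "peers")
        next_active = set()
        for q in active:
            if np.abs(posteriors[q] - backup[q]).max() > threshold:
                next_active.add(q)
                for w, _ in responses[q]:
                    next_active |= worker_to_questions[w]
        active = next_active
        if not active:                           # settled between full passes
            break
```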
Hi,
One way to achieve the first two points would be to use an online algorithm.
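For example (just a sketch, not this repository's API; all names below are illustrative): keep per-worker confusion-count matrices and class-prior counts, and fold each newly arriving question's judgements in with a single E-step against the current estimates.

```python
import numpy as np
from collections import defaultdict

K = 3                                                      # number of label classes
prior_counts = np.ones(K)                                  # smoothed class-prior counts
confusion = defaultdict(lambda: np.ones((K, K)))           # worker -> K x K counts

def online_update(judgments):
    """judgments: list of (worker_id, observed_label) for one new question."""
    log_post = np.log(prior_counts / prior_counts.sum())
    for w, obs in judgments:
        pi = confusion[w] / confusion[w].sum(axis=1, keepdims=True)
        log_post += np.log(pi[:, obs])                     # P(obs | each true class)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    prior_counts[:] += post                                # fold soft counts back in
    for w, obs in judgments:
        confusion[w][:, obs] += post
    return post                                            # posterior over true classes

# example: three workers label one new question
print(online_update([("w1", 0), ("w2", 0), ("w3", 1)]))
```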
To see the confidence that each label is correct for the i-th question, you can inspect `question_classes[i, :]` before this line.
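For instance, assuming `question_classes` is the (n_questions, n_classes) posterior matrix computed inside the EM loop (toy values below, not real output):

```python
import numpy as np

# Toy stand-in for the posterior matrix produced by the EM loop.
question_classes = np.array([
    [0.95, 0.03, 0.02],   # question 0: a confident label
    [0.40, 0.35, 0.25],   # question 1: ambiguous, a candidate for extra marks
])

i = 1
print(question_classes[i, :])              # confidence that each label is correct
chosen = question_classes[i].argmax()      # the label the aggregator would pick
print(chosen, question_classes[i, chosen]) # its confidence, usable as a threshold
```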
Hi, I was actually able to use the aggregator, thank you very much!
It squeezed 1.8M responses down to 500K labels in 4.5 hours on 1 thread on a server, using 35 GB of memory ;) I think we can incorporate the solution, but I need to implement some enhancements to make it more useful in a production scenario. I will share my thoughts here just to let you know what we think would be useful in our real situation:
- For the MLtoRank scenario we need to constantly receive new labels and aggregate new judgements. A 5-hour delay for adding an extra 1000 labels may be too long and too much electricity to burn, so we need to learn how to perform an incremental step. That may include the ability to back up all distributions and other state, prefill new cells with defaults, and perform 1-2 extra iterations (see the warm-start sketch after this list).
- The previous point could be enhanced by implementing "partial" steps that update only the rows empirically close to the changed ones. After one partial step we can perform one full step to settle things down if needed.
- We have a way to order 3 extra marks if the first 3 do not give a confident label, so we will need the confidence level of the chosen label to be output, to decide whether to place that extra order.
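Roughly what I have in mind for the warm start (a sketch under the assumption that the aggregator's state, meaning the per-worker error matrices, the class prior and the per-question posteriors, can be saved and restored; `em_iteration` is a stand-in, not an existing function of this project):

```python
import pickle
import numpy as np

def save_state(path, error_rates, class_prior, question_classes):
    with open(path, "wb") as f:
        pickle.dump((error_rates, class_prior, question_classes), f)

def warm_start(path, new_responses, n_classes, em_iteration, extra_iters=2):
    """new_responses: dict question_id -> list of (worker_id, label)."""
    with open(path, "rb") as f:
        error_rates, class_prior, question_classes = pickle.load(f)

    # prefill posteriors of unseen questions with smoothed vote counts
    for q, answers in new_responses.items():
        if q not in question_classes:          # question_classes: id -> posterior
            votes = np.ones(n_classes)
            for _, label in answers:
                votes[label] += 1
            question_classes[q] = votes / votes.sum()

    # 1-2 iterations are usually enough to settle the new rows when the
    # rest of the graph has barely moved
    for _ in range(extra_iters):
        em_iteration(error_rates, class_prior, question_classes, new_responses)

    return error_rates, class_prior, question_classes
```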
And one extra off-topic note:
I would be happy to hear any thoughts regarding this, thank you!