Purpose
While testing redeploying nodes at different times, I found that when a large number (>10) of triple generation protocols are ongoing, nodes can enter an infinite loop: time out the protocol for a triple ID -> receive a message to generate the same triple from another node that has not yet timed out its own protocol -> rejoin the protocol for that triple -> time out again. This happens because each node joins a given triple generation protocol at a different time, and a node cannot time out another node's protocol.
This PR adds a cache of failed triples so that a node does not retry any triple that has already failed (timeouts included). Once the other nodes stop receiving responses from this node for that triple's protocol, they time it out as well, which ends the loop for that triple. Nodes can then move on to generating other triples instead of tying up resources on triples that keep failing.
What's changed in code
1) Add failed_triples (triple_id -> timestamp) to TripleManager.
2) When a triple generation protocol fails, add the triple_id to failed_triples together with the time at which we detected the failure.
3) When we receive a message to generate a triple whose ID is in failed_triples, skip the message and update that ID's timestamp in failed_triples.
4) clear_failed_triples() runs at the end of run() and removes every entry whose timestamp (time of failure or of the last skipped message) is more than 2 hours old.
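The four steps above can be sketched roughly as follows. This is an illustrative Python model and not the PR's actual code: only failed_triples, TripleManager and clear_failed_triples() come from the description, while mark_failed, on_message, generators, FAILED_TRIPLE_TTL_SECS and the injectable clock are hypothetical names introduced for the example.

```python
import time
from typing import Callable, Dict, Set

# 2 hours, as stated in the description; the constant name is an assumption.
FAILED_TRIPLE_TTL_SECS = 2 * 60 * 60


class TripleManager:
    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        # Injectable clock so the expiry logic can be exercised without waiting.
        self._now = now
        # Step 1: triple_id -> time of failure or of the last skipped message.
        self.failed_triples: Dict[int, float] = {}
        # Hypothetical stand-in for the set of ongoing protocols.
        self.generators: Set[int] = set()

    def mark_failed(self, triple_id: int) -> None:
        # Step 2: record the failure (including a timeout) and drop the protocol.
        self.generators.discard(triple_id)
        self.failed_triples[triple_id] = self._now()

    def on_message(self, triple_id: int) -> bool:
        # Step 3: skip messages for failed triples and refresh their timestamp.
        if triple_id in self.failed_triples:
            self.failed_triples[triple_id] = self._now()
            return False
        self.generators.add(triple_id)
        return True

    def clear_failed_triples(self) -> None:
        # Step 4: keep only entries that are at most 2 hours old.
        cutoff = self._now() - FAILED_TRIPLE_TTL_SECS
        self.failed_triples = {
            tid: ts for tid, ts in self.failed_triples.items() if ts >= cutoff
        }
```

Refreshing the timestamp in step 3 means an entry stays cached for as long as some peer keeps asking for that triple, so it only expires 2 hours after the last such message.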