"Existing debugging mechanisms provide light-weight instrumentation which can track execution
flow in the application by instrumenting important points in the application code. These are followed
by inference based mechanisms to find the root-cause of the problem. While such techniques
are useful in getting a clue about the bug, they are limited in their ability to discover the root-cause.
Another body of work uses record-and-replay infrastructures, which record the execution and
then replay the execution offline. While record and replay infrastructures generate a high fidelity
representative framework for bug diagnosis, they suffer a heavy overhead generally not acceptable in
user-facing production systems."
I don't know what you have in mind for the inference-based mechanisms, or why they would be "limited" in their ability to discover the root cause. In any case, no clue is given about how the work removes or avoids this "limited" issue; I hope this is elaborated in the intro. The following paragraph of the abstract addresses the overhead problem with respect to record/replay, but it's unclear why minimal impact would be better than or different from light-weight, and, most significantly, it does not address the "limited" problem.
"Parikshan is driven by a live-cloning process, which generates a replica
(debug container) of production services for debugging or testing, cloned from a production container
which provides the real output to the user. The debug container provides a sandbox environment, for
safe execution of test-cases/debugging done by the users without any perturbation to the execution
environment."
Confusion here as to who is the user. If the real output to users comes from the production container, how do the users see the test cases/debugging?
"As a part of this framework, we have designed customized-network proxy agents, which
replicate inputs from clients to both the production and test-container, as well as safely discard all outputs
from the test-container."
Do you truly discard? It would be better to hide from end-users, but retain for comparison to production outputs and also for offline analysis, perhaps to help in debugging the debugger.
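
To make that suggestion concrete, the sketch below shows one hypothetical shape such a proxy agent could take: it forwards each client request to both containers, returns only the production reply to the client, and retains the debug container's reply in a comparison log instead of discarding it. All names here (addresses, ports, the log file, the one-shot request/response handling) are illustrative assumptions, not details taken from Parikshan.

```python
# Hypothetical sketch of a duplicating proxy agent (one request, one response per connection).
# PROD_ADDR / DEBUG_ADDR and the divergence log are illustrative, not from the thesis.
import socket
import threading

LISTEN_ADDR = ("0.0.0.0", 8080)        # clients connect here
PROD_ADDR   = ("prod-container", 80)   # real service; its reply goes back to the client
DEBUG_ADDR  = ("debug-container", 80)  # cloned service; its reply is only logged

def forward(addr, request):
    """Send the request to one backend and return whatever it replies (read to EOF)."""
    with socket.create_connection(addr, timeout=5) as s:
        s.sendall(request)
        s.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)

def handle(client):
    with client:
        request = client.recv(65536)       # simplified: a single read per request
        prod_reply = forward(PROD_ADDR, request)
        client.sendall(prod_reply)         # only production output ever reaches the user
        try:
            debug_reply = forward(DEBUG_ADDR, request)
        except OSError:
            debug_reply = b"<debug container unreachable>"
        # Retain (rather than discard) the debug output for offline comparison.
        with open("debug_divergence.log", "ab") as log:
            if debug_reply == prod_reply:
                log.write(b"MATCH\n")
            else:
                log.write(b"DIVERGED\n" + debug_reply + b"\n")

def main():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(LISTEN_ADDR)
    srv.listen()
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    main()
```

Even a crude divergence log like this would let you compare the two containers' outputs offline, which both strengthens the debugging story and helps sanity-check the debug container itself.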
"This thesis also looks into light-weight instrumentation techniques, which
can complement our live debugging environment."
But you just said that the inference-based techniques already have light-weight instrumentation.
"Additionally, we will demonstrate a statistical
debugging mechanism that can be applied in the debug-container to gain insight and localize the
error in real-time."
How is statistical debugging better/different from inferencing? Inferencing is often based on statistics.
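
For what it's worth, "statistical debugging" in the literature usually means something like spectrum-based suspiciousness ranking over many runs. The sketch below (Tarantula-style scoring over per-run statement coverage) is only a hypothetical illustration of that flavor, not the mechanism the thesis actually uses; if this is what is intended, the abstract should say how it differs from the earlier inference-based work.

```python
# Hypothetical sketch: Tarantula-style suspiciousness scoring over per-run statement coverage.
# The coverage sets and statement ids below are toy data for illustration only.

def suspiciousness(covered_by_failed, covered_by_passed, total_failed, total_passed):
    """Statements executed mostly by failing runs receive the highest score."""
    fail_ratio = covered_by_failed / total_failed if total_failed else 0.0
    pass_ratio = covered_by_passed / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def rank_statements(passing_runs, failing_runs):
    """passing_runs / failing_runs: lists of sets of statement ids covered by each run."""
    statements = set().union(*passing_runs, *failing_runs)
    scores = {}
    for stmt in statements:
        cf = sum(stmt in run for run in failing_runs)
        cp = sum(stmt in run for run in passing_runs)
        scores[stmt] = suspiciousness(cf, cp, len(failing_runs), len(passing_runs))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Toy example: statement "parse:42" is covered only by failing runs, so it ranks first.
    passing = [{"main:10", "parse:40"}, {"main:10", "parse:41"}]
    failing = [{"main:10", "parse:40", "parse:42"}, {"main:10", "parse:42"}]
    for stmt, score in rank_statements(passing, failing):
        print(f"{score:.2f}  {stmt}")
```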
"As a part of this description, we will also show
case-studies demonstrating how network replay is enough for triggering most bugs in real-world
application."
Seems this should be emphasized more, since the main result of this thesis may be that high-fidelity record/replay is unnecessary overkill for many real-world bugs. However, you should characterize briefly here what kinds of bugs would need more complete recording, not just network inputs.
"Secondly, we will present iProbe a new type of instrumentation framework, which uses a
combination of static and dynamic instrumentation, to have an order-of-magnitude better performance
than existing instrumentation techniques. While live debugging does not put any performance impact
on the production, it is still important to have the debug-container as much in sync with the production
container as possible. iProbe makes applications live debugging friendly, and provides an easy
way for the debuggers to apply probes in the debugging sandboxed environment."
A sentence or two about iProbe's secret sauce (key insight) is warranted here: what makes it so much better, and why is sync important?
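
As a purely illustrative strawman (not a claim about iProbe's actual design): hybrid schemes of this kind typically bake cheap, disabled probe points into the code ahead of time and then activate individual probes dynamically at runtime, so the common case pays only a trivial check. The sketch below shows that general idea; every name in it is hypothetical.

```python
# Hypothetical illustration of hybrid static/dynamic instrumentation:
# probe points are compiled in ahead of time (a cheap lookup while disabled)
# and individual probes are switched on dynamically while the process runs.
# This sketches the general technique only, not iProbe's implementation.

_probes = {}  # probe name -> callback, populated at runtime

def probe_point(name):
    """'Static' half: decorate functions of interest when the application is built."""
    def decorate(fn):
        def wrapper(*args, **kwargs):
            cb = _probes.get(name)       # near-zero cost while no probe is attached
            if cb:
                cb(name, args, kwargs)   # 'dynamic' half: runs only once activated
            return fn(*args, **kwargs)
        return wrapper
    return decorate

def activate_probe(name, callback):
    """Attach a probe to an existing probe point without restarting the service."""
    _probes[name] = callback

def deactivate_probe(name):
    _probes.pop(name, None)

@probe_point("handle_request")
def handle_request(payload):
    return payload.upper()

if __name__ == "__main__":
    handle_request("quiet")                      # probe disabled: no tracing overhead beyond the lookup
    activate_probe("handle_request",
                   lambda n, a, kw: print(f"probe {n}: args={a}"))
    handle_request("now traced")                 # probe enabled: the call is logged
```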
"Existing debugging mechanisms provide light-weight instrumentation which can track execution flow in the application by instrumenting important points in the application code. These are followed by inference based mechanisms to find the root-cause of the problem. While such techniques are useful in getting a clue about the bug, they are limited in their ability to discover the rootcause. Another body of work uses record-and-replay infrastructures, which record the execution and then replay the execution offline. While record and replay infrastructures generate a high fidelity representative framework for bug diagnosis, they suffer a heavy overhead generally not acceptable in user-facing production systems."
I don't know what you have in mind for the inference based mechanisms, or why they would be "limited" in ability to discover root cause. In any case, no clue is given about how the work removes/avoids this "limited" issue. I hope this is elaborated in the intro. The following paragraph of the abstract addresses the overhead problem wrt record/replay, but its unclear why minimal impact would be better/different than light-weight, and most significantly does not address the "limited" problem.
"Parikshan is driven by a live-cloning process, which generates a replica (debug container) of production services for debugging or testing, cloned from a production container which provides the real output to the user. The debug container provides a sandbox environment, for safe execution of test-cases/debugging done by the users without any perturbation to the execution environment."
Confusion here as to who is the user. If the real output to users comes from the production container, how do the users see the test cases/debugging?
"As a part of this framework, we have designed customized-network proxy agents, which replicate inputs from clients to both the production and test-container, as well safely discard all outputs from the test-container."
Do you truly discard? It would be better to hide from end-users, but retain for comparison to production outputs and also for offline analysis, perhaps to help in debugging the debugger.
"This thesis also looks into light-weight instrumentation techniques, which can complement our live debugging environment."
But you just said that the inference based techniques already have light-weight instrumentation.
"Additionally, we will demonstrate a statistical debugging mechanism that can be applied in the debug-container to gain insight and localize the error in real-time."
How is statistical debugging better/different from inferencing? Inferencing is often based on statistics.
"As a part of this description, we will also show case-studies demonstrating how network replay is enough for triggering most bugs in real-world application."
Seems this should be emphasized more, since the main result of this thesis may be that high fidelity record/replay is unnecessary overkill for many real-world bugs. However, should characterize briefly here what kinds of bugs would need more complete recording not just network inputs.
"Secondly, we will present iProbe a new type of instrumentation framework, which uses a combination of static and dynamic instrumentation, to have an order-of-magnitude better performance than existing instrumentation techniques. While live debugging does not put any performance impact on the production, it is still important to have the debug-container as much in sync with the production container as possible. iProbe makes applications live debugging friendly, and provides an easy way for the debuggers to apply probes in the debugging sandboxed environment."
A sentence or two about iProbe's secret sauce (key insight) is warranted here, what makes it so much better and why is sync important?