more comments - cover page and abstract

gailkaiser commented 7 years ago

year on cover page and copyright say 2016

wrt abstract: Software debugging is often understood as fixing the bug (modify and test) as well as localizing. The faster debug cycle is useless if all you do is find the bug but do nothing about fixing it. Issuing a patch and installing a patch are arguably both parts of debugging as well, although not all bug fixes result in patches, some may not be released until the next regular version release - but this is not what you're addressing, otherwise the fast debug time wouldn't matter all that much. The abstract should state which parts of this full debug cycle are considered in this thesis and which are outside of scope.

"Existing debugging mechanisms provide light-weight instrumentation which can track execution flow in the application by instrumenting important points in the application code. These are followed by inference based mechanisms to find the root-cause of the problem." Not all debugging mechanisms match this description.

"While such techniques are useful in getting a clue about the bug, they are limited in their ability to discover the root-cause (can point out the module or component which is faulty, but cannot determine the root-cause at code, function level granularity). " Hmm... you just said inferencing is used to find the root cause but then the next sentence says but they don't find the root cause.

"Another body of work uses record-and-replay infrastructures, which record the execution and then replay the execution offline. These tools generate a high fidelity representative execution for offline bug diagnosis, at the cost of a relatively heavy overhead, which is generally not acceptable in user-facing production systems." This is over-stated. There are record/replay mechanisms claiming very low overheads, particularly for interactive systems where user think/idle time is much higher than the recording overhead.

"Therefore, to meet the demands of a low-latency distributed computing environment of modern service oriented systems, it is important to have debugging tools which have minimal to negligible impact on the application, and can provide a fast update to the operator to allow for shorter time to debug." You could say ideally there would be literally zero overhead, but that probably isn't feasible for bugs that reach deployment - bugs found and fixed before deployment indeed do have zero overhead on production. Arguably all bugs could be found and fixed in advance, but there's a tradeoff in time to market vs. return on investment in pre-testing.

"Having a shorter debug cycles and quicker patches is essential to ensure application quality, and reliability" -> "Having shorter", no "a". Run a grammar check.

" Secondly, live debugging should not impact user-facing performance for non bug triggering events. In large distributed applications, bugs which impact only a small percentage of users are common. In such scenarios, debugging a small part of the application should not impact the entire system." Well, ideally live debugging does not impact performance unless there is a bug, but even just monitoring for bugs is likely to have some impact even when no bug has been found yet. Also, there is not necessarily a correlation between bugs impacting only a small percentage of users and bugs involving just a small part of the code. For example, there could be some error that ranges widely throughout the code but affects only those users running on the rare blahblah configuration.

"With the above stated goals in mind, we have designed a framework called Parikshan 1 , which leverages user-space containers (OpenVZ/ LXC) to launch application instances for the express purpose of live debugging" Adding containers inherently affects production, if the production system also uses containers not just the debug version. Should phrase in terms of containers becoming popular for the production system for this long list of reasons that are orthogonal to debugging.

"Parikshan is driven by a live-cloning process, which generates a replica (debug container) of production services for debugging or testing, cloned from a production container which provides the real output to the user. The debug container provides a sandbox environment, for safe execution of test-cases/debugging done by the users without any perturbation to the execution environment" You're using the term user to refer both to the end user and to the user of the debugging system, i.e., developers, this is confusing.

" As a part of this framework, we have designed customized-network proxy agents, which replicate inputs from clients to both the production and test-container, as well safely discard all outputs from the test-container." Why is it called the debug container and then suddenly the test container? It is not necessarily obvious that these are the same.
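
To make the quoted proxy behaviour concrete, here is a minimal in-process sketch of the idea as the abstract describes it: each client input goes to both containers, and only the production output is returned. This is not Parikshan's implementation. `DuplicatingProxy`, `Handler`, and `forwarded_to_debug` are hypothetical names, the containers are modelled as plain Python callables, and forwarding is synchronous for simplicity, whereas a real proxy would presumably avoid blocking production on the debug copy.

```python
from typing import Callable

# Hypothetical stand-in for a container endpoint: takes a request, returns a response.
Handler = Callable[[bytes], bytes]


class DuplicatingProxy:
    """Illustrative sketch only: replicate inputs, discard debug outputs."""

    def __init__(self, production: Handler, debug: Handler) -> None:
        self.production = production
        self.debug = debug
        self.forwarded_to_debug = 0  # number of inputs replicated to the debug side

    def handle(self, request: bytes) -> bytes:
        # The client only ever sees the production response.
        response = self.production(request)

        # The same input is replayed against the debug container.
        self.forwarded_to_debug += 1
        try:
            self.debug(request)  # return value is deliberately ignored
        except Exception:
            # A crash in the debug container must not reach the client.
            pass

        return response


if __name__ == "__main__":
    def production(req: bytes) -> bytes:
        return b"prod:" + req

    def debug(req: bytes) -> bytes:
        raise RuntimeError("bug triggered in debug container")

    proxy = DuplicatingProxy(production, debug)
    print(proxy.handle(b"GET /"))
```

In this sketch a failure on the debug side is swallowed, so the client-visible behaviour depends only on the production handler.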

"We believe that this piece of work provides the first of it’s kind practical real-time debugging of large multi-tier and cloud applications, without requiring any application down-time, and minimal performance impact." There is some downtime during cloning. Is there previous impractical real-time debugging? What is "minimal"? That implies some kind of optimization that I don't think can be proved. What do real people do now if there's no practical real-time debugging?

"The principal hypothesis of this dissertation is that, for large-scale service-oriented-applications (SOA) it is possible to provide a live debugging environment, which allows the developer to debug the target application without impacting the production system. " The SoA target should be explained earlier, before mentioning containers at all, and also before network proxy or anything else that assumes SoA.

"As a part of this description, we will also show case-studies demonstrating how network replay is enough for triggering most bugs in real-world applications." What you're actually doing is introducing network replay as the solution to the "how do I do live debugging" problem. This should be directly contrasted with record/replay, not many paragraphs later.

"To show this, we have presented 16 real-world bugs, which were triggered using our network duplication techniques. Additionally, we present a survey of 217 bugs from bug reports of SOA applications which were found to be similar to the 16 mentioned above." The 16 versus the 217 is weird. Are these 16 cherrypicked bugs or representative bugs? Why only 16? Are there only 217 similar bugs out of millions of bugs?

"Secondly, we will present iProbe a new type of instrumentation framework, which uses a combination of static and dynamic instrumentation, to have an order-of-magnitude better performance than existing instrumentation techniques" Sudden switch of topic. This needs a lead-in: the debug containers in your solution are instrumented, and without your network replay solution the production containers would also have to be instrumented. You developed iProbe before Parikshan, so initially the production containers were indeed instrumented, and low performance impact was critical. But also explain why we don't want arbitrarily slow instrumentation even in the debug containers (debugging window and sync issues?). Also, what is static vs. dynamic about it?

"or can be used in our debug container with Parikshan to assist the administrator in debugging." The administrator suddenly comes out of nowhere, with no previous mention.

"Lastly, while Parikshan is a platform to quickly attack bugs, in itself it’s a debugging platform. For the last section of this dissertation we look at how various existing debugging techniques can be adapted to live debugging, making them more effective. We first enumerate scenarios in which debugging can take place: post-facto - turning livedebugging on after a bug has occured, proactive - having debugging on before a bug has happened. We will then discuss how existing debugging tools and strategies can be applied in the debug container to be more efficient and effective. We will also discuss potential new ways that existing debugging mechanisms can be modified to fit in the live debugging domain." This is confusing. What does it mean to quickly attack bugs, and why is that a different concept than debugging? You need to distinguish between debugging mechanisms and the debugging platform from the get-go, where previously the debugging platform was an instrumented version of the production system run offline (or something like that) and the mechanisms were applied there. In the first few paragraphs, it sounds like Parikshan is an alternative to debugging mechanisms, not something that complements them or provides a better place to use them.

I know you can't put the entire thesis in the abstract, but some reorganization may help lead the reader by the hand from one concept to the next, and then the rest of this material goes in the intro.

nipunarora commented 7 years ago

year on cover page and copyright say 2016

fixed

wrt abstract: Software debugging is often understood as fixing the bug (modify and test) as well as localizing. The faster debug cycle is useless if all you do is find the bug but do nothing about fixing it. Issuing a patch and installing a patch are arguably both parts of debugging as well, although not all bug fixes result in patches, some may not be released until the next regular version release - but this is not what you're addressing, otherwise the fast debug time wouldn't matter all that much. The abstract should state which parts of this full debug cycle are considered in this thesis and which are outside of scope.

OK, I will try to mention in the abstract the specific parts of the debug cycle that we focus on: localizing/finding the bug. Additionally, I would argue that finding the bug without actually fixing it is still useful. Several SOA systems require immediate debugging, and most of this process involves the developer trying to locate the bug offline. While we may not help in patch testing, we definitely reduce the time needed to create a patch by localizing the error, so the two are connected. I agree that for enterprise systems, where bug fixes/patches are done in every release cycle, this may not be the case, and fast bug resolution does not matter as much. I will try to clarify the two scenarios.

"Existing debugging mechanisms provide light-weight instrumentation which can track execution flow in the application by instrumenting important points in the application code. These are followed by inference based mechanisms to find the root-cause of the problem." Not all debugging mechanisms match this description. "While such techniques are useful in getting a clue about the bug, they are limited in their ability to discover the root-cause (can point out the module or component which is faulty, but cannot determine the root-cause at code, function level granularity). " Hmm... you just said inferencing is used to find the root cause but then the next sentence says but they don't find the root cause.

Essentially they can fix it to an extent. Edited the above by making the following change: "Existing debugging mechanisms provide light-weight instrumentation which can track execution flow in the application by instrumenting important points in the application code. These are followed by inference based mechanisms to localize the error. While such techniques are useful in getting a clue about the bug, they are limited in their ability to discover the root-cause (e.g. can point out the module or component which is faulty, but cannot determine the root-cause at code, function-level granularity)."

nipunarora commented 7 years ago

"Having a shorter debug cycles and quicker patches is essential to ensure application quality, and reliability" -> "Having shorter", no "a". Run a grammar check.

fixed

"With the above stated goals in mind, we have designed a framework called Parikshan 1 , which leverages user-space containers (OpenVZ/ LXC) to launch application instances for the express purpose of live debugging" Adding containers inherently affects production, if the production system also uses containers not just the debug version. Should phrase in terms of containers becoming popular for the production system for this long list of reasons that are orthogonal to debugging.

The growing popularity of containers in production systems is already stated in the introduction. Further, in the explanation of Parikshan's overhead, I have also stated that containers have performance comparable to "native" execution, based on several papers.

"Parikshan is driven by a live-cloning process, which generates a replica (debug container) of production services for debugging or testing, cloned from a production container which provides the real output to the user. The debug container provides a sandbox environment, for safe execution of test-cases/debugging done by the users without any perturbation to the execution environment" You're using the term user to refer both to the end user and to the user of the debugging system, i.e., developers, this is confusing.

Yes, I think I need to clarify this and keep it consistent. I'll see if there can be something in the introduction which specifies roles, and which can be re-used in the thesis. I have explained somewhere that I use the terms debugger and user interchangeably, but I'll make this consistent, and will avoid using "administrator".

" As a part of this framework, we have designed customized-network proxy agents, which replicate inputs from clients to both the production and test-container, as well safely discard all outputs from the test-container." Why called debug container and then suddenly test container, not necessarily obvious these are the same.

Fixed, calling it debug-container to be consistent

"The principal hypothesis of this dissertation is that, for large-scale service-oriented-applications (SOA) it is possible to provide a live debugging environment, which allows the developer to debug the target application without impacting the production system. " The SoA target should be explained earlier, before mentioning containers at all, and also before network proxy or anything else that assumes SoA.

Fixed

more comments - cover page and abstract #9