timm commented 7 years ago

https://github.com/researchart/swan17/blob/master/pdf/Eiram.pdf

Predicting Rankings of Software Verification Competition

timm commented 7 years ago

AUTHORS: Important. Do NOT reply till all three reviews are here (until then, we will delete your comments)_.

Reviewer1

swan17-bestreviewer

Recommendation (select one)

[ ] Accept
[x] Accept if (see below)
[ ] Reject

Summary (1 para)

The paper uses a newly defined abstract representation of programs to train predictors for choosing the "right" verification tool. They are able to outperform previous predictors.

Advocacy (accept since, reject since, 1 para)

While my impression is that the usage of formal verification tools in practice is still very limited, such an approach, if done in an accessible way for practitioners, might be a way to improve that. The source code representation and learning seems to be carefully done. The predictors are able to outperform previously proposed predictors.

List of "Pros"

Nice representation of programs combining control flow, dependence and abstract syntax trees.
Outperforms previous predictors

List of "Cons"

A clear discussion of how that could be used in practice is missing. It is hinted at a practitioners could use it to choose a verification tool. I think this needs more discussion.
The clone detection community has long put a lot of effort in finding representations for similar programs. This is not discussed.

Changes needed before I can recommend accept (if any)

Discussion of how this could actually be used in practice. Add related work from clone detection community.

timm commented 7 years ago

_AUTHORS: Important. Do NOT reply till all three reviews are here (until then, we will delete your comments)_.

Reviewer2

Insert reviewer github id here ==> anonymousReviewerX

Recommendation (select one)

[ ] Accept
[x] Accept if (see below)
[ ] Reject

Summary (1 para)

This paper proposes a method for predicting the rankings of tools on programs, attempting to answer the question: "What software verification tool [should] I use for showing correctness of my program?"

Advocacy (accept since, reject since, 1 para)

In general, the paper is well-written and ventures into an interesting area of research that is on topic with the workshop.

List of "Pros"

Well-written manuscript
Interesting (underexplored?) area of software analytics

List of "Cons"

Highly technical approach needs to be explained more clearly. It was not clear to me exactly how the classifiers were being trained. What are the metrics that are being used to train it?
Take-away message is a little unclear. What should researchers/practitioners do now that we understand your results?

Changes needed before I can recommend accept (if any)

Consider the feedback above.

Minor concerns

Do not treat citation markers as nouns in your sentences (e.g., "the one of [3] as well as the default predictor on all tasks.").

timm commented 7 years ago

_AUTHORS: Important. Do NOT reply till all three reviews are here (until then, we will delete your comments)_.

Reviewer3

Insert reviewer github id here ==> swan17reviewer1

Recommendation (select one)

[X] Accept
[ ] Accept if (see below)
[ ] Reject

Summary (1 para)

The paper presents a kernel-based prediction for software verification tasks. The focus of the paper is on software verification competitions. The method is based on the label ranking algorithm. The kernels are based on different elements of the source code including control flow, data flow, CFG and program dependence graphs. The results were compared with a random predictor and a prior related work and outperforms them both giving an accuracy of approx 68% across three data sets.

Advocacy (accept since, reject since, 1 para)

I advocate for 'accept' because the paper describes a nice problem and presents a small experiment evaluating the approach. It will provide for good discussion and more feedback during the conference.

List of "Pros"

kernel-based approach for verification
comparison of results to prior work
running example

List of "Cons"

stakeholder is not clearly mentioned
scalability is not addressed. But this might depend on my previous point on who and what the focus of this prediction is for.
no information on data sets provided.

Changes needed before I can recommend accept (if any)

If the points below are explained in the paper even if just briefly, it would make the paper more accessible. I understand the 4 page limit.

Table 1 can be eliminated and the details of the last three can be put into the caption of figure 2. The rest are self explanatory.
The idea presented in the paper is reasonable however I was a little unsure the target audience i.e. stakeholders that will use this approach. The title suggests that this is specific for verification competitions. In this case, does one really need a verifier to determine performance beforehand? In contrast, the abstract talks about building portfolio solvers or just chooses a verification tool for a program, implying this is not for verification competitions. Please provide some clarification on this.
make clear if who the target audience of the tool will be: competitions, as suggested by the title or general practitioners/researchers or both.
Could you explain just briefly what the generic features are (page 2, line 34)? What would the generic features be for the program in Figure 1?
What were the useful features chosen for each of your experiment data sets? Were they different for the three data sets? What distinguishing patterns did you find? I would expect at least a small description on these in your results. It seems to abruptly end.
put the numbers instead of "rather high predictive accuracy" in your abstract. I really would like to know up front what that means.

timm commented 7 years ago

Authors? Discuss?

FYI- I would say that good responses to the above could lead to acceptance.

Note that reviewer3 has given you an "out" for doing extensive revisions ("I understand the 4 page limit.").

Eiram commented 7 years ago

First of all, we would like to thank the reviewers for their helpful comments. Moreover, we apologize for the late reply (we have been/still are on travel).

@swan17-bestreviewer (Reviewer 1)

We agree that the title and abstract, putting much emphasis on software verification competitions, are slightly misleading. Actually, the main interest in the competitions is due to the data they provide for evaluating our method. In practice, our method can of course be used in a more general way, and for predicting the usefulness of verification tools also outside competitions (e.g., developers or researchers can use it to choose a verification tool for a particular program or to build a portfolio solver). We'll make this more clear in the introduction of the paper, and also change the title to "Predicting Rankings of Software Verification Tools".
Thank you for the hint to the clone detection community. It is interesting to see that this community also uses kernel approaches to describe similarities between (sub)programs, and we will point out this connection in the paper. Yet, the kernels differ from ours and, moreover, are used to solve a different problem.

@anonymousReviewerX (Reviewer 2)

We will try to improve the presentation of technical details as far as possible within the space restrictions. Especially, we'll try to become more precise when describing the technical aspects of training the classifiers.
To overcome the problem with the "take away message", we plan to change the title to "Predicting Rankings of Software Verification Tools" and adapt the abstract and introduction to describe the use case of a predictor for rankings of software verification tools more clearly. We also plan to add another sentence in the conclusion to address this concern. Indeed, with a focus on software verification competitions, our current title and abstract are slightly misleading. Actually, the main interest in the competitions is due to the data they provide for evaluating our method. In practice, our method can be used in a more general way, and for predicting the usefulness of verification tools also outside competitions (e.g., developers or researchers can use it to choose a verification tool for a particular program or to build a portfolio solver).
Finally, we already changed the text s.t. citation markers are no longer used as nouns.

@swan17reviewer1 (Reviewer 3)

We will implement your suggestion and remove Tab. 1.
Meanwhile, we also noticed the problem with our title and understand the confusion w.r.t. the target audience. To overcome these problem, we will change the title to "Predicting Rankings of Software Verification Tools". Additionally, we plan to adapt our presentation accordingly. Indeed, with a focus on software verification competitions, our current title and abstract are slightly misleading. Actually, the main interest in the competitions is due to the data they provide for evaluating our method. In practice, our method can of course be used in a more general way, and for predicting the usefulness of verification tools also outside competitions (e.g., developers or researchers can use it to choose a verification tool for a particular program or to build a portfolio solver).
On an abstract level, generic features are substructures of a certain size. For example, if data objects are graphs, a feature may correspond to a subgraph, and the value of the feature is 1 if the subgraph is present and 0 otherwise. In our case, generic features are substructures of a certain depth that occur in our graph representation of the program. Amongst others, program Psum contains two integer declarations which are preceded by a declaration and two assignments which assign a small integer number to a variable and are preceded by an assignment. We will clarify this in the paper.
The interpretation of kernel-based models is not trivial, and extracting the influence of individual features not always straightforward. For the approach presented in the paper, we cannot easily provide direct information on the usefulness of features or distinguishing patterns. In future work, we will use other machine learning techniques that are more suitable in this regard.
We will make the aspect of "rather high predictive accuracy" more precise in the abstract, and will put a concrete accuracy value in brackets.

researchart / swan17

Eiram: Predicting Rankings of Software Verification Competition #1

Reviewer1

Recommendation (select one)

Summary (1 para)

Advocacy (accept since, reject since, 1 para)

List of "Pros"

List of "Cons"

Changes needed before I can recommend accept (if any)

Reviewer2

Recommendation (select one)

Summary (1 para)

Advocacy (accept since, reject since, 1 para)

List of "Pros"

List of "Cons"

Changes needed before I can recommend accept (if any)

Minor concerns

Reviewer3

Recommendation (select one)

Summary (1 para)

Advocacy (accept since, reject since, 1 para)

List of "Pros"

List of "Cons"

Changes needed before I can recommend accept (if any)