researchart / rose6icse

icse20-main-171 #113

Closed minkull closed 4 years ago

minkull commented 4 years ago

https://github.com/researchart/rose6icse/tree/master/submissions/available/icse20-main-171
https://github.com/researchart/rose6icse/tree/master/submissions/reusable/icse20-main-171
https://github.com/researchart/rose6icse/tree/master/submissions/replicated/icse20-main-171
https://github.com/researchart/rose6icse/tree/master/submissions/reproduced/icse20-main-171

corresponding author for artifact evaluation: Hui Guo

Authors

\author{Hui Guo}
\affiliation{
  \institution{University of California, Davis}
 }

\author{Cindy Rubio-Gonz\'alez}
\affiliation{
  \institution{University of California, Davis}
 }

Note to reviewers: these authors want multiple badges

unknown-user1234 commented 4 years ago

The authors apply for the Reusable, Available, Replicated and Reproduced Badges.

Available: The code was carefully organized and made available in a repository. Documentation on downloading the artifact and navigating its main folders was also provided.

Decision: Accept

Reusable: Examples and documentation were provided, which facilitates reuse of the approach on different source code files.

Decision: Accept

Replicated: The authors provided information on how to build the environment and run the experiments in the paper. However, the provided command lines for replicating the experiments (e.g., nohup ./fpgen.sh sum 1norm & [I tried with and without the “&”]) didn’t execute properly.

Decision: Reject

Reproduced: This badge requires reproducing the experiments without using the authors' code. This is not possible at this moment.

Decision: Reject

huiguoo commented 4 years ago

@timm @minkull @random-friendly-dude @crubiog

@unknown-user1234, thank you for the feedback.

Could you please provide more information on what error the reviewer received for "Replicated"? The feedback simply says the command did not run. However, we provided a Docker to run the commands in, and several people outside our project verified that the commands ran successfully and the results were replicated. We do not see how running the command in the provided Docker could fail.

Thank you,
Hui

ai-sta-website commented 4 years ago

@HGuo15

Review FPGen

Artifact summary


In this paper, the authors transform the problem of generating inputs that induce high floating-point errors into a code coverage maximization problem that can be solved by symbolic execution.

FPGen enables symbolic execution to explore all rounding and cancellation possibilities in different code areas by injecting inaccuracy checks after floating-point arithmetic operations.
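
To illustrate what an inaccuracy check after a floating-point operation might look like, here is a minimal Python sketch. It is not FPGen's implementation: the function names `cancelled_bits` and `check_subtraction`, and the threshold of 20 bits, are assumptions chosen only for illustration. The idea is that a subtraction of two nearby values cancels their leading bits, which can be detected by comparing the binary exponent of the result with the exponents of the operands.

```python
import math


def exponent(x: float) -> int:
    """Binary exponent e such that x = m * 2**e with 0.5 <= |m| < 1."""
    return math.frexp(x)[1]


def cancelled_bits(a: float, b: float) -> int:
    """Estimate how many leading bits cancel when computing a - b."""
    if a == 0.0 or b == 0.0:
        # Subtracting zero (or from zero) cancels nothing.
        return 0
    result = a - b
    if result == 0.0:
        # Total cancellation: all 53 significand bits of a double are gone.
        return 53
    return max(0, max(exponent(a), exponent(b)) - exponent(result))


def check_subtraction(a: float, b: float, threshold: int = 20) -> bool:
    """Hypothetical inaccuracy check: True if a - b cancels at least
    `threshold` leading bits (the threshold is an illustrative assumption)."""
    return cancelled_bits(a, b) >= threshold
```

For example, `check_subtraction(1.0 + 2**-30, 1.0)` flags the subtraction, while `check_subtraction(3.0, 1.0)` does not. A check of this shape gives a coverage-driven tool a concrete branch to reach, which is the kind of reduction the summary above describes.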

The artifact is publicly available and reusable. They evaluate FPGen on 3 summation algorithms, 9 matrix computation routines from the Meschach library, and 15 statistics routines from the GNU Scientific library (GSL), compared with (1) a random input generator, (2) S3FP, the state-of-the-art floating-point error-inducing input generator, and (3) KLEE-Float, the floating-point symbolic execution engine used in FPGen.

General assessment


I followed the instructions in the GitHub repository: https://github.com/ucd-plse/FPGen.

I used an Ubuntu 18.04.3 LTS machine with an Intel® Core™ i7-8565U CPU @ 1.80GHz and 12GB of memory to run the FPGen container in Docker.

I ran 11 of the 27 benchmarks with the time bound set to 2 hours. The total running time was about 66 hours.

Result


| benchmark | Rel.Error (Random) | Rel.Error (S3FP) | Rel.Error (KLEE-Float) | Rel.Error (FPGen) |
|---|---|---|---|---|
| pairwise-summation | 0.0000e+00 | 0.0000e+00 | 0.0000e+00 | 1.3174e-16 |
| 2norm | 3.1249e-16 | 3.1170e-16 | 0.0000e+00 | 2.2117e-16 |
| dot | 1.7010e-12 | 4.4579e-09 | 0.0000e+00 | 1.9190e-04 |
| lu | 0.0000e+00 | 0.0000e+00 | 0.0000e+00 | 2.7327e+00 |
| wmean | 1.72281e-11 | 1.75737e-07 | 0.0000e+00 | 1.0000e+00 |
| wvariance-w | 8.85416e-11 | 2.09184e-05 | 0.0000e+00 | 2.2858e-12 |
| wsd-w | 4.42709e-11 | 1.04591e-05 | 0.0000e+00 | 1.1429e-12 |
| wtss | 5.54205e-16 | 5.31847e-16 | 0.0000e+00 | 4.4513e-16 |
| wabsdev | 3.44206e-11 | 2.20766e-05 | 0.0000e+00 | 1.0000e+00 |
| wkurtosis | 4.51066e-11 | 1.40364e-07 | 0.0000e+00 | 1.7733e-12 |
| wskew-m | 3.78488e-10 | 0.0316462 | 0.0000e+00 | 2.5675e+01 |
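
The table does not say how relative error is computed. A common definition is |computed - exact| / |exact|, where the exact value comes from a higher-precision reference. The Python sketch below assumes that definition and uses `fractions.Fraction` as the exact reference; the helper names and the sample inputs are illustrative assumptions, not taken from the artifact.

```python
from fractions import Fraction


def relative_error(computed: float, exact: Fraction) -> float:
    """Return |computed - exact| / |exact| (assumed definition)."""
    if exact == 0:
        return 0.0 if computed == 0.0 else float("inf")
    return float(abs(Fraction(computed) - exact) / abs(exact))


def naive_sum(values):
    """Left-to-right double-precision summation."""
    total = 0.0
    for v in values:
        total += v
    return total


def exact_sum(values):
    """Exact rational sum of the same double-precision inputs."""
    return sum((Fraction(v) for v in values), Fraction(0))


# Illustrative inputs (assumptions, not benchmark data).
benign = [1.0, 2.0, 3.0]
cancelling = [1e16, 1.0, -1e16]

benign_error = relative_error(naive_sum(benign), exact_sum(benign))
cancelling_error = relative_error(naive_sum(cancelling), exact_sum(cancelling))
```

With these inputs, `benign_error` is 0.0, while `cancelling_error` is 1.0: the 1.0 is absorbed when added to 1e16, so the computed sum is 0.0 although the exact sum is 1. This shows how a carefully chosen input can push the relative error of a simple routine to order 1.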

Summary


The results of all tested benchmarks are exactly the same as the results in the paper.

Badge


The reviewer believes the submitted artifact can be given the Reusable and Available badges. However, the other two badges are not feasible at this stage. In case you were not aware, the "Replicated" and "Reproduced" badges apply only when your results have been obtained by other research articles in the community. Please refer to the instructions and let us know if you disagree (https://conf.researchr.org/track/icse-2020/icse-2020-Artifact-Evaluation#Call-for-Submissions).

unknown-user1234 commented 4 years ago

Could you please provide more information on what error the reviewer received for "Replicated"? The feedback simply says the command did not run...

Hi Hui Guo,

Here is the output of the command:

  1. fptesting@04b53a37b055:/home/FPTesting/benchmarks/matrix$ nohup ./fpgen.sh sum 1norm &
  2. [1] 19
  3. fptesting@04b53a37b055:/home/FPTesting/benchmarks/matrix$ nohup: ignoring input and appending output to '/home/fptesting/nohup.out'
  4. [1]+ Exit 1 nohup ./fpgen.sh sum 1norm

I numbered the lines in the terminal to make it easier to explain what I did. I executed line 1, then waited for around 2.5 hours. Since it didn't finish, I pressed "Enter" again. Then the lines 2 to 5 were displayed.
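
For readers unfamiliar with the output above: `[1]+ Exit 1` is the shell's job-status line, reporting that the background job ended with exit status 1, and the command's own output went to `nohup.out`. The Python sketch below shows the same pattern (start a command without blocking, log its output, then inspect the exit status). The helper names and the stand-in command are hypothetical and unrelated to `fpgen.sh`.

```python
import subprocess
import sys


def run_in_background(command, log_path):
    """Start `command` without blocking, appending stdout and stderr to log_path."""
    with open(log_path, "ab") as log:
        # The child process keeps its own handle to the log file.
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
        )


def describe(process):
    """Summarize the job state from its exit status."""
    code = process.poll()
    if code is None:
        return "still running"
    if code == 0:
        return "finished successfully"
    return f"failed with exit status {code}"


if __name__ == "__main__":
    # Stand-in command that prints a message and exits with status 1.
    job = run_in_background(
        [sys.executable, "-c", "import sys; print('boom'); sys.exit(1)"],
        "job.log",
    )
    job.wait()
    print(describe(job))
```

When a job reports a nonzero status, the log file is usually the first place to look for the actual error message.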

It is worth mentioning that my machine has a 4-core CPU and 16 GB of memory. Could that be the cause?

Best regards.

huiguoo commented 4 years ago

@unknown-user1234 The job is executed in the background, and I believe you have successfully run our tool "FPGen". You can then run ls in the current directory, and you should see a file named result-fpgen.txt. This file has the results of the two tests you ran, i.e., sum and 1norm.

After that, you can manually inspect the results and compare them to the paper, or you can use our script to automatically check the results of sum and 1norm:

../../scripts/cmp_to_ref.sh -m result-fpgen.txt reference/result-fpgen.txt sum 1norm

This script will print pass when the results match the paper, or print fail when they don't match. Let me know if you have any questions.

unknown-user1234 commented 4 years ago

@HGuo15

Even following your instructions, I didn't manage to execute the code (maybe something is missing in my environment). Anyway, since @random-friendly-dude executed it, I am satisfied. However, I keep my decision because replicating all the experiments (necessary condition to achieve the replicated badge) is unfeasible, it demands a large amount of time, and reproducing them is impossible at this moment.

Best regards!

timm commented 4 years ago

I agree with the above reviewers that this artifact merits "available" and "reusable" but not "replicated" or "reproduced".

If the authors wish to dispute that decision, then I refer to the criteria for replicated and reproduced at https://conf.researchr.org/track/icse-2020/icse-2020-Artifact-Evaluation#Call-for-Submissions.

timm commented 4 years ago

meanwhile I will NOT close this issue just in case there is any further discussion

crubiog commented 4 years ago

@random-friendly-dude @unknown-user1234 @timm @HGuo15

We agree with the recommendation to receive Reusable and Available badges. @random-friendly-dude, thank you for linking to the definition of Replicated and Reproduced badges, which helped to clarify the situation.

@unknown-user1234 said:

However, I keep my decision because replicating all the experiments (necessary condition to achieve the replicated badge) is unfeasible, it demands a large amount of time, and reproducing them is impossible at this moment.

We just want to make clear that we are not receiving the Replicated or Reproduced badges because our results have not been obtained in a subsequent study by a person or team other than the authors (though the subset of experiments run by @random-friendly-dude was reproducible). The reasons given by @unknown-user1234 above, however, are not the reasons why the badges are not recommended.