researchart / rose6icse


Ankou #94

Closed minkull closed 4 years ago

minkull commented 4 years ago

https://github.com/researchart/rose6icse/tree/master/submissions/available/Ankou https://github.com/researchart/rose6icse/tree/master/submissions/reusable/Ankou

seeking Reusable and Available Badges

minkull commented 4 years ago

Note to reviewers: these authors want multiple badges

ai-sta-website commented 4 years ago

@ai4se @sangkilc

I followed the instructions described in the GitHub repository (https://github.com/researchart/rose6icse/tree/master/submissions/available/Ankou) to install Ankou in a Docker container running Ubuntu 18.04. There are some problems when trying to reproduce the results:

No information about the 24 subjects used in RQ1 and RQ3.

As mentioned in the paper, the authors obtained 150 different subjects from 24 packages and randomly selected one subject per package to form the benchmark (24 subjects in total) when evaluating the impact of Ankou's dimensionality reduction (RQ1 in Sec 6.2) and the necessity of the distance-based fitness function (RQ3 in Sec 6.4). Table 1 illustrates all the experimental results, so the 24 selected subjects are required to reproduce the results in Table 1. However, detailed information on which subjects were randomly selected in the evaluation is not provided.
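For illustration only, a selection like this could be made reproducible with a few lines of Python, assuming the benchmark JSON lists every subject with its binary path, arguments, and package (the 'package' field and the file schema below are assumptions, not taken from the artifact):

    import json
    import random

    # Illustrative sketch: pick one subject per package, reproducibly.
    # The 'package' field is hypothetical; 'puts', 'bin_path', and 'args'
    # are assumptions about the benchmark JSON schema.
    with open('configuration.json') as f:
        config = json.load(f)

    by_package = {}
    for put in config['puts']:
        by_package.setdefault(put.get('package', put['bin_path']), []).append(put)

    random.seed(0)  # fixed seed so the same 24 subjects are chosen every run
    for put in (random.choice(subjects) for subjects in by_package.values()):
        print(put['bin_path'], ' '.join(put['args']))

Recording the seed, or simply the resulting list of 24 subjects, in the repository would let reviewers reproduce Table 1 on exactly the same subjects.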

The evaluation reproduction section in README.md is too brief to follow.

The first step is to compile the 24 packages mentioned in the paper, at the same versions or commits, using afl-gcc. None of these 24 packages is provided, and collecting them online costs reviewers too much effort. Also, since a Docker container is used for fuzzing, it would be better if a Dockerfile were available to set up the environment and compile all packages, instead of leaving these steps to reviewers. The second step is to run the produced subjects with the commands found in configuration.json, so the reviewer still needs to convert this JSON file into 150 separate commands and run them in a Docker container; a script should be provided (a rough sketch of such a script is given below). The third step is to analyze the output directory for results. The problem here is that the fuzzing campaign statistics in $OUTPUT_DIR/status* are too messy for reviewers to analyze, and no detailed information or script is provided to facilitate the analysis. Furthermore, reviewers cannot get coverage and throughput information from the output.
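To make the request concrete, here is a minimal sketch of the kind of driver script meant above. It emits one Ankou invocation per subject; the 'puts', 'bin_path', and 'args' field names are assumptions about configuration.json's schema, and the Ankou flags (-app, -args, -i, -o, -threads) follow the invocation pattern used in the reproduction script further down this thread:

    import json

    # Minimal sketch (assumed schema, not taken from the artifact): turn each
    # configuration.json entry into an Ankou invocation and collect them in a
    # shell script.
    with open('configuration.json') as f:
        config = json.load(f)

    with open('run_all.sh', 'w') as out:
        out.write('#!/bin/sh\n')
        for put in config['puts']:
            binary = put['bin_path']
            args = ' '.join(put['args'])
            name = binary.rsplit('/', 1)[-1]
            out.write('go run github.com/SoftSec-KAIST/Ankou -app %s -args "%s" '
                      '-i seeds -o %s_out -threads 1\n' % (binary, args, name))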

No steps to set up safe stack hash to triage crashes.

To evaluate the number of bugs found, crashes can easily be found in the $OUTPUT_DIR/crashes-* directories. However, only the unique bugs found by Ankou are listed in Table 2. As mentioned in Sec 6.7, the authors decided to use a safe stack hash to triage duplicate crashes. Thus, without detailed information about how to compute the safe stack hash, there is no way for reviewers to count the number of bugs from the crash information.
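For context only (this is not necessarily the safe stack hash procedure of Sec 6.7), stack-hash triage generally groups crashes by a hash over the top frames of the crash backtrace. A rough sketch, assuming the subject takes the crashing input as a command-line argument:

    import hashlib
    import subprocess

    # Rough sketch of stack-hash triage (NOT the authors' exact method): run the
    # target on a crashing input under gdb, keep the top few backtrace frames,
    # and hash them. Crashes sharing a hash are counted as one bug.
    def stack_hash(binary, crash_input, depth=3):
        gdb = subprocess.run(
            ['gdb', '--batch', '-ex', 'run ' + crash_input, '-ex', 'bt', binary],
            capture_output=True, text=True)
        frames = [line for line in gdb.stdout.splitlines() if line.startswith('#')]
        return hashlib.sha1('\n'.join(frames[:depth]).encode()).hexdigest()

    # Hypothetical usage; the paths below are placeholders, not from the artifact.
    # print(stack_hash('./cflow', 'output/crashes-1/input-000'))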

@ai4se Could you shed some light on the issues mentioned above?

timm commented 4 years ago

@sangkilc : please reply to the above.

@random-friendly-dude : I look forward to your review

ai-sta-website commented 4 years ago

@timm Hi Tim, I have already provided my review above.

sangkilc commented 4 years ago

Thanks for the review. @Jiliac is preparing for benchmark programs. He will respond to this thread ASAP.

Jiliac commented 4 years ago

Thanks for the review. First of all, I am sorry the evaluation description was so brief. I didn't understand that the Reusable and Available badges were also about reproducing the paper's experiments.

We will add the 24 subjects of RQ1 and RQ3, and the steps to reproduce the safe stack hash triage we used, by next Wednesday. Concerning the execution of the 150 subjects and their evaluation, we now provide the source of all the packages at the right versions at: https://github.com/SoftSec-KAIST/Ankou-Benchmark. We will also add, by Wednesday, indications on which statistics from $OUTPUT_DIR/status* were used to produce each data point for the RQs. However, producing a full-fledged benchmark with all subjects pre-compiled and ready to run would take too much time.

minkull commented 4 years ago

@sec365 looking forward to your review

sec365 commented 4 years ago

I agree with the first reviewer.

I am able to compile and run Ankou on binutils following the authors' instructions. However, there is no information regarding where to download the 24 packages mentioned in the paper and how to compile them. I also find it difficult to interpret the results in Ankou's output directory.

I believe the current version does not satisfy the criteria of “available” and “reusable”.

====available==== Only the fuzzer is available. I suggest the authors make the evaluation subjects available as well. Otherwise, it is difficult to judge whether the tool is “functional”.

====reusable==== To facilitate reuse, the authors should provide more detailed instructions on how to set seeds and interpret the output of Ankou.

I see that the authors are trying to improve the submission. I will be happy to review the revised version.

Jiliac commented 4 years ago

Sorry for the brevity of the previous comment. We understand that the previous submission was missing the details needed to reproduce the evaluation, so we have updated our repositories with new READMEs.

We could have provided a single script that completely rebuilds every subject and redoes our experiments, but it would take months to finish, which is simply infeasible to evaluate. So instead, we provide a Dockerfile that automatically sets up the whole environment, including two packages from our benchmark. The two packages were chosen based on how quickly Ankou finds a first crash in them, to make it possible to evaluate our tool without waiting hours for crashes to appear. In addition, we provide detailed instructions on how to interpret our results.

In case you want to try more packages in our benchmark, we provide the list of subjects we used as well as their arguments in benchmark/configuration.json. We hope this version makes sense, and we will be happy to answer more questions if you have any.

timm commented 4 years ago

@sec365

@random-friendly-dude

please advise

ai-sta-website commented 4 years ago

@timm

Thanks. We are working on it according to the latest instructions provided by the authors. Will update today.

sec365 commented 4 years ago

The authors have addressed my earlier concerns.

  1. The authors have made the 24 packages available at https://github.com/SoftSec-KAIST/Ankou-Benchmark. I tried to compile cflow. It was successful.

  2. The authors also gave more instructions. Following the instructions at https://github.com/SoftSec-KAIST/Ankou, I successfully ran Ankou to fuzz cflow with the provided seeds. I let the tool run for about an hour and it detected 175 crashes. I was able to print the branch coverage, throughput, and effectiveness values with the provided python commands. I then ran cflow on a randomly chosen crashing input. I was able to use the scripts in the triage folder to obtain the stack hash.

  3. The Docker image can also be built successfully. The authors have set up the environment, and I can run Ankou easily using the image.

Therefore, I am happy to recommend the two badges.

One suggestion: please consider releasing the artifacts on Zenodo (or a similar service) as one archive and providing a DOI. Currently, they are spread across separate repositories.

ai-sta-website commented 4 years ago

Installation

I followed the installation steps described in the GitHub repository: https://github.com/SoftSec-KAIST/Ankou. It provides installation and evaluation steps for Ankou. I successfully built the tool using the provided commands on my machine running Ubuntu 18.04.

Evaluation

First, I compiled the source of the 24 program packages with afl-gcc, based on the commands provided at https://github.com/SoftSec-KAIST/Ankou-Benchmark:

CC=afl-gcc CXX=afl-g++ ./configure --prefix=`pwd`/build
make -j
make install

cmake .. \
    -DCMAKE_INSTALL_PREFIX=./locals \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_CXX_COMPILER=afl-g++
make -j
make install

Next, I parsed the JSON file benchmark/rq1_rq3.json using the following Python script to reproduce the results of RQ1 and RQ3, in which Ankou is evaluated on the impact of its dimensionality reduction and the necessity of the distance-based fitness function.

import json

if __name__ == '__main__':
    data = {}
    with open('rq1_rq3.json', 'r+') as outfile:
        data = json.load(outfile)
    for put in data['puts']:
        bin_idx = put['bin_path'].rfind('/')
        bin_path = put['bin_path'][bin_idx+1:]
        seeds_path = ' -i seeds'
        args = ' -args' + ' \"' + ' '.join(put['args']) + '\"'
        output_path = ' -o ' + bin_path + '_out'
        log_path = bin_path + '_log.txt'
        cmd = 'go run github.com/SoftSec-KAIST/Ankou -app ' + bin_path + args + seeds_path + ' -threads 1' \

I compared the evaluation results with the numbers shown in the paper (mainly Table 1 and Figure 4). The branch coverage and overall throughput are almost the same, with acceptable differences. The effectiveness of the dynamic PCA is around 70%, which is below the 80% mentioned in Sec 6.3 of the paper; I think this is because I only evaluated 24 of the 150 subjects. Based on the crashes, the stack hashes can easily be computed following the setup steps.

Summary

The authors provided detailed instructions to build and evaluate their tool Ankou. All steps can be done smoothly. The evaluation results are also quite promising. I believe that Ankou is reusable and available.