prosyslab / DAFL-artifact

Are there any factors that could cause the output of experimental results to be unstable? #3

Closed 955xiaoSu closed 9 months ago

955xiaoSu commented 10 months ago

First of all, thank you very much for your outstanding work and for selflessly making it open source.

However, while using it, I ran into a confusing issue: why is the number of experimental results in the /output folder smaller than the number of experimental targets?

For example, as shown in figure 1 below, I ran 9 experimental targets but received fewer than 9 results. The command I used was: python3 /DAFL-artifact/scripts/reproduce.py run tbl2 7200 1. I modified the target_list variable in main() of reproduce.py for the 'tbl2' option to specify the experimental targets, and also changed the TARGETS variable in benchmark.py, which generate_fuzzing_worklist() uses, so that the generated worklist matches.
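
For reference, the kind of change I made looks roughly like the sketch below; the target names are placeholders and the exact structure of main() may differ:

```python
# Sketch only: illustrating the modification described above; target names are placeholders.
# In reproduce.py, inside main(), restrict the targets used for the 'tbl2' experiment:
target_list = [
    "target-A", "target-B", "target-C",
    # ... 9 targets in total
]

# In benchmark.py, keep TARGETS consistent so that generate_fuzzing_worklist()
# produces a worklist that matches the targets above:
TARGETS = [
    "target-A", "target-B", "target-C",
    # ... same placeholder names
]
```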

Judging from the output printed while reproduce.py was running, my guess is this: if the status of a target's Docker container is still not FINISHED after the predetermined time of 120 minutes, does that mean run_Beacon.sh never calls common-postproc.sh and therefore no experimental results are produced?

Or is it that the for loop in common-postproc.sh takes too long, so the subsequent cp -r output /output is never executed, and store_outputs() in reproduce.py ends up copying an empty folder to the host machine? Yet even in that case, it seems impossible that not even an empty replay_log.txt would exist in the corresponding /output path on the host machine.

I have not been able to confirm the cause of this problem, so I am asking for help here. [screenshots attached]

goodtaeeun commented 9 months ago

Hi, let me better understand the problem first before jumping into suggestions.

So after running

python3 /DAFL-artifact/scripts/reproduce.py run tbl2 7200 1

with a modified list of 9 targets, did you get successful results for at least some of the targets? I just want to check whether the fuzzing process is working for at least some of the targets in your environment.

955xiaoSu commented 9 months ago

> Hi, let me better understand the problem first before jumping into suggestions.
>
> So after running
>
> python3 /DAFL-artifact/scripts/reproduce.py run tbl2 7200 1
>
> with a modified list of 9 targets, did you get successful results for at least some of the targets? I just want to check whether the fuzzing process is working for at least some of the targets in your environment.

According to my experiments so far, the number of output results produced is very unstable. Sometimes there are no results at all (as in the round of experiments with 9 targets mentioned above), and sometimes only partial results are produced. The picture below is from an experiment run just yesterday: running ls in the output directory shows no results at all. [screenshot attached]

In yesterday's round of experiments, I made the following modifications to reproduce.py and benchmark.py:

  1. reproduce.py: restricted the tools to DAFL_noasan only;
  2. benchmark.py: commented out the unnecessary targets in the FUZZ_TARGET variable.

Apart from the above two changes, I did not make any other modifications. Finally, I ran the command: python3 /root/DAFL-artifact/scripts/reproduce.py run tbl2 7200 2.

Other than that, I made sure the machine had enough memory to perform fuzzing.

[screenshots attached]

955xiaoSu commented 9 months ago

After reviewing the entire experiment, I believe the root cause of the observed results is the spawn_container() function in reproduce.py. It seems the '--rm' parameter did not work as expected: when I ran docker ps -a, I saw containers that should have been deleted after each iteration of the experiment still present, with the same names as the target containers. As a result, the containers intended to fuzz the targets were never actually started, which explains why I did not get any results in the end.

To resolve this, I simply pruned all the unused containers and then redid the experiment according to the instructions in README.md. After this adjustment, everything worked well! [screenshot attached]

goodtaeeun commented 9 months ago

Ah, now I also see what the problem was.

When the previous experiment successfully finishes, reproduce.py cleans up the docker containers. However, if you have terminated the experiment script for some reason, for example, with ctrl + c, only the python script is killed, but not the docker containers that are actually running the fuzzing sessions. In this case, the --rm option does not help. Thus, as you have figured out, you must kill all the docker containers in order to run the next experiment successfully.

I think this can be resolved by modifying the script to check existing containers before spawning new ones.
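
For instance, a minimal sketch of such a check, using the docker CLI via subprocess (the helper name is hypothetical and not something that exists in reproduce.py today), could look like this:

```python
import subprocess

def remove_stale_container(name):
    """Hypothetical helper: if a container with this exact name already exists
    (running or exited), force-remove it so a fresh one can be spawned."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--filter", f"name=^/{name}$", "--format", "{{.Names}}"],
        capture_output=True, text=True)
    if result.stdout.strip():
        subprocess.run(["docker", "rm", "-f", name], check=True)

# If called at the start of spawn_container(), right before the 'docker run' command,
# this would keep leftovers from an interrupted run from blocking new containers.
```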

955xiaoSu commented 9 months ago

> Ah, now I also see what the problem was.
>
> When the previous experiment successfully finishes, reproduce.py cleans up the docker containers. However, if you have terminated the experiment script for some reason, for example, with ctrl + c, only the python script is killed, but not the docker containers that are actually running the fuzzing sessions. In this case, the --rm option does not help. Thus, as you have figured out, you must kill all the docker containers in order to run the next experiment successfully.
>
> I think this can be resolved by modifying the script to check existing containers before spawning new ones.

Thank you for pointing out the issue with the Docker containers and suggesting a modification to the script. I'll make sure to implement this check to avoid any conflicts in future experiments. Your guidance is greatly appreciated!