stg-tud / MUDetect

Mozilla Public License 2.0
24 stars 8 forks source link

Non-Deterministic Mining Results #20

Closed SNielebock closed 4 years ago

SNielebock commented 5 years ago

When running the current version of the AUGMiner (0.0.3-SNAPSHOT), we observed non-deterministic mining results, i.e, different patterns as well as different numbers of patterns, while the input data, i.e., the source files, and the parameters of the Miner remains the same.

A minimal example is in the attached file 4.txt with a single method.

We generate the AUGs as follows:

File srcFile = new File(<path2File>);
AUGBuilder augBuilder = new AUGBuilder(new DefaultAUGConfiguration());
String[] classPaths = {};
Collection<APIUsageExample> collectedAUGs = augBuilder.build(srcFile.getAbsolutePath(), classPaths);

However, we also load the AUGs from previously serialized ones. Thus, we do not think that the non-deterministic results are caused by a non-deterministic AUG generation.

We configure the miner as follows:

Configuration minerConfig = new Configuration();
minerConfig.outputPath = new File("tmp_aug")
minerConfig.minPatternSize = 2;
minerConfig.maxPatternSize = 6;
minerConfig.minPatternSupport = (int)(Math.ceil(collectedAugs.size() * 0.01));
minerConfig.maxPatternSupport = (int)(Math.ceil(collectedAugs.size() * 1));

We also set the edu.iastate.cs.egroum.dot.DotGraph.EXEC_DOT variable to the respective dot-executable.

The miner is then executed via

DefaultAUGMiner miner = new DefaultAUGMiner(minerConfig);
miner.mine(collectedAUGs);

We observed the non-deterministic behavior under Windows as well as under Ubuntu. Is this non-deterministic behavior of the miner expected or is this an issue on how we applied the miner.

SNielebock commented 5 years ago

We did a further analysis of that bug. Attached you find our testbed (Main.java.txt) as well as the Java file which reveals that behavior (4.java.txt).

We found out that:

For us this looks like a classical Heisenbug somewhere in the construction of the AUG.

We also observed this behavior on two independent computers, both running Windows 10.

salsolatragus commented 5 years ago

Hi,

thank you very much for reporting and tracing this bug.

We had problems with indeterministic AUG generation is the past. Seems we didn't rule that out entirely yet... unfortunately, I will not be able to debug this anytime soon. Can you work around this by using serialization for the time being? I realize this is far from optimal...


Meanwhile, the mining is expected to be non-deterministic. For scalability reason, we do not perform an exhaustive search of all subgraphs, but greedily extend pattern candidates (no backtracking). This implies we might miss patterns. Since AUGs use Java HashSets to store edges, and the mining explores AUGs along edges, it might select neighbor nodes for extension in a different order in a different run and, thus, end up with different patterns. The same is true for the detection.

For a description of both algorithms, please check the preprint of our paper, which just became available.

We tried once to root out this source of non-determinism by using sets with stable order, but it turns out we already get things with non-deterministic order from the JDT (which we use for parsing and resolving the source code), such that we cannot entire control this.

salsolatragus commented 4 years ago

Closing, since no reply since 6 months.