tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Fix flaky-test TestSpreadsheetExtractor#testRTL #533

Open same8891 opened 8 months ago

Test failure reproduction

mvn install -pl . -am -DskipTests -Dsign.skip
mvn -pl . edu.illinois:nondex-maven-plugin:2.1.1:nondex -Dtest=technology.tabula.TestSpreadsheetExtractor#testRTL

NonDex detected the flakiness and reported the following error:

[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.436 s <<< FAILURE! - in technology.tabula.TestSpreadsheetExtractor
[ERROR] testRTL(technology.tabula.TestSpreadsheetExtractor)  Time elapsed: 0.434 s  <<< FAILURE!
org.junit.ComparisonFailure: expected:<[اسمي سلطان]> but was:<[]>
    at technology.tabula.TestSpreadsheetExtractor.testRTL(TestSpreadsheetExtractor.java:458)

Root cause and fix

The failing assertion is at line 458 of TestSpreadsheetExtractor.java:

assertEquals("اسمي سلطان", table.getRows().get(1).get(1).getText());

The flakiness is caused by the function findSpreadsheetsFromCells() in SpreadsheetExtractionAlgorithm.java, line 183. It uses HashSet and HashMap, whose iteration order is unspecified, so the function can return its results in a different order from one run to the next.

public static List<Rectangle> findSpreadsheetsFromCells(List<? extends Rectangle> cells) {
    // via: http://stackoverflow.com/questions/13746284/merging-multiple-adjacent-rectangles-into-one-polygon
    List<Rectangle> rectangles = new ArrayList<>();
    Set<Point2D> pointSet = new HashSet<>();
    Map<Point2D, Point2D> edgesH = new HashMap<>();
    Map<Point2D, Point2D> edgesV = new HashMap<>();

To fix this, I replaced HashSet and HashMap with LinkedHashSet and LinkedHashMap. The linked variants iterate in insertion order, whereas HashSet and HashMap make no guarantee about iteration order (NonDex deliberately shuffles it to expose code that depends on it). With this change the function is deterministic: the same input always produces the results in the same order.
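As a minimal, standalone sketch of the property the fix relies on (this is not the tabula-java code; the class name DeterministicOrderSketch and the helper uniqueInInsertionOrder are hypothetical, and only the variable names pointSet and edgesH are borrowed from the snippet above), the following shows that LinkedHashSet and LinkedHashMap iterate in the order elements were inserted:

```java
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class, not part of tabula-java.
public class DeterministicOrderSketch {

    // Collects points into a LinkedHashSet, which drops duplicates and
    // iterates in insertion order, then copies them into a list.
    static List<Point2D> uniqueInInsertionOrder(List<Point2D> points) {
        Set<Point2D> pointSet = new LinkedHashSet<>(points);
        return new ArrayList<>(pointSet);
    }

    public static void main(String[] args) {
        List<Point2D> corners = Arrays.asList(
                new Point2D.Float(10, 20),
                new Point2D.Float(0, 0),
                new Point2D.Float(10, 20), // duplicate of the first point, dropped by the set
                new Point2D.Float(5, 7));

        // Always prints the three distinct points in the order they were first seen.
        System.out.println(uniqueInInsertionOrder(corners));

        // A LinkedHashMap likewise iterates its keys in the order they were put.
        Map<Point2D, Point2D> edgesH = new LinkedHashMap<>();
        edgesH.put(corners.get(0), corners.get(1));
        edgesH.put(corners.get(3), corners.get(0));
        System.out.println(new ArrayList<>(edgesH.keySet()));
    }
}
```

With a plain HashSet or HashMap the same code would still compile and usually pass, but nothing in the API contract fixes the iteration order, which is the assumption NonDex is designed to break.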