uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Delimiter detection returns first candidate delimiter even if it does not exist in the file #415

Closed tiddman closed 4 years ago

tiddman commented 4 years ago
String lines =
    " 4509484 2\n"
        + "user37748\taddress\t0___Ku0GD8\n"
        + "user37749\taddress\t__We4__E22\n"
        + "user37750\taddress\tU460436rJK\n"
        + "user37751\taddress\tFP_6x_d_Mw\n"
        + "user37752\taddress\t_LZ9_F_9_0\n"
        + "user37753\taddress\ti___jF54__\n"
        + "user37754\taddress\t_SBv0pVB__\n"
        + "user37755\taddress\t5SXcz__f7c\n"
        + "user37756\taddress\td_2VY__IPe\n"
        + "user37757\taddress\t3__mC1i__5\n"
        + "user37758\taddress\tu_cGnJ_7O_\n"
        + "user37759\taddress\t_E2f76sH_7\n"
        + "user37760\taddress\t__DsG_wb0N\n"
        + "user37761\taddress\t__669503_B\n"
        + "user37762\taddress\t_p8lCr3h9_\n"
        + "user37763\taddress\ti0MO1Mh8_A\n"
        + "user37764\taddress\t_2__Yg___4\n"
        + "user37765\taddress\t__E_10_xwK\n"
        + "user37766\taddress\tHz__RNGCN_\n";

StringReader stringReader = new StringReader(lines);

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setDelimiterDetectionEnabled(true, '|', '\t');

CsvParser csvParser = new CsvParser(parserSettings);
csvParser.parseAll(stringReader);
CsvFormat detectedFormat = csvParser.getDetectedFormat();

log.info("delimiter = {}", detectedFormat.getDelimiter());

This example contains 20 lines: the first is a line fragment (the end of the previous line), and the remaining 19 are delimited by \t. With delimiter detection enabled and two candidate delimiters, '|' and '\t', getDetectedFormat() returns the first one, '|', even though it never appears in the file. The second candidate, '\t', seems a much better match since it appears on all but one of the lines.

Looking through CsvFormatDetector.execute(), the logic appears to be: parse the input and count how many symbols appear on each line, exclude any symbol that doesn't appear on ALL lines, and, if no symbols are left, default to the first candidate delimiter. In this case no symbol appears on every line: '\t' appears on all but the first line, and the first line contains ' ', which is missing from other lines. The logic therefore falls through and returns the first candidate delimiter '|', which seems incorrect since it isn't in the data at all.
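To make the fall-through concrete, here is a minimal sketch of the behaviour as I understand it. This is not the actual CsvFormatDetector code: the class name StrictDelimiterSketch and the method detect are hypothetical, and the logic is simplified to "a candidate must appear on every line, otherwise use the first candidate".

```java
public class StrictDelimiterSketch {

    // Hypothetical simplification of the reported behaviour: return the first
    // candidate that appears on every line; if no candidate does, fall back
    // to the first candidate in the list.
    public static char detect(String input, char... candidates) {
        String[] lines = input.split("\n");
        for (char candidate : candidates) {
            boolean onAllLines = true;
            for (String line : lines) {
                if (line.indexOf(candidate) < 0) {
                    onAllLines = false;
                    break;
                }
            }
            if (onAllLines) {
                return candidate;
            }
        }
        // No candidate survived the "appears on ALL lines" filter.
        return candidates[0];
    }

    public static void main(String[] args) {
        String input = " 4509484 2\n"
                + "user37748\taddress\t0___Ku0GD8\n"
                + "user37749\taddress\t__We4__E22\n";
        // '\t' is missing from the first line, so the fallback '|' is returned.
        System.out.println(detect(input, '|', '\t') == '|');
    }
}
```

With the leading fragment removed, the same sketch returns '\t', which is why a single inconsistent line is enough to change the result.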

This situation can occur in our data because files are split for processing and are otherwise somewhat inconsistent, but we were hoping there is enough good content here to detect the delimiter. Is this a bug or the intended behavior? Any suggestions on parser settings or other configurations we can make to get this to work as we intend?

jbax commented 4 years ago

Thanks! I've added some extra checks to consider the candidate delimiter that shows up in more lines.
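As a rough illustration of the idea only (not the code that went into the library), preferring the candidate that shows up in more lines could look like the sketch below. The class name MajorityDelimiterSketch and the method detect are hypothetical, and ties are assumed to go to the earlier candidate.

```java
public class MajorityDelimiterSketch {

    // Hypothetical heuristic: count the lines on which each candidate appears
    // and return the candidate with the highest count. Ties, and the case
    // where no candidate appears at all, resolve to the earliest candidate.
    public static char detect(String input, char... candidates) {
        String[] lines = input.split("\n");
        char best = candidates[0];
        int bestCount = 0;
        for (char candidate : candidates) {
            int count = 0;
            for (String line : lines) {
                if (line.indexOf(candidate) >= 0) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String input = " 4509484 2\n"
                + "user37748\taddress\t0___Ku0GD8\n"
                + "user37749\taddress\t__We4__E22\n";
        // '\t' appears on 2 of 3 lines and '|' on none, so '\t' wins.
        System.out.println(detect(input, '|', '\t') == '\t');
    }
}
```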

I also released version 2.9.1-SNAPSHOT with the adjustment so you can test it out.

Hope this solves your issue.

Thank you for using our parsers!