Closed tiddman closed 4 years ago
Thanks! I've added some extra checks to consider the candidate delimiter that shows up in more lines.
I also released version 2.9.1-SNAPSHOT with the adjustment so you can test it out.
Hope this solves your issue.
Thank you for using our parsers!
This example shows 20 lines that are delimited by \t, but which start with a line fragment (the end of the previous line). When delimiter detection is enabled with two candidate delimiters, '|' and '\t', getDetectedFormat() returns the first one, '|', even though it doesn't actually appear in the file. The second delimiter, '\t', seems to be a much better match since it appears on all but one of the lines.

Looking through CsvFormatDetector.execute(), it looks like the logic is to parse the file and count how many symbols appear on each line, then exclude any symbols that don't appear on ALL lines, and, if no symbols are left, default to the first candidate delimiter. In this case, no symbol appears on all lines ('\t' appears on all but the first line, and the first line contains ' ', which is missing from some other lines), so the logic falls through and the first candidate delimiter, '|', is returned. That seems incorrect since it isn't in the data at all.

This situation can occur in our data because files are split for processing and are otherwise somewhat inconsistent, but we were hoping there is enough good content here to detect the delimiter. Is this a bug or the intended behavior? Are there any parser settings or other configurations we can use to get this to work as we intend?
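
To make the difference concrete, here is a minimal, self-contained sketch of the two selection rules discussed in this thread: the strict "must appear on every line" rule as I understand it from reading the code, and a fallback that prefers the candidate found on the most lines. This is not the library's implementation. The class DelimiterVote and its methods are hypothetical names used only for illustration, and the sketch assumes at least one candidate is passed.

```java
import java.util.List;

// Hypothetical illustration only; not part of univocity-parsers.
final class DelimiterVote {

    // Counts how many lines contain the candidate character at least once.
    static int linesContaining(List<String> lines, char candidate) {
        int count = 0;
        for (String line : lines) {
            if (line.indexOf(candidate) >= 0) {
                count++;
            }
        }
        return count;
    }

    // Strict rule: a candidate is accepted only if it appears on every line.
    // If none qualifies, the first candidate is returned by default.
    static char pickStrict(List<String> lines, char... candidates) {
        for (char candidate : candidates) {
            if (linesContaining(lines, candidate) == lines.size()) {
                return candidate;
            }
        }
        return candidates[0];
    }

    // Fallback rule: pick the candidate that appears on the most lines.
    // Ties keep the earlier candidate.
    static char pickByLineCoverage(List<String> lines, char... candidates) {
        char best = candidates[0];
        int bestCount = linesContaining(lines, best);
        for (int i = 1; i < candidates.length; i++) {
            int count = linesContaining(lines, candidates[i]);
            if (count > bestCount) {
                best = candidates[i];
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // First line is a fragment of a previous record and has no tab.
        List<String> lines = List.of(
                "end of previous line",
                "a\tb\tc",
                "d\te\tf",
                "g\th\ti");

        System.out.println(pickStrict(lines, '|', '\t') == '|');          // true
        System.out.println(pickByLineCoverage(lines, '|', '\t') == '\t'); // true
    }
}
```

With the strict rule, a single fragment line is enough to disqualify '\t' and the result falls back to '|'. With the coverage rule, '\t' wins because it appears on three of the four lines while '|' appears on none.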