uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

CSV parser quote/delimiter auto detection question #318

Open stroestefan opened 5 years ago

stroestefan commented 5 years ago

Hi guys,

I am using the 2.6.3 version to parse some CSV files, and the automatic quote detection only seems to return ", even for files that have a different quote.

I attached a sample file which is comma delimited and percent quoted. Do you think the file isn't big enough and that's why it's not detecting things properly? CommaDelimitedPercentQuoted.txt

Note: I tried upgrading to version 2.8.1, but that seemed to perform worse: for the same file it detects the delimiter as % and the quote as ".

Sample code:

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.BufferedReader;
import java.io.StringReader;

String sample; // at this point this contains the attached file's contents
BufferedReader reader = new BufferedReader(new StringReader(sample));
CsvParserSettings settings = new CsvParserSettings();
settings.setQuoteDetectionEnabled(true);
settings.setDelimiterDetectionEnabled(true, ',', '#', ';', '\t', '|', '%');
settings.setLineSeparatorDetectionEnabled(true);
CsvParser parser = new CsvParser(settings); // declare the parser's type
parser.parseAll(reader);

// And at this point:
char delimiter = parser.getDetectedFormat().getDelimiter(); // this is ,
char quote = parser.getDetectedFormat().getQuote(); // this is "

Can I add something to my code that will improve the detection? And is there a release date for version 3.0.0?

jbax commented 5 years ago

The quote detection only handles ' or "; anything else won't be detected. I'll make this work in version 3.0.0, which is a pretty large refactoring. I've been working on it for a couple of months now, and I'm hoping to get it released in a month or so.
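Until then, one possible workaround is to guess the quote character yourself before parsing. The sketch below is not how univocity-parsers detects quotes; it is a hypothetical, standard-library-only heuristic (the `QuoteSniffer` class and `detectQuote` method are made-up names) that assumes the delimiter is already known and counts how often each candidate character sits at a field boundary.

```java
public class QuoteSniffer {

    /**
     * Returns the candidate that most often appears at a field boundary
     * (start/end of input, or next to the delimiter or a line break).
     * Returns the fallback when no candidate is ever seen at a boundary.
     * Ties are resolved in favor of the earlier candidate.
     */
    public static char detectQuote(String sample, char delimiter, char fallback, char... candidates) {
        char best = fallback;
        int bestScore = 0;
        for (char candidate : candidates) {
            int score = 0;
            for (int i = 0; i < sample.length(); i++) {
                if (sample.charAt(i) != candidate) {
                    continue;
                }
                // A quote that opens a field follows a boundary; one that closes it precedes a boundary.
                boolean opens = i == 0 || isBoundary(sample.charAt(i - 1), delimiter);
                boolean closes = i == sample.length() - 1 || isBoundary(sample.charAt(i + 1), delimiter);
                if (opens || closes) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    private static boolean isBoundary(char c, char delimiter) {
        return c == delimiter || c == '\n' || c == '\r';
    }
}
```

If the guess looks right for your files, you could disable quote detection and set the result explicitly with `settings.getFormat().setQuote(quote)` before creating the parser. Since this assumes a known delimiter, it will not help with files where the delimiter itself is being misdetected.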

stroestefan commented 5 years ago

Thank you for the fast reply! I'll keep using the 2.6.3 version for now and upgrade straight to 3.0.0 when it comes out.