uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

New line symbol in a string in the TSV file #419

Closed wuxiaomin98 closed 4 years ago

wuxiaomin98 commented 4 years ago

Hi @jbax, I have a use case that involves parsing a TSV file.

One of the fields in the file contains a line break inside a quoted value. For example:

    id  title       description
    1   titletest1  descriptiontest1
    2   "titiletest2
    newline"    descriptiontest2

    TsvParserSettings settings = new TsvParserSettings();
    settings.setHeaderExtractionEnabled(true);
    TsvParser parser = new TsvParser(settings);
    parser.beginParsing(file, "UTF-8");
    Record record;

    while ((record = parser.parseNextRecord()) != null) {
        try {
            System.out.println("record " + record);
        } catch (Exception e) {
            logger.info("The process continue by skipping the line");
        }
    }

When I print the records, the parser seems to treat each physical line as a new record. It prints:

record 1, titletest1, descriptiontest1
record 2, "titiletest2, null
record newline", descriptiontest2

It should actually produce two records:

1   titletest1  descriptiontest1
2   "titiletest2    newline"    descriptiontest2

I checked the TSV settings. Unlike the CSV settings, they do not provide an option such as normalizeLineEndingsWithinQuotes, so the parser can't handle a "\n" inside a quoted string.

"titiletest2    
newline"
TsvFormat:
    Comment character=#
    Escape character=\
    Line separator (normalized)=\n
    Line separator sequence=\n

Any suggestions on this? Thanks!

wuxiaomin98 commented 4 years ago

Sample test file.

wuxiaomin98 commented 4 years ago

The sample test file above contains one record. However, because the quoted description field has line breaks in it, the parser prints multiple records.

jbax commented 4 years ago

You are parsing a CSV which happens to use tabs as the separator. TSV has no notion of quotes. Use the CsvParser to process this input.
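
A minimal sketch of that suggestion, assuming the input is tab-delimited with double-quoted values as in the sample from the first comment. The class name and the inline sample string are made up for illustration; real code would read from the file as in the original snippet. With the delimiter set to a tab, the CSV parser should keep the quoted line break inside the title field and return two records instead of three.

```java
import com.univocity.parsers.common.record.Record;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.StringReader;
import java.util.List;

// Hypothetical example class; not part of univocity-parsers.
public class TabDelimitedCsvExample {
    public static void main(String[] args) {
        // Inline copy of the sample data (an assumption; the issue reads it from a file).
        String input = "id\ttitle\tdescription\n"
                + "1\ttitletest1\tdescriptiontest1\n"
                + "2\t\"titiletest2\nnewline\"\tdescriptiontest2\n";

        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true);   // first row holds the column names
        settings.getFormat().setDelimiter('\t');     // tabs instead of commas
        settings.getFormat().setLineSeparator("\n"); // matches the format dump in the issue

        CsvParser parser = new CsvParser(settings);
        List<Record> records = parser.parseAllRecords(new StringReader(input));

        // Each record is printed with its quoted line break made visible as "\n".
        for (Record record : records) {
            String title = record.getString("title").replace("\n", "\\n");
            System.out.println(record.getString("id") + " | " + title
                    + " | " + record.getString("description"));
        }
    }
}
```

The TsvParser has no quote handling, so it has no way to tell a line break inside a value from the end of a record unless the break is written as an escape sequence. Switching to CsvParser with a tab delimiter keeps the rest of the parsing loop unchanged.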