uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

CSV Parser interprets one line as multiple with different column amount #409

Closed Bios-Marcel closed 4 years ago

Bios-Marcel commented 4 years ago

Hey,

I was trying to parse a CSV file consisting of multiple rows, where each cell may contain spaces, commas, dots and newlines, with the closing quote sometimes on the next line. It appears that when a row contains a combination of trailing dots and multiple multi-line quoted cells, the result ends up being wrong.

To reproduce the problem, you can use the following data:

A   B   C.  "G
I
"   "J
M"

The cell separator in this case is a \t.

I am using the following code to parse the string:

  private static String[][] parseCSV( final String rawData )
  {
    final List<String[]> rows = new ArrayList<>();
    final CsvParserSettings settings = new CsvParserSettings();
    settings.detectFormatAutomatically( '\t' );
    settings.setIgnoreLeadingWhitespaces( false );
    settings.setIgnoreTrailingWhitespaces( false );
    settings.setSkipEmptyLines( false );

    // Otherwise empty lines are null values and cause errors.
    settings.setNullValue( "" );

    settings.setProcessor( new AbstractRowProcessor()
    {
      @Override
      public void rowProcessed( final String @Nullable [] row, final @Nullable ParsingContext __ )
      {
        if ( row != null )
        {
          rows.add( row );
        }
      }
    } );

    final CsvParser parser = new CsvParser( settings );
    try ( StringReader reader = new StringReader( rawData ) )
    {
      parser.parse( reader );
    }
    return rows.toArray( new String[rows.size()][] );
  }

The resulting array contains:

[[A  B C.  "G], [I], [" "J], [M"]]

instead of

[[A, B, C., G
I
, J
M]]
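
For reference, the expected result above follows from treating everything between an opening and a closing quote as cell content, newlines and tabs included. Below is a minimal, stdlib-only sketch of that rule. It is not how univocity-parsers is implemented; the class name `TabQuotedSketch` and its `parse` method are hypothetical and exist only for illustration. It assumes `\t` as the delimiter, `"` as the quote, and a doubled quote as the escape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of univocity-parsers.
class TabQuotedSketch {

    /** Splits tab-delimited text into rows, keeping newlines inside quoted cells. */
    static List<String[]> parse(String data) {
        List<String[]> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < data.length() && data.charAt(i + 1) == '"') {
                        cell.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    cell.append(c); // tabs and newlines are literal inside quotes
                }
            } else if (c == '"' && cell.length() == 0) {
                inQuotes = true; // opening quote at the start of a cell
            } else if (c == '\t') {
                row.add(cell.toString()); // end of cell
                cell.setLength(0);
            } else if (c == '\n') {
                row.add(cell.toString()); // end of cell and end of row
                cell.setLength(0);
                rows.add(row.toArray(new String[0]));
                row.clear();
            } else if (c != '\r') {
                cell.append(c); // ordinary character (carriage returns are dropped)
            }
        }
        // Flush the last row when the input does not end with a newline.
        if (cell.length() > 0 || !row.isEmpty()) {
            row.add(cell.toString());
            rows.add(row.toArray(new String[0]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // The sample data from this report, with tabs written out explicitly.
        String data = "A\tB\tC.\t\"G\nI\n\"\t\"J\nM\"";
        for (String[] r : parse(data)) {
            System.out.println(r.length + " cells: " + Arrays.toString(r));
        }
    }
}
```

Run against the sample data, this sketch yields a single row of five cells, with the fourth and fifth cells containing embedded newlines, which is the shape I expected from the parser.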

Bios-Marcel commented 4 years ago

It does not appear to be fixed by the latest commit.

jbax commented 4 years ago

Your input has an unescaped quote, so you can try a different unescaped quote handling mode, such as:

settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.BACK_TO_DELIMITER);

If you don't want to handle quotes at all, set the quote character to \0.

Thank you for using our parsers!

Bios-Marcel commented 4 years ago

I am confused. This is exactly how Excel gives me the CSV data. Those aren't unescaped quotes, but quotes around multi-line cells. This should be parsable, or am I wrong?

jbax commented 4 years ago

Sorry, the test I wrote had an extra quote. I'll look at this again.

jbax commented 4 years ago

Fixed! It was caused by the auto-detection process that assigned \n as the quote escape and wrecked everything.

Thank you for reporting the bug and sorry for the earlier confusion.

Bios-Marcel commented 4 years ago

Thanks :)

jbax commented 4 years ago

Just released version 2.9.0.

Thanks for using our parsers!

Bios-Marcel commented 4 years ago

Works fine 👍