Support control characters in `@CsvSource` and `@CsvFileSource`

xazap commented 6 months ago

Description

I am writing unit tests where test cases have input strings with (non-printable) control characters. These characters generally occupy code points U+0000 through U+001F. When using @CsvSource I am finding that using control character literals in strings behaves differently from printable characters.

For example:

An unquoted \u0000 literal is translated to null.
A quoted \u0000 literal is translated to an empty string "".

This behavior is observed with both Eclipse's internal JUnit 5 test runner and with Maven's Surefire plugin. I have considered the possible impact of the nullValues parameter of @CsvSource. This attribute defaults to {}, so translation to null or an empty string is therefore not expected.

Steps to reproduce

The test below should pass, but unexpectedly fails for @CsvSource test cases that have \u0000 literals.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class Reproduction {

    @Test
    void proveThatStringWithControlCharacterLiteralIsNotNullAndHasLenghtOfOne() {
        assertEquals(1, "\u0000".length());
    }

    @ParameterizedTest
    @CsvSource(delimiterString = "||", textBlock = """
        A       || 1
        \u0000  || 1
        B\u0000 || 2
    """)
    void testWithUnquotedInput(String testcase, Integer expectedLength) {
        assertNotNull(testcase);
        assertEquals(expectedLength, testcase.length());
    }

    @ParameterizedTest
    @CsvSource(delimiterString = "||", textBlock = """
        'A'       || 1
        '\u0000'  || 1
        'B\u0000' || 2
    """)
    void testWithQuotedInput(String testcase, Integer expectedLength) {
        assertNotNull(testcase);
        assertEquals(expectedLength, testcase.length());
    }

}

Context

Used versions

Jupiter 5.11.0-M2
Platform 1.11.0-M2

Build Tool/IDE

Eclipse 2024.03
Maven 3.9.6
JVM: Java HotSpot(TM) 64-Bit Server VM (build 17+35-LTS-2724, mixed mode, sharing)

sbrannen commented 6 months ago

Hi @xazap,

Thanks for raising the issue.

I edited your description to clarify quoting vs. escaping.

In addition, I confirmed the behavior you have reported.

This may be an issue with the CSV parsing library that we use to support @CsvSource.

In any case, we'll investigate what our options are.

sbrannen commented 6 months ago

This may be an issue with the CSV parsing library that we use to support @CsvSource.

That's indeed the case.

It looks like the Univocity CSV parser ignores control characters by default.

When I add the following to our CsvParserFactory.createParserSettings(...) method, all invocations of your testWithUnquotedInput() parameterized test pass.

settings.setSkipBitsAsWhitespace(false);

However, the latter two invocations of testWithQuotedInput() still fail, and I'm not yet sure if we can influence that.

I assumed the skipBitsAsWhitespace would apply to both quoted and unquoted text, but that appears not to be the case.

sbrannen commented 6 months ago

OK, after a bit more experimentation, I got your Reproduction test cases (and the rest of the JUnit 5 suite) passing with the following additions to our CsvParserFactory.createParserSettings(...) method.

settings.getFormat().setCharToEscapeQuoteEscaping('\\');
settings.setSkipBitsAsWhitespace(false);

Although these changes in the settings do not cause any of the tests in our test suite to fail, I'm a bit hesitant to change them for all users.

We may wish to introduce attributes in @CsvSource and @CsvFileSource to allow users to opt into these features; however, we would ideally like to keep the number of attributes in those annotations to a minimum.

In light of that, we'll discuss this topic during one of our upcoming team calls.

sbrannen commented 6 months ago

Please note that control characters are ignored in your testWithUnquotedInput() test case, because they are considered leading or trailing whitespace.

Thus, the following passes without any modifications to JUnit Jupiter, since C\u0000D contains \u0000 between other non-whitespace characters.

@ParameterizedTest
@CsvSource(delimiterString = "||", textBlock = """
        A        || 1
        C\u0000D || 3
        """)
void testWithUnquotedInput(String testcase, Integer expectedLength) {
    assertNotNull(testcase);
    assertEquals(expectedLength, testcase.length());
}

Similarly, the following also passes without any modifications to JUnit Jupiter by setting ignoreLeadingAndTrailingWhitespace = false and removing all whitespace between columns and the delimiters.

    @ParameterizedTest
    @CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
            A,1
            \u0000,1
            B\u0000,2
            """)
    void testWithUnquotedInput(String testcase, Integer expectedLength) {
        assertNotNull(testcase);
        assertEquals(expectedLength, testcase.length());
    }

xazap commented 6 months ago

Please note that control characters are ignored in your testWithUnquotedInput() test case, because they are considered leading or trailing whitespace.

Thank you for explaining! For me it's confusing if @CsvSource has a different definition of whitespace than the Java SDK itself. Since java.lang.Character.isWhitespace('\u0000') returns false, I would not have thought it to be considered whitespace. Still, a third party library could maintain its own definition. Looking at the Javadoc of @CsvSource it mentions the term whitespace but doesn't clarify which code points it considers whitespace. Maybe this could be added to the Javadoc?

Thus, the following passes without any modifications to JUnit Jupiter, since C\u0000D contains \u0000 between other non-whitespace characters.

Ah, this works around the issue, but makes test less readable because I have to deliberately insert characters I don't want to test for.

Similarly, the following also passes without any modifications to JUnit Jupiter by setting ignoreLeadingAndTrailingWhitespace = false and removing all whitespace between columns and the delimiters.
  @ParameterizedTest
  @CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
          A,1
          \u0000,1
          B\u0000,2
          """)
  void testWithUnquotedInput(String testcase, Integer expectedLength) {
      assertNotNull(testcase);
      assertEquals(expectedLength, testcase.length());
  }

This works! The formatting is not as I would like, but it is an acceptable workaround. I am confused about the meaning of ignoreLeadingAndTrailingWhitespace=false though. If leading whitespace is not to be ignored, why is the leading whitespace before A not part of the testcase?

Also, I noticed something odd: if I move the closing """) in your fixed example one tab to the left, the test fails for all three cases:

    @ParameterizedTest
    @CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
            A,1
            \u0000,1
            B\u0000,2
        """)
    void testWithUnquotedInput(String testcase, Integer expectedLength) {
        assertNotNull(testcase);
        assertEquals(expectedLength, testcase.length());
    }

Why would one less trailing whitespace character matter in this case?

sbrannen commented 6 months ago

Thank you for explaining!

You're welcome!

For me it's confusing if @CsvSource has a different definition of whitespace than the Java SDK itself. Since java.lang.Character.isWhitespace('\u0000') returns false, I would not have thought it to be considered whitespace. Still, a third party library could maintain its own definition.

I understand how that can be confusing.

To be honest, I was not aware of the difference with the Univocity parser's default behavior, and I doubt anyone else on the JUnit team was aware of that either.

If I understood the documentation correctly, the difference is due to the fact that some databases include control characters in their exported CSV files which are typically ignored when importing or working with those CSV files.

Looking at the Javadoc of @CsvSource it mentions the term whitespace but doesn't clarify which code points it considers whitespace.

As I mentioned above, we were unaware of that difference.

Maybe this could be added to the Javadoc?

Yes, we can definitely update the Javadoc to make that explicit.

However, I'd first like to discuss these topics within the team before committing to anything concrete.

Thus, the following passes without any modifications to JUnit Jupiter, since C\u0000D contains \u0000 between other non-whitespace characters.

Ah, this works around the issue, but makes test less readable because I have to deliberately insert characters I don't want to test for.

I was not suggesting that you use that as a workaround. Rather, I was merely pointing out how things work with the default CSV parser settings.

Similarly, the following also passes without any modifications to JUnit Jupiter by setting ignoreLeadingAndTrailingWhitespace = false and removing all whitespace between columns and the delimiters.

This works! The formatting is not as I would like, but it is an acceptable workaround.

I'm glad to hear that's a suitable workaround for you. 👍

I am confused about the meaning of ignoreLeadingAndTrailingWhitespace=false though. If leading whitespace is not to be ignored, why is the leading whitespace before A not part of the testcase?

There is no whitespace before A in that example. See below.

Also, I noticed something odd: if I move the closing """) in your fixed example one tab to the left, the test fails for all three cases:

Why would one less trailing whitespace character matter in this case?

If you move the closing """, you have introduced intentional whitespace in the String.

This is simply how text blocks in Java work.

The documentation in the User Guide for @CsvSource states the following:

It is therefore recommended that the closing text block delimiter (""") be placed either at the end of the last line of input or on the following line, left aligned with the rest of the input.

And:

Java’s text block feature automatically removes incidental whitespace when the code is compiled. However other JVM languages such as Groovy and Kotlin do not. Thus, if you are using a programming language other than Java and your text block contains comments or new lines within quoted strings, you will need to ensure that there is no leading whitespace within your text block.

I suggest you read that link which points to the Programmer's Guide to Text Blocks.

Hopefully that clarifies things!

xazap commented 6 months ago

I suggest you read that link which points to the Programmer's Guide to Text Blocks.

Hopefully that clarifies things!

Cheers, there is a lot more to text blocks than I knew! It makes perfect sense now.

junit-team / junit5