Open xazap opened 6 months ago
Hi @xazap,
Thanks for raising the issue.
I edited your description to clarify quoting vs. escaping.
In addition, I confirmed the behavior you have reported.
This may be an issue with the CSV parsing library that we use to support @CsvSource
.
In any case, we'll investigate what our options are.
This may be an issue with the CSV parsing library that we use to support
@CsvSource
.
That's indeed the case.
It looks like the Univocity CSV parser ignores control characters by default.
When I add the following to our CsvParserFactory.createParserSettings(...)
method, all invocations of your testWithUnquotedInput()
parameterized test pass.
settings.setSkipBitsAsWhitespace(false);
However, the latter two invocations of testWithQuotedInput()
still fail, and I'm not yet sure if we can influence that.
I assumed the skipBitsAsWhitespace
would apply to both quoted and unquoted text, but that appears not to be the case.
OK, after a bit more experimentation, I got your Reproduction
test cases (and the rest of the JUnit 5 suite) passing with the following additions to our CsvParserFactory.createParserSettings(...)
method.
settings.getFormat().setCharToEscapeQuoteEscaping('\\');
settings.setSkipBitsAsWhitespace(false);
Although these changes in the settings do not cause any of the tests in our test suite to fail, I'm a bit hesitant to change them for all users.
We may wish to introduce attributes in @CsvSource
and @CsvFileSource
to allow users to opt into these features; however, we would ideally like to keep the number of attributes in those annotations to a minimum.
In light of that, we'll discuss this topic during one of our upcoming team calls.
Please note that control characters are ignored in your testWithUnquotedInput()
test case, because they are considered leading or trailing whitespace.
Thus, the following passes without any modifications to JUnit Jupiter, since C\u0000D
contains \u0000
between other non-whitespace characters.
@ParameterizedTest
@CsvSource(delimiterString = "||", textBlock = """
A || 1
C\u0000D || 3
""")
void testWithUnquotedInput(String testcase, Integer expectedLength) {
assertNotNull(testcase);
assertEquals(expectedLength, testcase.length());
}
Similarly, the following also passes without any modifications to JUnit Jupiter by setting ignoreLeadingAndTrailingWhitespace = false
and removing all whitespace between columns and the delimiters.
@ParameterizedTest
@CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
A,1
\u0000,1
B\u0000,2
""")
void testWithUnquotedInput(String testcase, Integer expectedLength) {
assertNotNull(testcase);
assertEquals(expectedLength, testcase.length());
}
Please note that control characters are ignored in your
testWithUnquotedInput()
test case, because they are considered leading or trailing whitespace.
Thank you for explaining! For me it's confusing if @CsvSource
has a different definition of whitespace than the Java SDK itself. Since java.lang.Character.isWhitespace('\u0000')
returns false
, I would not have thought it to be considered whitespace. Still, a third party library could maintain its own definition. Looking at the Javadoc of @CsvSource
it mentions the term whitespace but doesn't clarify which code points it considers whitespace. Maybe this could be added to the Javadoc?
Thus, the following passes without any modifications to JUnit Jupiter, since
C\u0000D
contains\u0000
between other non-whitespace characters.
Ah, this works around the issue, but makes test less readable because I have to deliberately insert characters I don't want to test for.
Similarly, the following also passes without any modifications to JUnit Jupiter by setting
ignoreLeadingAndTrailingWhitespace = false
and removing all whitespace between columns and the delimiters.@ParameterizedTest @CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """ A,1 \u0000,1 B\u0000,2 """) void testWithUnquotedInput(String testcase, Integer expectedLength) { assertNotNull(testcase); assertEquals(expectedLength, testcase.length()); }
This works! The formatting is not as I would like, but it is an acceptable workaround. I am confused about the meaning of ignoreLeadingAndTrailingWhitespace=false
though. If leading whitespace is not to be ignored, why is the leading whitespace before A
not part of the testcase?
Also, I noticed something odd: if I move the closing """)
in your fixed example one tab to the left, the test fails for all three cases:
@ParameterizedTest
@CsvSource(ignoreLeadingAndTrailingWhitespace = false, textBlock = """
A,1
\u0000,1
B\u0000,2
""")
void testWithUnquotedInput(String testcase, Integer expectedLength) {
assertNotNull(testcase);
assertEquals(expectedLength, testcase.length());
}
Why would one less trailing whitespace character matter in this case?
Thank you for explaining!
You're welcome!
For me it's confusing if
@CsvSource
has a different definition of whitespace than the Java SDK itself. Sincejava.lang.Character.isWhitespace('\u0000')
returnsfalse
, I would not have thought it to be considered whitespace. Still, a third party library could maintain its own definition.
I understand how that can be confusing.
To be honest, I was not aware of the difference with the Univocity parser's default behavior, and I doubt anyone else on the JUnit team was aware of that either.
If I understood the documentation correctly, the difference is due to the fact that some databases include control characters in their exported CSV files which are typically ignored when importing or working with those CSV files.
Looking at the Javadoc of
@CsvSource
it mentions the term whitespace but doesn't clarify which code points it considers whitespace.
As I mentioned above, we were unaware of that difference.
Maybe this could be added to the Javadoc?
Yes, we can definitely update the Javadoc to make that explicit.
However, I'd first like to discuss these topics within the team before committing to anything concrete.
Thus, the following passes without any modifications to JUnit Jupiter, since
C\u0000D
contains\u0000
between other non-whitespace characters.Ah, this works around the issue, but makes test less readable because I have to deliberately insert characters I don't want to test for.
I was not suggesting that you use that as a workaround. Rather, I was merely pointing out how things work with the default CSV parser settings.
Similarly, the following also passes without any modifications to JUnit Jupiter by setting
ignoreLeadingAndTrailingWhitespace = false
and removing all whitespace between columns and the delimiters.This works! The formatting is not as I would like, but it is an acceptable workaround.
I'm glad to hear that's a suitable workaround for you. 👍
I am confused about the meaning of
ignoreLeadingAndTrailingWhitespace=false
though. If leading whitespace is not to be ignored, why is the leading whitespace beforeA
not part of the testcase?
There is no whitespace before A
in that example. See below.
Also, I noticed something odd: if I move the closing
""")
in your fixed example one tab to the left, the test fails for all three cases:Why would one less trailing whitespace character matter in this case?
If you move the closing """
, you have introduced intentional whitespace in the String.
This is simply how text blocks in Java work.
The documentation in the User Guide for @CsvSource
states the following:
It is therefore recommended that the closing text block delimiter (""") be placed either at the end of the last line of input or on the following line, left aligned with the rest of the input.
And:
Java’s text block feature automatically removes incidental whitespace when the code is compiled. However other JVM languages such as Groovy and Kotlin do not. Thus, if you are using a programming language other than Java and your text block contains comments or new lines within quoted strings, you will need to ensure that there is no leading whitespace within your text block.
I suggest you read that link which points to the Programmer's Guide to Text Blocks.
Hopefully that clarifies things!
I suggest you read that link which points to the Programmer's Guide to Text Blocks.
Hopefully that clarifies things!
Cheers, there is a lot more to text blocks than I knew! It makes perfect sense now.
Description
I am writing unit tests where test cases have input strings with (non-printable) control characters. These characters generally occupy code points U+0000 through U+001F. When using
@CsvSource
I am finding that using control character literals in strings behaves differently from printable characters.For example:
\u0000
literal is translated tonull
.\u0000
literal is translated to an empty string""
.This behavior is observed with both Eclipse's internal JUnit 5 test runner and with Maven's Surefire plugin. I have considered the possible impact of the
nullValues
parameter of@CsvSource
. This attribute defaults to{}
, so translation tonull
or an empty string is therefore not expected.Steps to reproduce
The test below should pass, but unexpectedly fails for
@CsvSource
test cases that have\u0000
literals.Context
Used versions
Build Tool/IDE