w3c / csvw

Documents produced by the CSV on the Web Working Group

UAX35 numeric datatype format pattern ambiguity #894

Closed drexem closed 3 months ago

drexem commented 10 months ago

I am developing a CSV validator in C# that follows the CSVW recommendations.

I have some questions about the interpretation of the number patterns defined for numeric datatypes. If I understand correctly, there are two cases to consider:

  1. when the number pattern contains 'E' and is therefore in a Scientific Notation
  2. default

First, my questions about the first case (1.):

  a. The linked document states: "The number of digit characters after the exponent character gives the minimum exponent digit count. There is no maximum". What exactly is a digit character? Is it the character '0', '#', or both? Are patterns like 0E0##0# correct?
  b. What gives the minimum number of integer digits in the pattern? Is it the leftmost '0' in the integer part of the pattern?
  c. What gives the minimum number of fractional digits in the pattern? Is it the rightmost '0' in the fractional part?
  d. Can the groupChar appear in the exponent part of the pattern?

Questions about the second case (2.):

  a. What gives the minimum number of integer digits in the pattern? Is it the leftmost '0' in the integer part of the pattern?
  b. What gives the minimum number of fractional digits in the pattern? Is it the rightmost '0' in the fractional part?
  c. How should the grouping separator be treated in the fractional and the exponent part? The document states the following:

The grouping separator is a character that separates clusters of integer digits to make large numbers more legible. It is commonly used for thousands, but in some locales it separates ten-thousands. The grouping size is the number of digits between the grouping separators, such as 3 for "100,000,000" or 4 for "1 0000 0000". There are actually two different grouping sizes: One used for the least significant integer digits, the primary grouping size, and one used for all others, the secondary grouping size. In most locales these are the same, but sometimes they are different. For example, if the primary grouping interval is 3, and the secondary is 2, then this corresponds to the pattern "#,##,##0", and the number 123456789 is formatted as "12,34,56,789". If a pattern contains multiple grouping separators, the interval between the last one and the end of the integer defines the primary grouping size, and the interval between the last two defines the secondary grouping size. All others are ignored, so "#,##,###,####" == "###,###,####" == "##,#,###,####".

However, based on the integration tests, I am not sure exactly how the grouping separator should work in such cases.

  d. What gives the maximum number of fractional digits in the pattern? Based on Validation Test 304, I would say that it is the number of [#0] characters.

One last question that is not related to either of these cases. If the character '+' is present in the pattern, a plus sign must appear in front of non-negative numbers. When the '+' character is not present, does that mean the plus sign in front of non-negative numbers (or the exponent) is optional?

Thank you for your help in advance.

gkellogg commented 10 months ago

Recognize that these questions are mostly relevant to UAX35 and not CSVW specifically. UAX35's primary consideration is for formatting data, not parsing. We really just use it for parsing.

So firstly my questions about the first case (1.): a. In the linked document it is stated that: "The number of digit characters after the exponent character gives the minimum exponent digit count. There is no maximum". What is the digit character exactly? Is it the character '0' or '#' or both?

In the number pattern, # represents a digit from 0 through 9, or an appropriate glyph based on the locale. 0 means to zero pad, as necessary. In either case, it's the actual digit that will be used after applying the pattern. In the case of the mantissa, there can be any number of more significant digits, as necessary.

Are patterns like this: 0E0##0# correct?

Depending on the implementation you use, that might not be valid: on the left-hand side of the decimal point, you'd expect any 0 to follow the #s (as in #,##0), and on the right-hand side, to precede them (as in 0.0##), to indicate padding.
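As a rough illustration of that ordering rule, here is a minimal Python sketch. It assumes a simplified subset of the UAX35 pattern grammar: no prefixes, suffixes, signs, '@' significant digits, or rounding digits, and group characters are simply dropped before checking. The function name `placeholders_well_ordered` is hypothetical and not taken from any existing implementation.

```python
import re

# Simplified, assumed grammar (not the full UAX35 one):
#   integer part:  '#'s followed by '0's
#   fraction part: '0's followed by '#'s
#   exponent part: optional '+', then one or more '0's
_INTEGER = re.compile(r"#*0*")
_FRACTION = re.compile(r"0*#*")
_EXPONENT = re.compile(r"\+?0+")


def placeholders_well_ordered(pattern: str, group_char: str = ",") -> bool:
    """Return True if the digit placeholders appear in the expected order."""
    mantissa, has_exponent, exponent = pattern.partition("E")
    if has_exponent and not _EXPONENT.fullmatch(exponent):
        return False
    integer, _, fraction = mantissa.partition(".")
    # Group characters are ignored here; only placeholder order is checked.
    integer = integer.replace(group_char, "")
    fraction = fraction.replace(group_char, "")
    if not integer and not fraction:
        return False
    return bool(_INTEGER.fullmatch(integer)) and bool(_FRACTION.fullmatch(fraction))
```

Under these assumptions, `0E0##0#` is rejected because of the '#' characters in the exponent, while `#,##0.0#` is accepted.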

b. What gives the minimum number of integer digits in the pattern? Is it the leftmost '0' in the integer part of the pattern?

The number of # or 0 indicates the template to apply. If a 0, smaller numbers are padded out to have at least that many digits. If a #, no extra padding is added.

c. What gives the minimum number of fractional digits in the pattern? Is it the rightmost '0' in the fractional part?

Yes, but smaller fractional parts will be matched. As mentioned, UAX35 is mostly about emitting numbers rather than parsing them. We restrict ourselves to parsing numbers.
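To make the counting concrete, here is a hedged Python sketch that reads the digit limits off a pattern the way UAX35 describes them for formatting: each '0' is a required digit, each '#' an optional one. Whether a parser or validator should enforce these limits is a separate decision. The names `DigitLimits` and `digit_limits` are hypothetical, and the sketch ignores prefixes, suffixes, and the special role integer digits play in scientific notation.

```python
from typing import NamedTuple


class DigitLimits(NamedTuple):
    min_integer: int   # number of '0' placeholders in the integer part
    min_fraction: int  # number of '0' placeholders in the fraction part
    max_fraction: int  # number of '0' and '#' placeholders in the fraction part
    min_exponent: int  # number of '0' placeholders after 'E'


def digit_limits(pattern: str, group_char: str = ",") -> DigitLimits:
    """Count the digit placeholders in a bare number pattern."""
    mantissa, _, exponent = pattern.partition("E")
    integer, _, fraction = mantissa.partition(".")
    # Assumption: group characters in the fraction carry no digit meaning.
    fraction = fraction.replace(group_char, "")
    return DigitLimits(
        min_integer=integer.count("0"),
        min_fraction=fraction.count("0"),
        max_fraction=fraction.count("0") + fraction.count("#"),
        min_exponent=exponent.count("0"),
    )
```

For example, `digit_limits("#,##0.00#")` gives a minimum of 1 integer digit and between 2 and 3 fraction digits.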

d. Can the groupChar be in the exponent part of the pattern?

My implementation doesn't support this, but maybe?

Questions about second case (2.): a. What gives the minimum number of integer digits in the pattern? Is it the leftmost '0' in the integer part of the pattern?

I don't think there is a minimum number of digits to be matched.

b. What gives the minimum number of fractional digits in the pattern? Is it the rightmost '0' in the fractional part?

Same.

c. How should the grouping separator be treated in the fractional and the exponent part? The document states the following:

The grouping separator is a character that separates clusters of integer digits to make large numbers more legible. It is commonly used for thousands, but in some locales it separates ten-thousands. The grouping size is the number of digits between the grouping separators, such as 3 for "100,000,000" or 4 for "1 0000 0000". There are actually two different grouping sizes: One used for the least significant integer digits, the primary grouping size, and one used for all others, the secondary grouping size. In most locales these are the same, but sometimes they are different. For example, if the primary grouping interval is 3, and the secondary is 2, then this corresponds to the pattern "#,##,##0", and the number 123456789 is formatted as "12,34,56,789". If a pattern contains multiple grouping separators, the interval between the last one and the end of the integer defines the primary grouping size, and the interval between the last two defines the secondary grouping size. All others are ignored, so "#,##,###,####" == "###,###,####" == "##,#,###,####".

however based on the integration tests I am not sure exactly how the grouping separator should work in such cases.

For parsing purposes, I think the group character is largely ignored. At least, it is in my implementation.
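A lenient parser along those lines could look like the following Python sketch. This is an assumption about what "largely ignored" might mean in practice, not a description of any particular implementation; `parse_lenient` is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_lenient(
    text: str, group_char: str = ",", decimal_char: str = "."
) -> Optional[Decimal]:
    """Parse a number, dropping every group character wherever it appears."""
    # Remove group characters first so that a '.' used for grouping
    # is not confused with the decimal point inserted below.
    cleaned = text.replace(group_char, "")
    if decimal_char != ".":
        cleaned = cleaned.replace(decimal_char, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

With this approach a value such as `12.34,567` is accepted and read as 12.34567, because the position of the group character is never checked.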

I think it would be reasonable for a tool meant to validate input data to take a more strict interpretation of number patterns (or other patterns, for that matter) and report on fields in the input that don't correspond to the pattern.
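A stricter check might derive the primary and secondary grouping sizes from the pattern, as in the UAX35 text quoted above, and then require the integer part of each value to follow them. The Python sketch below is one possible reading; the function names are hypothetical, and rejecting long ungrouped integers is a design choice of this sketch rather than something the specs require.

```python
import re
from typing import Optional, Tuple


def grouping_sizes(pattern: str, group_char: str = ",") -> Optional[Tuple[int, int]]:
    """Return (primary, secondary) grouping sizes, or None if the integer part is ungrouped."""
    integer = pattern.partition("E")[0].partition(".")[0]
    groups = integer.split(group_char)
    if len(groups) < 2:
        return None
    # Last separator to end of integer: primary size.
    primary = len(groups[-1])
    # Interval between the last two separators: secondary size.
    secondary = len(groups[-2]) if len(groups) > 2 else primary
    return primary, secondary


def integer_grouping_ok(
    digits: str, sizes: Tuple[int, int], group_char: str = ","
) -> bool:
    """Check that the integer part of a value is grouped exactly as the sizes demand."""
    primary, secondary = sizes
    g = re.escape(group_char)
    strict = re.compile(
        rf"\d{{1,{secondary}}}(?:{g}\d{{{secondary}}})*{g}\d{{{primary}}}"
        rf"|\d{{1,{primary}}}"
    )
    return strict.fullmatch(digits) is not None
```

For the pattern `#,##,##0` this yields sizes (3, 2), so `12,34,56,789` passes while `123,456,789` is reported as not matching.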

d. What gives the maximum number of fractional digits in the pattern? Based on the Validation Test 304 I would say that it is the number of characters [#0].

Generally, UAX35 defines a grammar for these patterns, and patterns that don't match that grammar would be considered an error. The last entry in the referenced table is an empty value with no characters, which would seem to be legitimate.

One last question that is not related to any of these cases. If the character '+' is present in the pattern it means that before the number there must be plus sign in front of the non-negative numbers. When the '+' character is not present it means that the plus sign in front of non-negative numbers (or exponent) is optional?

I believe so. It's intended to describe if a "+" should be used for numbers when formatting. It would largely be ignored when parsing.

Thank you for your help in advance.

Sorry, it's been a long time since we worked on these specs, and the details are a bit fuzzy without really diving back into it now.

drexem commented 9 months ago

Thanks for your explanation!

I have a question though. So test 304 fails because there is an empty line in the referenced table? Not because the table entry "12.34,567" does not match the specified pattern "#0.0#,#"?

There are also multiple other tests that contain such an empty entry (303, 302, 301, ...). Do these also fail for the same reason?