Matching a column header with leading or trailing whitespace?

dbooth-boston commented 9 years ago

I did not see any response to my 10-Jun-2015 message about this: https://lists.w3.org/Archives/Public/public-csv-wg/2015Jun/0040.html

In looking at http://w3c.github.io/csvw/syntax/ section 7.4.1 says "There are no constraints on these titles." http://w3c.github.io/csvw/syntax/#h-headers However, the grammar given in section 7.5 distinguishes between 'field' and 'rawfield', where a 'rawfield' does not start or end with whitespace. This implicitly suggests -- but does not state -- that the name of a column header should not contain leading or trailing whitespace.

Should a column header "First Name " be considered a match of "First Name"? For convenience -- and to avoid surprises -- I think it should, but I didn't see anything in the spec that explicitly answers this either way. I don't see a good rationale for allowing a column name to differ by leading or trailing whitespace.

What do others think? Has this already been discussed and decided?

gkellogg commented 9 years ago

The grammar section you reference is a non-normative re-statement of RFC4180. Specific tabular encodings may place restrictions on field contents, and neither the Tabular Data Model nor Tabular Metadata places any additional restrictions on these values.

Matching titles does require a code-point by code-point match. However, the trim dialect setting may cause such spaces to be removed, depending on its value.
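To make the point concrete, here is a minimal Python sketch of that comparison. The function name `titles_match` is hypothetical and not taken from the spec or any library. It assumes that trimming is applied only to the value read from the CSV, and that Python's `str.strip()` is an acceptable stand-in for the spec's notion of whitespace.

```python
def titles_match(header_cell: str, metadata_title: str, trim: bool = False) -> bool:
    """Hypothetical helper: compare a CSV header cell with a metadata title."""
    # Assumption: only the cell read from the CSV is trimmed; the title from
    # the metadata is compared exactly as written.
    if trim:
        header_cell = header_cell.strip()
    # Python string equality compares the two strings code point by code point.
    return header_cell == metadata_title


print(titles_match("First Name ", "First Name"))             # False
print(titles_match("First Name ", "First Name", trim=True))  # True
```

With trimming off, the trailing space makes the two titles different strings, which is the surprise described in the original question.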

While we do have tests that ensure leading and trailing whitespace is preserved, as appropriate, it doesn't seem that these also look for leading and trailing whitespace around titles, which might be considered for additional tests.

dbooth-boston commented 9 years ago

Actually, when I read section 8 http://w3c.github.io/csvw/syntax/#h-headers which is marked as non-normative, I notice that in the algorithm to parse a document, step 7.3 describes how to parse header rows, and says to "parse the row to provide a list of cell values". That process in turn is defined at http://w3c.github.io/csvw/syntax/#dfn-parse-a-row where its step 4 says to "trim the current cell value". This implies that headers are always trimmed, which I think is the behavior that CSVW should specify. However, in the main algorithm for parsing a document, step 7.3.2.1 says: "If the value at index i in the list of cell values is an empty string or consists only of whitespace, do nothing.". But if header fields are always trimmed, then it would be impossible for a header to consist only of whitespace, so it sounds like step 7.3.2.1 should instead say: "If the value at index i in the list of cell values is an empty string, do nothing."

If I have understood the algorithm correctly, it does (at least partly) specify the behavior that I think it should specify: that leading and trailing whitespace in a header is ignored. However: (a) I do not see where this is specified normatively; and (b) I do not see any requirement that column names or titles in the JSON metadata be trimmed. If the column names and titles in the JSON metadata are not also trimmed prior to testing Schema Compatibility, then if a CSV column header contained "Dog Breed " (note the trailing whitespace), and the JSON metadata also specified a column title of "Dog Breed " (note the trailing whitespace), then these would not be considered a match according to the Schema Compatibility rules (if I've understood them correctly). In other words, the spec should also say that the column names and titles specified in JSON metadata are also trimmed.

Again, if this is already specified normatively somewhere then please accept my apology and point me to where it is specified.

dbooth-boston commented 9 years ago

Also, it would be good to add conformance tests for the following combinations:

A CSV header called " Dog Breed" (note leading whitespace), with a corresponding JSON metadata column title called "Dog Breed" (without leading whitespace).
A CSV header called "Dog Breed " (note trailing whitespace), with a corresponding JSON metadata column title called "Dog Breed" (without trailing whitespace).
A CSV header called "Dog Breed" (without leading whitespace), with a corresponding JSON metadata column title called " Dog Breed" (note leading whitespace).
A CSV header called "Dog Breed" (without trailing whitespace), with a corresponding JSON metadata column title called "Dog Breed " (note trailing whitespace).
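The four combinations above can be sketched as a small table-driven check in Python. This is only an illustration under stated assumptions: `header_matches` and `run_cases` are hypothetical names, the trim flag is treated as a plain boolean, and metadata titles are assumed never to be trimmed (whether they should be is exactly the open question in this thread).

```python
# (CSV header, JSON metadata title) pairs, in the order listed above.
CASES = [
    (" Dog Breed", "Dog Breed"),   # leading space in the CSV header
    ("Dog Breed ", "Dog Breed"),   # trailing space in the CSV header
    ("Dog Breed", " Dog Breed"),   # leading space in the metadata title
    ("Dog Breed", "Dog Breed "),   # trailing space in the metadata title
]


def header_matches(csv_header: str, metadata_title: str, trim: bool) -> bool:
    """Hypothetical check: exact comparison after optionally trimming the CSV cell."""
    cell = csv_header.strip() if trim else csv_header
    # Assumption: the metadata title is used as written, never trimmed.
    return cell == metadata_title


def run_cases(trim: bool) -> list:
    return [header_matches(header, title, trim) for header, title in CASES]


print(run_cases(trim=False))  # [False, False, False, False]
print(run_cases(trim=True))   # [True, True, False, False]
```

Under these assumptions, trimming the CSV side repairs only the first two cases; whitespace in the metadata title still prevents a match.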

gkellogg commented 9 years ago

Note that this algorithm is also informative (as is everything specific to CSVs). The particular library I use does not distinguish header cells from body cells, and doesn't remove surrounding whitespace automatically. Indeed, this is why we have trim and skipInitialSpace. IMO, it would be wrong to treat header cells any differently by default.

The operative parsing section would be Parsing Tabular Data step 7.3.2.2:

Otherwise, if there is no column description object at index i in M.tableSchema.columns, create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values. Note that cell string values are not normalized, which is the process used for creating the cell value and that is where trim is considered.

Titles are produced when creating embedded metadata, which references Parsing Tabular Data. This uses the parse a row algorithm (step 7.3), which invokes the trim algorithm, which in turn uses the dialect description to determine how to get values from cells.

We might consider updating test003 to add whitespace around the header cells and possibly add the further tests @dbooth-boston suggests, but I believe that the behavior (in the absence of a specific trim dialect setting) would be to preserve whitespace within the embedded titles.

dbooth-boston commented 9 years ago

I just discovered that I made a mistake in understanding the algorithm in section 8 http://w3c.github.io/csvw/syntax/#parsing because 'trim' has been defined in that algorithm not to trim, but to conditionally trim: http://w3c.github.io/csvw/syntax/#dfn-trim-a-cell-value To avoid confusion and surprise, I suggest calling it 'conditionally trim' when used as a verb, such as "4 If there are no more characters to read, conditionally trim the current cell value . . . " and "To _conditionally trim a cell value_ . . . ."
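A rough Python sketch of what "conditionally trim" means in practice. The name `conditionally_trim` is hypothetical, and the set of flag values (`true`, `false`, `start`, `end`) is my reading of the dialect description rather than something stated in this thread, so treat it as an assumption. Python's `strip` family is used as a stand-in for the spec's whitespace handling.

```python
def conditionally_trim(value: str, trim) -> str:
    """Hypothetical helper: trim a cell value only as the trim flag directs."""
    flag = str(trim).lower()  # accept the booleans True/False as well as strings
    if flag == "true":
        return value.strip()    # remove leading and trailing whitespace
    if flag == "start":
        return value.lstrip()   # remove leading whitespace only
    if flag == "end":
        return value.rstrip()   # remove trailing whitespace only
    if flag == "false":
        return value            # leave the value untouched
    raise ValueError(f"unsupported trim flag: {trim!r}")


print(repr(conditionally_trim(" Dog Breed ", False)))    # ' Dog Breed '
print(repr(conditionally_trim(" Dog Breed ", "start")))  # 'Dog Breed '
print(repr(conditionally_trim(" Dog Breed ", True)))     # 'Dog Breed'
```

The same function is called for every cell, so whether anything is removed depends entirely on the flag, not on the name of the step.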

Returning to the more substantive issue, I agree that 'trim' should be the default for regular data cells, but header cells are for metadata, and metadata has an entirely different purpose than regular cell data. @gkellogg , aside from making the algorithm a tiny bit simpler, can you explain why you think "it would be wrong to treat header cells any differently by default"? In particular, what user value do you see in distinguishing the column header "Dog Breed " from "Dog Breed"?
Is this something that you believe users would like to be able to do? If so, for what?

AFAICT it would only encourage bad practice and lead to frustration. A trailing space, like at the end of "Dog Breed ", is invisible in a spreadsheet. I could well imagine a user becoming frustrated, not understanding why his/her CSVW processor claimed that the CSV data file schema did not match the JSON metadata, when one called the column "Dog Breed" and the other inadvertently called it "Dog Breed ".

gkellogg commented 9 years ago

@dbooth-boston ultimately, this is just an opinion, but we do have clear rules for trimming cells, and I can just as well argue that using the same logic for header cells as body cells makes sense. I've marked it for discussion on the next call.

Another alternative would be to make the default value of trim true, instead of false.

Regarding "conditionally trim a cell value": this is simply editorial, and I can make that change.

dbooth-boston commented 9 years ago

Thanks @gkellogg . I don't know when the next teleconference is, but two more comments:

+1 for making the default value of trim true for regular data cells. That's exactly what I've done in tabular data processing software that I've written. I think it would be very rare that a user would want to make it false.

But one point I'd like to stress: metadata is qualitatively different than data. I think the argument for trimming the headers (metadata) is even stronger than the argument for trimming the data cells by default. When I thought through this exact case in my own software a few months ago I was unable to come up with any good reason for someone to treat a header of "Dog Breed " differently from "Dog Breed" (though I saw the harm). So if you or anyone else comes up with a reason why someone might want to do so, I'd be very interested to know what it is.

iherman commented 9 years ago

@dbooth-boston I think some of the reasons are really pragmatic.

As @gkellogg said, we do not (are not supposed to!) standardize the behaviour of CSV parsers, and the available libraries also differ widely in their capabilities. In particular, when using some libraries (I played with some implementations and, as far as I remember, this is what I ended up doing, too), the distinction between header row/cells and body cells falls to the CSVW processor, while the actual parsing is done by the external library. On the other hand, trimming may very well be a parsing parameter of the library. This means that if we decided to have separate trimming behaviour for header and body cells, we would have to do the trim at the CSVW level, and not leave it to the library; I presume that, when handling very large files, this may lead to efficiency loss.
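As one illustration of this split, here is a sketch using Python's standard `csv` module as the external library. The library knows nothing about header rows; it only offers `skipinitialspace`, which drops spaces immediately after the delimiter. The function `read_table` and its `trim` parameter are hypothetical and stand in for the CSVW-level processing; the sample data is made up.

```python
import csv
import io

SAMPLE = "Dog Breed , Count\nPoodle , 3\n"


def read_table(text: str, trim: bool = False, **reader_options):
    """Hypothetical CSVW-level wrapper around the library's row parser."""
    # The library parses every row the same way; header and body are not distinguished.
    rows = list(csv.reader(io.StringIO(text), **reader_options))
    if trim:
        # Trimming done above the library, applied uniformly to all cells.
        rows = [[cell.strip() for cell in row] for row in rows]
    # Only here does the wrapper decide that the first row is the header.
    return rows[0], rows[1:]


print(read_table(SAMPLE))
# (['Dog Breed ', ' Count'], [['Poodle ', ' 3']])
print(read_table(SAMPLE, skipinitialspace=True))
# (['Dog Breed ', 'Count'], [['Poodle ', '3']])
print(read_table(SAMPLE, trim=True))
# (['Dog Breed', 'Count'], [['Poodle', '3']])
```

Treating the header differently from the body would mean skipping the library option and adding a second pass over the header row in the wrapper, which is the extra work described above.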

I do not really see what would be the value of treating headers differently from bodies in general, and such pragmatic issues become then important. I definitely vote for leaving the uniformity in place.

dbooth-boston commented 9 years ago

If we are agreed that trim should default to true, then I think that addresses my concerns well enough about trimming headers: they can be treated the same as data cells. It would be very rare that people would set trim false, but if they do then they are clearly aware of the possibility of leading and trailing spaces anyway, so they'll be prepared to deal with them.

Do others agree to make trim default to true, as @gkellogg suggested?

iherman commented 9 years ago

I am fine with this

I.

JeniT commented 9 years ago

RESOLVED: We will change trim to default to true.

danbri commented 9 years ago

See http://www.w3.org/2015/07/01-csvw-irc#T14-10-59

w3c / csvw

Matching a column header with leading or trailing whitespace? #632