Closed daeho-ro closed 1 year ago
I merged a commit an hour ago to skip BOM bytes ... sorry, I didn't realize you were working on it ... might be worth checking that out to combine with your header work here - commit ba897d3f03d2bad06770699e76f81812a4133b93
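For readers following along, BOM skipping can be sketched roughly as below. This is not the code from the commit referenced above; `skipBOM` and `readCSV` are hypothetical names, and the sketch assumes only a UTF-8 BOM needs handling.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// utf8BOM is the three-byte UTF-8 byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// skipBOM (hypothetical helper) wraps r and drops a leading UTF-8 BOM if present.
func skipBOM(r io.Reader) io.Reader {
	br := bufio.NewReader(r)
	// Peek does not consume; if the input is shorter than 3 bytes it returns
	// an error and the buffered data is left untouched.
	if head, err := br.Peek(len(utf8BOM)); err == nil && bytes.Equal(head, utf8BOM) {
		_, _ = br.Discard(len(utf8BOM))
	}
	return br
}

// readCSV (hypothetical helper) parses CSV text after stripping any BOM.
func readCSV(input string) ([][]string, error) {
	return csv.NewReader(skipBOM(strings.NewReader(input))).ReadAll()
}

func main() {
	records, err := readCSV("\xEF\xBB\xBFa,b\n1,2\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Without skipBOM, the first header value would carry the BOM bytes.
	fmt.Printf("%q\n", records[0])
}
```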
@e-gineer Oh, I didn't even know about that! I will merge your code and update the PR soon. Then I can remove my BOM updates.
I reset my code base to the most recent commit and kept only the code for the default column name setting.
@daeho-ro This is an interesting case, as right now the CSV plugin is designed to fail when an invalid header row is passed, e.g., not all values are defined, duplicate header values. Our original thinking was that in order for the plugin to create tables with proper column names, we'd require a header row.
However, according to RFC 4180:
There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type).
So if we follow RFC 4180, the header could be optional, and the plugin could use the logic from this PR:
For column names, I think I'd avoid starting them with `_`, as this is reserved for Steampipe columns like `_ctx`, so column names could end up being `c0`, `c1`, `c2`, etc.
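A minimal sketch of that naming scheme, assuming a hypothetical helper called `defaultColumnNames` (the name is not from the PR):

```go
package main

import "fmt"

// defaultColumnNames (hypothetical helper) returns n placeholder column
// names: c0, c1, ..., c(n-1). The "c" prefix, rather than "_c", follows the
// suggestion above to stay clear of Steampipe's reserved "_" columns.
func defaultColumnNames(n int) []string {
	names := make([]string, n)
	for i := range names {
		names[i] = fmt.Sprintf("c%d", i)
	}
	return names
}

func main() {
	fmt.Println(defaultColumnNames(5)) // prints [c0 c1 c2 c3 c4]
}
```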
Alternatively, we could follow what we do in the Google Sheets plugin, which uses the following conventions as per Table Restrictions and Notes:
The first and last items don't apply to this table, but we could alternatively take the same approach in this plugin, using numbers instead of letters throughout the column names.
@daeho-ro Please let me know if you have any thoughts on the alternative approach I outlined above (or about anything else I mentioned).
@johnsmyth I'd also be interested to hear any thoughts you have on this topic, thanks!
If the header is present but corrupted, handling it the way the Google Sheets plugin does looks good. But there are many cases where the header is missing, and that is the case I'm thinking about. There, the first row is just a row of data and should not be modified. So I want to add a default header and keep the first row as data when the header is malformed. To prevent data tampering, I rule out modifying the header, which is why I'd like to add a default header.
I think we could add an option so that the user can select how the plugin sets the header. Maybe a header option that is always true, always false, or auto.
I turned this PR into a draft because it could conflict with the earlier PR #41.
There are a lot of changes in my code, and the aim is to keep the original behavior. To use the updated header feature, the user should set the option to "auto", which changes the header only if it has an empty or duplicated value. If this comes to be seen as the standard, you can make it the default behavior later.
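The decision logic described above might look roughly like the sketch below. The mode strings `"on"`, `"off"`, and `"auto"` and the function names are assumptions for illustration, not necessarily what the PR implements.

```go
package main

import "fmt"

// headerIsValid reports whether every header value is non-empty and unique.
func headerIsValid(header []string) bool {
	seen := make(map[string]bool, len(header))
	for _, h := range header {
		if h == "" || seen[h] {
			return false
		}
		seen[h] = true
	}
	return true
}

// useFirstRowAsHeader (hypothetical) decides whether the first row should be
// treated as the header. The mode values are assumed names:
//   - "on":   require a valid header and fail otherwise (the original behavior)
//   - "off":  never treat the first row as a header
//   - "auto": use the first row only if it is a valid header; otherwise the
//     caller falls back to default column names and keeps the row as data
func useFirstRowAsHeader(mode string, firstRow []string) (bool, error) {
	switch mode {
	case "on":
		if !headerIsValid(firstRow) {
			return false, fmt.Errorf("invalid header row: %q", firstRow)
		}
		return true, nil
	case "off":
		return false, nil
	case "auto":
		return headerIsValid(firstRow), nil
	default:
		return false, fmt.Errorf("unknown header mode %q", mode)
	}
}

func main() {
	ok, err := useFirstRowAsHeader("auto", []string{"a", "", "b", "b", "c"})
	fmt.Println(ok, err) // prints false <nil>
}
```

With `"auto"`, a first row that is not a valid header is never rewritten; it stays in the table as ordinary data.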
Thanks for the ideas and implementation, @daeho-ro, we really appreciate all of the thought and work you put into this PR!
If there is no header row, you may want to use the default column names. Of course, it is not easy to detect that, but at least we know the case exists.
You could add more cases, but I am only dealing with those two. I have tested with `golangci-lint` locally.
Example query results
Results
1. When the header row has an empty value,

   ```
   > cat test.csv
   a,,b,b,c
   1,1,1,1,1
   2,2,2,2,2
   3,3,3,3,3
   4,4,4,4,4
   ```

   the query result is given as follows:

   ```
   > select * from test
   +-----+-----+-----+-----+-----+---------------------------+
   | _c0 | _c1 | _c2 | _c3 | _c4 | _ctx                      |
   +-----+-----+-----+-----+-----+---------------------------+
   | 2   | 2   | 2   | 2   | 2   | {"connection_name":"csv"} |
   | 4   | 4   | 4   | 4   | 4   | {"connection_name":"csv"} |
   | a   |     | b   | b   | c   | {"connection_name":"csv"} |
   | 1   | 1   | 1   | 1   | 1   | {"connection_name":"csv"} |
   | 3   | 3   | 3   | 3   | 3   | {"connection_name":"csv"} |
   +-----+-----+-----+-----+-----+---------------------------+
   ```

2. When the header row has a duplicated value,

   ```
   > cat test.csv
   a,a,b,b,c
   1,1,1,1,1
   2,2,2,2,2
   3,3,3,3,3
   4,4,4,4,4
   ```

   the query result is given as follows:

   ```
   > select * from test
   +-----+-----+-----+-----+-----+---------------------------+
   | _c0 | _c1 | _c2 | _c3 | _c4 | _ctx                      |
   +-----+-----+-----+-----+-----+---------------------------+
   | 3   | 3   | 3   | 3   | 3   | {"connection_name":"csv"} |
   | a   | a   | b   | b   | c   | {"connection_name":"csv"} |
   | 1   | 1   | 1   | 1   | 1   | {"connection_name":"csv"} |
   | 2   | 2   | 2   | 2   | 2   | {"connection_name":"csv"} |
   | 4   | 4   | 4   | 4   | 4   | {"connection_name":"csv"} |
   +-----+-----+-----+-----+-----+---------------------------+
   ```