turbot / steampipe-plugin-csv

Use SQL to instantly query data from CSV files. Open source CLI. No DB required.
https://hub.steampipe.io/plugins/turbot/csv
Apache License 2.0

set the default column names for some cases #42

Closed daeho-ro closed 1 year ago

daeho-ro commented 1 year ago

If there is no header row, you may want to use default column names. Of course, it is not easy to detect that automatically, but we at least know of two cases where the header row cannot be used as-is:

  1. The header row has an empty value
  2. The header row has a duplicated value

You could add more cases, but I am only handling these two. I have run golangci-lint locally.
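To make the check concrete, here is a minimal Go sketch of the kind of validation and fallback naming described in this PR; the function names are illustrative, not the plugin's actual code.

```go
package main

import "fmt"

// headerIsUsable reports whether a CSV header row can be used as-is:
// it must contain no empty values and no duplicated values.
func headerIsUsable(header []string) bool {
	seen := make(map[string]bool, len(header))
	for _, name := range header {
		if name == "" || seen[name] {
			return false
		}
		seen[name] = true
	}
	return true
}

// defaultHeader generates fallback column names (_c0, _c1, ...),
// matching the names shown in the example results below.
func defaultHeader(n int) []string {
	names := make([]string, n)
	for i := range names {
		names[i] = fmt.Sprintf("_c%d", i)
	}
	return names
}

func main() {
	row := []string{"a", "", "b", "b", "c"}
	if !headerIsUsable(row) {
		fmt.Println(defaultHeader(len(row))) // [_c0 _c1 _c2 _c3 _c4]
	}
}
```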

Example query results

Results

1. When the header row has an empty value,

```
> cat test.csv
a,,b,b,c
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
4,4,4,4,4
```

the query result is given as follows:

```
> select * from test
+-----+-----+-----+-----+-----+---------------------------+
| _c0 | _c1 | _c2 | _c3 | _c4 | _ctx                      |
+-----+-----+-----+-----+-----+---------------------------+
| 2   | 2   | 2   | 2   | 2   | {"connection_name":"csv"} |
| 4   | 4   | 4   | 4   | 4   | {"connection_name":"csv"} |
| a   |     | b   | b   | c   | {"connection_name":"csv"} |
| 1   | 1   | 1   | 1   | 1   | {"connection_name":"csv"} |
| 3   | 3   | 3   | 3   | 3   | {"connection_name":"csv"} |
+-----+-----+-----+-----+-----+---------------------------+
```

2. When the header row has a duplicated value,

```
> cat test.csv
a,a,b,b,c
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
4,4,4,4,4
```

the query result is given as follows:

```
> select * from test
+-----+-----+-----+-----+-----+---------------------------+
| _c0 | _c1 | _c2 | _c3 | _c4 | _ctx                      |
+-----+-----+-----+-----+-----+---------------------------+
| 3   | 3   | 3   | 3   | 3   | {"connection_name":"csv"} |
| a   | a   | b   | b   | c   | {"connection_name":"csv"} |
| 1   | 1   | 1   | 1   | 1   | {"connection_name":"csv"} |
| 2   | 2   | 2   | 2   | 2   | {"connection_name":"csv"} |
| 4   | 4   | 4   | 4   | 4   | {"connection_name":"csv"} |
+-----+-----+-----+-----+-----+---------------------------+
```
daeho-ro commented 1 year ago
**BOM related, not used now**

While setting the header names, I can handle the BOM bytes, which cause errors reported in several issues. (I modified my comment to unlink the issue because I will not use this code, but the BOM bytes are already fixed.) When I set the headers, I simply remove the prefix unicode `\uFEFF` from the first header and that is it.

Here is what I tested. I prepared a file that has BOM bytes; you can see that the first 3 bytes `efbbbf` indicate the BOM.

```
xxd test2.csv
00000000: efbb bf61 2c62 2c63 0d0a 622c 322c eb82  ...a,b,c..b,2,..
00000010: 980d 0a63 2c33 2ceb 8ba4 0d0a 642c 342c  ...c,3,.....d,4,
00000020: eb9d bc0d 0a65 2c35 2ceb a788 0d0a 662c  .....e,5,.....f,
00000030: 362c ebb0 940d 0a67 2c37 2cec 82ac 0d0a  6,.....g,7,.....
```

Then, the query creates a table, but still with the BOM bytes!

```
> .inspect test2
+--------+-------+-------------------------------------------------------+
| column | type  | description                                           |
+--------+-------+-------------------------------------------------------+
| _ctx   | jsonb | Steampipe context in JSON form, e.g. connection_name. |
| b      | text  | Field 1.                                              |
| c      | text  | Field 2.                                              |
| a      | text  | Field 0.                                              |
+--------+-------+-------------------------------------------------------+
```

The column order should be `a`, `b`, and `c`, but `a` still carries the BOM bytes as a hidden character, which mixes up the order. After I add the removal step:

```
> .inspect test2
+--------+-------+-------------------------------------------------------+
| column | type  | description                                           |
+--------+-------+-------------------------------------------------------+
| _ctx   | jsonb | Steampipe context in JSON form, e.g. connection_name. |
| a      | text  | Field 0.                                              |
| b      | text  | Field 1.                                              |
| c      | text  | Field 2.                                              |
+--------+-------+-------------------------------------------------------+
```

The column order is fixed and I can query column `a`.

```
> select a from test2
+---+
| a |
+---+
| b |
| g |
| d |
| f |
| e |
| c |
+---+
```
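For illustration, the trimming step described above amounts to a one-line prefix removal. A minimal Go sketch, assuming the parsed header is available as a `[]string` (not the plugin's actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// trimBOM strips a leading UTF-8 byte order mark (\uFEFF) from the
// first header value so that column "a" is not stored as "\uFEFFa".
func trimBOM(header []string) []string {
	if len(header) > 0 {
		header[0] = strings.TrimPrefix(header[0], "\uFEFF")
	}
	return header
}

func main() {
	fmt.Println(trimBOM([]string{"\uFEFFa", "b", "c"})) // [a b c]
}
```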
e-gineer commented 1 year ago

I merged a commit an hour ago to skip BOM bytes ... sorry, I didn't realize you were working on it ... might be worth checking that out to combine with your header work here - commit ba897d3f03d2bad06770699e76f81812a4133b93

daeho-ro commented 1 year ago

@e-gineer Oh, I didn't even know about that! I will merge your code and update the PR soon! Then I can remove my BOM updates.

daeho-ro commented 1 year ago

I reset my code base to the most recent commit and kept only the code for the default column name setting.

cbruno10 commented 1 year ago

@daeho-ro This is an interesting case, as right now the CSV plugin is designed to fail when an invalid header row is passed, e.g., not all values are defined or there are duplicate header values. Our original thinking was that in order for the plugin to create tables with proper column names, we'd require a header row.

However, according to RFC 4180:

> There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type).

So if we follow RFC 4180, the header could be optional, and the plugin could use the logic from this PR.

For column names, I think I'd avoid starting them with `_`, as this is reserved for Steampipe columns like `_ctx`, so column names could end up being `c0`, `c1`, `c2`, etc.
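As a small illustration of that naming suggestion (a hypothetical helper, not plugin code):

```go
package main

import "fmt"

// columnName builds a generated column name without an underscore
// prefix, since names starting with "_" are reserved for Steampipe
// columns such as _ctx.
func columnName(i int) string {
	return fmt.Sprintf("c%d", i)
}

func main() {
	fmt.Println(columnName(0), columnName(1), columnName(2)) // c0 c1 c2
}
```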

Alternatively, we could follow what we do in the Google Sheets plugin, which uses the following conventions as per Table Restrictions and Notes:

The first and last items don't apply to this table, but we could alternatively take the same approach in this plugin and use numbers instead of letters throughout the column names.

@daeho-ro Please let me know if you have any thoughts on the alternative approach I outlined above (or about anything else I mentioned).

@johnsmyth I'd also be interested to hear if you have any thoughts on this topic, thanks!

daeho-ro commented 1 year ago

If the header is present but corrupted, handling it the way Google Sheets does looks good. But there are many cases where the header is missing entirely, and that is the case I'm thinking about. In that case, the first row is just a row of data and should not be modified. So I want to add a default header to preserve the first row when the header is malformed. To avoid tampering with the data, I rule out modifying the header values themselves, and that is why I'd like to add the default header.

daeho-ro commented 1 year ago

I think it is possible to add an option so the user can select how the plugin sets the header. Maybe a `header` option with the values `true`, `false`, and `auto`.

daeho-ro commented 1 year ago

I turned this PR into a draft because it could conflict with the prior PR #41.

There are a lot of changes in my code, and the aim is to keep the original behavior. To use the updated header feature, the user should set the option to "auto", and it will change the header only if the header has an empty or duplicated value. If this behavior comes to be considered the standard, you can make it the default later.
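A hedged Go sketch of the option behavior described here, assuming a string option named `header` with values `"true"`, `"false"`, and `"auto"`; the option name and helper functions are illustrative, not the plugin's actual config schema.

```go
package main

import "fmt"

// resolveHeader decides which column names to use for a CSV file,
// given a header option of "true", "false", or "auto":
//   - "true":  always treat the first row as the header (original behavior).
//   - "false": never treat the first row as the header; generate names.
//   - "auto":  use the first row unless it has empty or duplicated values,
//     in which case generate names and keep the first row as data.
func resolveHeader(option string, firstRow []string) (names []string, firstRowIsData bool) {
	switch option {
	case "false":
		return generated(len(firstRow)), true
	case "auto":
		if !usable(firstRow) {
			return generated(len(firstRow)), true
		}
		return firstRow, false
	default: // "true"
		return firstRow, false
	}
}

// usable reports whether the header row has no empty or duplicated values.
func usable(header []string) bool {
	seen := map[string]bool{}
	for _, v := range header {
		if v == "" || seen[v] {
			return false
		}
		seen[v] = true
	}
	return true
}

// generated produces fallback column names c0, c1, c2, ...
func generated(n int) []string {
	names := make([]string, n)
	for i := range names {
		names[i] = fmt.Sprintf("c%d", i)
	}
	return names
}

func main() {
	names, keepFirstRow := resolveHeader("auto", []string{"a", "a", "b", "b", "c"})
	fmt.Println(names, keepFirstRow) // [c0 c1 c2 c3 c4] true
}
```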

cbruno10 commented 1 year ago

Thanks for the ideas and implementation, @daeho-ro, we really appreciate all of the thought and work you put into this PR!