noborus / trdsql

CLI tool that can execute SQL queries on CSV, LTSV, JSON, YAML and TBLN. Can output to various formats.
https://noborus.github.io/trdsql/
MIT License
1.96k stars 73 forks source link

UTF-8 Byte Order Markers should be ignored #156

Open jetzerb opened 3 years ago

jetzerb commented 3 years ago

If a data file includes Byte Order Markers (in my case, UTF-8 BOM ef bb bf), those bytes should be ignored by trdsql, but instead are currently treated as part of the data file:

$ trdsql -ih 'select [Service Type] from data.csv limit 1'
2021/05/13 10:37:05 export: no such column: Service Type [select [Service Type] from `data.csv` limit 1]

$ sed -n '1{s/,.*//; p;}' data.csv  | tee >(hexyl)
Service Type
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ef bb bf 53 65 72 76 69 ┊ 63 65 20 54 79 70 65 0a │×××Servi┊ce Type_│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

trdsql currently thinks the "Service Type" column name is prefixed with the 3 byte BOM:

$ col=$(printf "%b" '\xef\xbb\xbfService Type'); echo $col | hexyl; trdsql -ih -oat "select [$col] from data.csv limit 1"
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ef bb bf 53 65 72 76 69 ┊ 63 65 20 54 79 70 65 0a │×××Servi┊ce Type_│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
+--------------------------------+
|          Service Type          |
+--------------------------------+
| Construction - General         |
+--------------------------------+

I don't know if other encodings also have byte order markers, but I see that Go's standard library will apparently never handle BOMs so each application must deal with them individually 😞

noborus commented 3 years ago

Thank you for reporting the issue. I know it doesn't work with UTF-8 BOM, but I don't want to support it if possible. It may be dealt with in the future, but I would like you to deal with it in other ways if possible.