onetrueawk / awk

One true awk

Is there a way to use 8-bit character sets in the UTF-8-supporting version? #190

Closed ko1nksm closed 10 months ago

ko1nksm commented 10 months ago

Thank you for maintaining this project. I am very excited about the new UTF-8 and CSV support.

However, I am having a bit of trouble. I am using LC_ALL=C awk ... to percent-encode UTF-8 characters byte by byte, but this no longer seems to work with newer awk versions. The following is a simplified version of the code I use.

$ ./a.out --version
awk version 20230911

$ echo あ | LC_ALL=C ./a.out '
  BEGIN { for(i = 0; i < 256; i++) t[sprintf("%c", i)] = sprintf("%%%02X", i) }
  { for(i = 1; i <= length($0); i++) printf "%s", t[substr($0, i, 1)] }
  END { print "" }
'
(no output)

gawk works as I expect.

$ echo あ | LC_ALL=C gawk '
  BEGIN { for(i = 0; i < 256; i++) t[sprintf("%c", i)] = sprintf("%%%02X", i) }
  { for(i = 1; i <= length($0); i++) printf "%s", t[substr($0, i, 1)] }
  END { print "" }
'
%E3%81%82

Is there any workaround?

plan9 commented 10 months ago

Thanks for the report. I have no workaround.

arnoldrobbins commented 10 months ago

Hi. There is no workaround. This awk is not locale-aware; UTF-8 processing is hardcoded. This could be changed, but none of us has the time or inclination to do so right now. Your best bet is to use gawk, which is locale-aware, or mawk, which only works with 8-bit characters. Closing this issue.

ko1nksm commented 10 months ago

OK. Thank you.
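
For readers who reach this thread with the same problem: the replies above say there is no workaround inside this awk. One possible way to sidestep the issue, which nobody in the thread proposed or tested, is to convert the input to ASCII hex with od before awk sees it, so awk never has to split a multibyte character. The sketch below assumes a POSIX od is available. The function name percent_encode_bytes is hypothetical. Like the original program, it encodes every byte, including plain ASCII.

```shell
# Hypothetical helper (the name is illustrative): percent-encode every byte of stdin.
# Assumption: a POSIX od is installed. awk only ever sees ASCII hex digits,
# so it should not matter whether the awk in use is byte- or UTF-8-oriented.
percent_encode_bytes() {
  # od -An  : omit the offset column
  # od -v   : never collapse repeated lines into "*"
  # od -tx1 : print each input byte as a two-digit hex number
  od -An -v -tx1 |
    awk '
      # Each field is one byte in hex; emit it as %XX.
      { for (i = 1; i <= NF; i++) printf "%%%s", toupper($i) }
      END { print "" }
    '
}
```

With UTF-8 input, `printf '%s' 'あ' | percent_encode_bytes` prints %E3%81%82. Note that od also sees the trailing newline that `echo` adds, so `echo あ | percent_encode_bytes` would end in %0A, whereas the awk programs above drop the record terminator.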