rmyorston / busybox-w32

WIN32 native port of BusyBox.
https://frippery.org/busybox

Handle UTF-16 input file #318

Closed ale5000-git closed 1 year ago

ale5000-git commented 1 year ago

I need to handle it with just busybox-w32.

My idea was this:

iconv -o ./modified_file.csv -f UTF-16LE -t UTF-8 ./original_file.csv
repl="$(printf '%b' '\0e2' '\080' '\09d')"
sed -i "s/${repl:?}/\"/g" ./modified_file.csv
iconv -o ./modified_file.csv -c -f UTF-8 -t ASCII ./modified_file.csv

The sed line doesn't work. It should replace the UTF-8 right double quotation mark (bytes E2 80 9D) with a plain ". Any suggestions?

ale5000-git commented 1 year ago

I had mixed up octal and hex. Now I'm a step further:

iconv -o output.csv -f UTF-16LE -t UTF-8 supported_devices_orig.csv
repl="$(printf '%b' '\xe2' '\x80' '\x9d')"
sed -i "s/${repl:?}/\"/g" ./output.csv
iconv -o output.csv -c -f UTF-8 -t ASCII output.csv

But the last line gives an empty file.

rmyorston commented 1 year ago

The last command has the same file for input and output. That works on Linux but doesn't seem to work with busybox-w32 iconv.
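With a binary that behaves this way, one workaround is to write to a separate file and rename it afterwards. A minimal sketch, assuming a hypothetical helper name `convert_in_place` and a `.tmp` suffix that doesn't collide with an existing file. It redirects stdout rather than using -o, so it doesn't depend on that option being available:

```shell
# Hypothetical helper: convert a file "in place" without handing iconv
# the same path for input and output.
# Usage: convert_in_place FROM TO FILE
convert_in_place() {
    from="$1"; to="$2"; file="$3"
    tmp="${file}.tmp"    # assumed not to exist already
    if iconv -f "$from" -t "$to" "$file" > "$tmp"; then
        # Replace the original only after a successful conversion.
        mv -f "$tmp" "$file"
    else
        # Leave the original untouched on failure.
        rm -f "$tmp"
        return 1
    fi
}
```

For example, `convert_in_place UTF-16LE UTF-8 ./output.csv`. On failure the original file is left as it was.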

ale5000-git commented 1 year ago

Could iconv handle a temporary file internally? If that isn't possible, could it detect this case and return failure without changing the file?

rmyorston commented 1 year ago

Sure, I'm looking into it now.

ale5000-git commented 1 year ago

Thanks a lot :)

rmyorston commented 1 year ago

OK, the latest prerelease binaries create a temporary file for output and rename it on completion. Similar to what sed does.

ale5000-git commented 1 year ago

It works fine thanks.

I have just one question about my code above: is there any way to use \342\200\235 directly in sed, without the printf?

ale5000-git commented 1 year ago

If possible, could you also implement iconv --version? Then I could easily distinguish it from the annoying GNU libiconv, which doesn't support -o.

The output of the others:

iconv (GNU libiconv 1.16)
Copyright (C) 2000-2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Bruno Haible.

iconv (Ubuntu GLIBC 2.35-0ubuntu3.1) 2.35
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.

rmyorston commented 1 year ago

sed in BusyBox has this:

    /* Lie to autoconf when it starts asking stupid questions. */
    if (argv[1] && strcmp(argv[1], "--version") == 0) {
        puts("This is not GNU sed version 4.0");
        return 0;
    }

Which, of course, isn't a lie.

rmyorston commented 1 year ago

> is there any way to use \342\200\235 directly in sed without the printf or not?

Not that I can see. Handling of backslash escapes in upstream BusyBox sed is quite limited.
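Keeping the printf step is also a portable choice: octal escapes in the printf format string are specified by POSIX, while \x escapes are not. A sketch of the same replacement using the octal form, wrapped in a hypothetical filter named `fix_quotes` (the sample text is made up for illustration):

```shell
# Hypothetical filter: replace U+201D (UTF-8 bytes E2 80 9D) on stdin
# with a plain ASCII double quote.
fix_quotes() {
    # \342\200\235 is the octal spelling of bytes E2 80 9D.
    repl="$(printf '\342\200\235')"
    sed "s/${repl}/\"/g"
}

# Sample input containing two U+201D characters.
printf 'say \342\200\235hello\342\200\235\n' | fix_quotes
```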

> If possible could you also implement iconv --version?

If we want actual version information it would probably be best to implement --version for all applets. Like how they (mostly) all support --help.

The sed case, though, illustrates another possible requirement: to pretend to be compatible with common applications to fool things like autoconf.

ale5000-git commented 1 year ago

Thanks. For sed it isn't a problem; I'll just keep the variable.

I write my own code, so I can do what I want, but in some cases I have to distinguish between implementations because they support different parameters.
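As a rough sketch of how that distinction could be scripted, the first line of the `iconv --version` output can be matched against the two banners quoted earlier in this thread. The helper name `classify_iconv_banner` is hypothetical. The `GNU libc` pattern is an assumption about unpatched glibc builds. Anything else, including an iconv with no --version support, is reported as unknown:

```shell
# Hypothetical helper: classify the first line of `iconv --version` output.
classify_iconv_banner() {
    case "$1" in
        *'GNU libiconv'*)        echo 'libiconv' ;;
        *'GLIBC'*|*'GNU libc'*)  echo 'glibc' ;;
        *)                       echo 'unknown' ;;
    esac
}

# An implementation without --version may print an error or nothing at all;
# either way the result is "unknown".
banner="$(iconv --version 2>&1 | head -n 1)"
classify_iconv_banner "$banner"
```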

ale5000-git commented 1 year ago

This isn't really connected to busybox, but do you know why GNU libiconv returns exit status 1 for this command even though it performs the conversion? iconv -c -f 'UTF-8' -t 'WINDOWS-1252' ./input.txt 1> ./output.txt || echo "Fail ${?}"

If the file is UTF-8 with a BOM it returns 1; if it is UTF-8 without a BOM it returns 0.
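Whatever the cause turns out to be, if the BOM isn't needed in the output, one way to take it out of the picture is to strip it before converting. A sketch, assuming a hypothetical filter name `strip_utf8_bom`:

```shell
# Hypothetical filter: drop a leading UTF-8 BOM (bytes EF BB BF) from stdin.
strip_utf8_bom() {
    # Build the three BOM bytes with octal escapes.
    bom="$(printf '\357\273\277')"
    # Only the start of the first line is touched.
    sed "1s/^${bom}//"
}
```

It could then be used as `strip_utf8_bom < ./input.txt | iconv -c -f 'UTF-8' -t 'WINDOWS-1252' > ./output.txt`. Input without a BOM passes through unchanged.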

rmyorston commented 1 year ago

Sorry, I don't know anything about GNU libiconv. I wasn't even aware it existed until you mentioned it. I only knew of the GNU libc iconv.

ale5000-git commented 1 year ago

I have it because it is included in both Git and Ruby for Windows.

I have noticed that this works: busybox iconv -c -f 'UTF-8' -t 'LATIN1//translit' ./input.txt 1> ./output.txt but this doesn't give an error: busybox iconv -c -f 'UTF-8' -t 'LATIN1//wrongtext' ./input.txt 1> ./output.txt

rmyorston commented 1 year ago

GNU libc iconv silently ignores LATIN1//wrongtext too.
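Since an unknown suffix may be ignored silently, a script that cares can check the target string itself before calling iconv. A minimal sketch, assuming a hypothetical helper `check_iconv_target` and assuming that //TRANSLIT and //IGNORE (in any letter case) are the only suffixes the script wants to allow. Which suffixes are actually honoured depends on the implementation:

```shell
# Hypothetical helper: accept a target encoding only if it has no suffix
# or one of the suffixes this script chooses to allow.
check_iconv_target() {
    case "$1" in
        *//*) suffix="$(printf '%s' "${1#*//}" | tr '[:lower:]' '[:upper:]')" ;;
        *)    return 0 ;;    # no suffix at all
    esac
    case "$suffix" in
        TRANSLIT|IGNORE) return 0 ;;
        *)  # Combined suffixes are not handled by this sketch.
            echo "unsupported suffix: //${1#*//}" >&2
            return 1 ;;
    esac
}
```

A call such as `check_iconv_target 'LATIN1//wrongtext' || exit 1` would then stop the script before iconv runs.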

ale5000-git commented 1 year ago

GNU libiconv says:

iconv: conversion to LATIN1//wrogntext unsupported
iconv: try 'iconv -l' to get the list of supported encodings

rmyorston commented 1 year ago

I've looked at the code. The two GNU implementations handle things quite differently:

ale5000-git commented 1 year ago

The major problem is that the supported suffixes (such as //translit) aren't listed anywhere, not in iconv --help and not even in iconv -l.

Would it be possible to list the supported ones in iconv --help? Otherwise someone who doesn't know them and just trusts --help will never find out.