rrthomas / recode

Charset converter tool and library
GNU General Public License v3.0

Rename fails in some circumstances on WSL #39

Open catharsis71 opened 2 years ago

catharsis71 commented 2 years ago

I am running Ubuntu on WSL.

I was running the following to find files with Windows-1252 encoding and convert them to UTF-8:

LC_ALL=C.UTF-8 find . -type f \( -name '*txt' -or -name '*html' -or -name '*htm' \) -exec grep -laxv '.*' {} + | xargs uchardet | grep WINDOWS-1252 | cut -d: -f1 | xargs -n 1 recode -t 'windows-1252..UTF-8'

Unfortunately, it seems that recode does not work properly with NTFS filesystems at all.

I ended up with hundreds of these messages:

recode: chmod (./path/path/rec4308.tmp): Operation not permitted
recode: chmod (./path/path/rec4309.tmp): Operation not permitted
recode: chmod (./path/path/rec4310.tmp): Operation not permitted
recode: chmod (./path/path/rec4311.tmp): Operation not permitted
recode: chmod (./path/path/rec4312.tmp): Operation not permitted
recode: chmod (./path/path/rec4313.tmp): Operation not permitted
recode: chmod (./path/path/rec4314.tmp): Operation not permitted
recode: chmod (./path/path/rec4315.tmp): Operation not permitted
recode: chmod (./path/path/rec4316.tmp): Operation not permitted
recode: chmod (./path/path/rec4317.tmp): Operation not permitted
recode: chmod (./path/path/rec4318.tmp): Operation not permitted
recode: chmod (./path/path/rec4319.tmp): Operation not permitted
recode: chmod (./path/path/rec4320.tmp): Operation not permitted
recode: chmod (./path/path/rec4321.tmp): Operation not permitted

All of the original files (and all filename information) are gone.

The .tmp files are in fact UTF-8, but there's no way to know what the original filenames were, so the files are effectively useless.

Even if I had the original filenames, there's no way to know which .tmp file corresponds to which original.

It's not uncommon for Linux programs to not work perfectly on NTFS, but I've never encountered anything this bad before.

I "lost" nearly 400 files, and it would have been more if I hadn't noticed the errors and aborted the job.

Here's an example using a single file:

$ file testfile
testfile: HTML document, ASCII text, with very long lines, with LF, NEL line terminators
$ uchardet testfile
WINDOWS-1250
$ recode -t 'windows-1250..UTF-8' testfile
recode: chmod (rec5087.tmp): Operation not permitted
$ ls testfile
ls: cannot access 'testfile': No such file or directory
$ ls *.tmp
rec5087.tmp
$ file *.tmp
rec5087.tmp: HTML document, UTF-8 Unicode text, with very long lines
$ uchardet *.tmp
WINDOWS-1250
$

With a single file it's not a big deal to rename the .tmp file back to the original filename (as long as you have the original filename) but when many files are affected it seems impossible to recover from, especially if you don't have the original filenames.
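One way to guard against this kind of loss is to record the filenames and keep a copy of each file before running any in-place conversion. The following is a minimal sketch, not part of recode; the function name `backup_with_manifest`, the `.orig` suffix, and the manifest file are all hypothetical choices made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of recode): copy each file to "<name>.orig"
# and append its name to a manifest before an in-place conversion is run.
# Plain cp is used without -p, since preserving permissions may itself
# fail on some mounted Windows filesystems.
backup_with_manifest() {
    manifest=$1
    shift
    for f in "$@"; do
        # Stop at the first file that cannot be backed up.
        cp -- "$f" "$f.orig" || return 1
        printf '%s\n' "$f" >> "$manifest"
    done
}
```

Calling `backup_with_manifest converted.list ./a.txt ./b.html` before passing the same files to recode would leave `./a.txt.orig` and `./b.html.orig` in place, at the cost of extra disk space.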

I verified that the same thing happens even without -t.

I verified that this happens on both NTFS and exFAT but does NOT happen on FAT32.

This issue might or might not be specific to WSL; a pure Linux system with an NTFS or exFAT filesystem mounted might behave differently, but I'm unable to test this.

rrthomas commented 2 years ago

Ouch! Sorry about your data loss, and thanks for such a detailed report. A first look at the code leaves me baffled, as the only call to chmod() is made without caring about whether it succeeds (it's done on a "best effort" basis), so I can't work out how that error message is being emitted. I will investigate further.

Please could you confirm what version of recode you're using?

catharsis71 commented 2 years ago

I'm using the Ubuntu package which seems to be 3.6-24... am I in the wrong place? I didn't look too closely at the package info yesterday so I might be in the wrong place.

Package: recode
Version: 3.6-24
Priority: optional
Section: text
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Santiago Vila <sanvila@debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 209 kB
Depends: libc6 (>= 2.4), librecode0 (>= 3.6)
Download-Size: 111 kB

The --version on mine doesn't seem like it's been updated in quite a while

$ recode --version
Free recode 3.6
Written by Franc,ois Pinard <pinard@iro.umontreal.ca>.

Copyright (C) 1990, 92, 93, 94, 96, 97, 99 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

rrthomas commented 2 years ago

I would certainly appreciate a report against the latest 3.7.x. I am working with Debian to get 3.7 packaged, but it's taking a while!

catharsis71 commented 2 years ago

Okay, on 3.7.12 it seems to work properly on NTFS but still fails in the same way on exFAT, albeit with a different error message. The end result is the same: the original file is gone, the .tmp file is retained, and there is no easy way to figure out which .tmp file came from which file if you ran it on a lot of files.

exFAT filesystem:

:/mnt/d$ recode --version
recode 3.7.12
Written by François Pinard <pinard@iro.umontreal.ca>.

Copyright (C) 1990-2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
:/mnt/d$ echo HELLO > testing.txt
:/mnt/d$ file testing.txt
testing.txt: ASCII text
:/mnt/d$ uchardet testing.txt
ASCII
:/mnt/d$ recode utf8..utf16 testing.txt
/home/username/bin/.libs/recode: rename (/mnt/d/recode-ALN8Z9.tmp, /mnt/d/testing.txt): No such file or directory
:/mnt/d$ ls -l testing.txt
ls: cannot access 'testing.txt': No such file or directory
:/mnt/d$ ls -l *.tmp
-rwxrwxrwx 1 root root    60 Mar 28 09:02 rec10721.tmp
-rwxrwxrwx 1 root root 47307 Mar 28 01:27 rec5097.tmp
-rwxrwxrwx 1 root root    14 Mar 28 09:52 recode-ALN8Z9.tmp
:/mnt/d$ file recode-ALN8Z9.tmp
recode-ALN8Z9.tmp: Big-endian UTF-16 Unicode text
:/mnt/d$ uchardet recode-ALN8Z9.tmp
UTF-16
:/mnt/d$

NTFS filesystem:

:/mnt/g$
:/mnt/g$ echo HELLO > testing.txt
:/mnt/g$ file testing.txt
testing.txt: ASCII text
:/mnt/g$ uchardet testing.txt
ASCII
:/mnt/g$ recode utf8..utf16 testing.txt
:/mnt/g$ file testing.txt
testing.txt: Big-endian UTF-16 Unicode text
:/mnt/g$ uchardet testing.txt
UTF-16
:/mnt/g$

I saw there was a --verbose option so I gave it a try on exFAT, but it didn't provide a whole lot of info:

Request: UTF-8..:iconv:..UTF-16
Shrunk to: UTF-8..UTF-16
Request: UTF-8..ISO-10646-UCS-4..UTF-16
Recoding /mnt/d/test2.txt... done
/home/username/bin/.libs/recode: rename (/mnt/d/recode-Li08fA.tmp, /mnt/d/test2.txt): No such file or directory

This could possibly be a WSL bug or incompatibility. I should probably file a bug with WSL, but it'd be useful to know whether this happens on a pure Linux system too or only with WSL. Unfortunately I'm not able to test on pure Linux currently.

Just throwing out ideas, maybe the temp file name could just be the real filename with .tmp appended to it? That way if something goes wrong and you end up with a bunch of .tmp files you at least know what the names are supposed to be.
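The naming idea above can be approximated outside recode with a small wrapper. This is a sketch under stated assumptions, not how recode itself works: it runs a converter as a stdin-to-stdout filter (recode acts as a filter when no file arguments are given), writes to `<name>.tmp`, and only then moves the result over the original. The function name `safe_filter_replace` is hypothetical, and the thread does not establish whether `mv` would hit the same failure on WSL; if it does, the original is still intact and the converted copy carries a recognisable name.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of recode).
# Usage: safe_filter_replace FILE FILTER_COMMAND [ARGS...]
# The filter must read stdin and write stdout.
safe_filter_replace() {
    file=$1
    shift
    tmp="$file.tmp"
    if "$@" < "$file" > "$tmp"; then
        # The original is never deleted first; mv replaces it in one step.
        if ! mv -f -- "$tmp" "$file"; then
            echo "replace failed; converted data kept in $tmp" >&2
            return 1
        fi
    else
        # Conversion failed: drop the partial output, keep the original.
        rm -f -- "$tmp"
        echo "conversion failed; $file left untouched" >&2
        return 1
    fi
}
```

For example, `safe_filter_replace testfile recode windows-1250..UTF-8` would convert `testfile` via `testfile.tmp`.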

catharsis71 commented 2 years ago

Doing further testing... it looks like 3.7.12 is broken on FAT32 filesystems as well, even though the older version worked properly on FAT32...

:/mnt/e$ echo HELLO > temp.txt
:/mnt/e$ recode --verbose utf8..utf16 temp.txt
Request: UTF-8..:iconv:..UTF-16
Shrunk to: UTF-8..UTF-16
Request: UTF-8..ISO-10646-UCS-4..UTF-16
Recoding /mnt/e/temp.txt... done
/home/username/bin/.libs/recode: rename (/mnt/e/recode-hAtOC2.tmp, /mnt/e/temp.txt): No such file or directory
:/mnt/e$ ls -l temp.txt
ls: cannot access 'temp.txt': No such file or directory
:/mnt/e$ ls -l *.tmp
-rwxrwxrwx 1 cmcphers cmcphers 14 Mar 28 10:30 recode-TIBDLW.tmp
-rwxrwxrwx 1 cmcphers cmcphers 14 Mar 28 10:32 recode-hAtOC2.tmp
:/mnt/e$ file (/mnt/e/recode-hAtOC2.tmp,
-bash: syntax error near unexpected token `/mnt/e/recode-hAtOC2.tmp,'
:/mnt/e$ file /mnt/e/recode-hAtOC2.tmp
/mnt/e/recode-hAtOC2.tmp: Big-endian UTF-16 Unicode text
:/mnt/e$ uchardet /mnt/e/recode-hAtOC2.tmp
UTF-16
:/mnt/e$

So in summary:

3.6-24 -- works on FAT32 but broken on exFAT and NTFS
3.7.12 -- works on NTFS but broken on exFAT and FAT32

No issues on my native WSL filesystem which I think is ext4, but my space there is limited so I basically have to use my mounted Windows drives for a lot of stuff.

rrthomas commented 2 years ago

Thanks for your further investigation.

If I try your example on a FAT32 filing system attached to my Ubuntu machine, it works fine, so the filing system doesn't appear to matter.

As you've observed, it's the rename() system call that is failing, so the data is not (fortunately!) being lost in this case. However, I agree it would be no fun trying to recover it from the temporary files. The No such file or directory error suggests that the rename() routine, at least, is having trouble with the filename. I assume that /mnt/e is really /home?

At first blush it looks as though for some reason rename() doesn't understand the filename while the other routines that open the file etc. understand it just fine. I'm afraid I don't know much about WSL, so I have no idea why this would be.
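To compare filesystems or platforms without risking real data, a throwaway probe can exercise a delete-then-rename sequence in a given directory. This is only a sketch: the thread does not confirm the exact sequence of calls recode makes, and `mv` may not behave identically to recode's own rename() call, so a passing probe does not prove recode will work there. The function name `probe_rename` and the probe filenames are hypothetical.

```shell
#!/bin/sh
# Hypothetical probe: checks whether removing a file and then renaming a
# temporary file onto its name works in the given directory.
probe_rename() {
    dir=${1:-.}
    target="$dir/rename-probe.txt"
    tmp="$dir/rename-probe.tmp"
    printf 'old\n' > "$target" || return 2
    printf 'new\n' > "$tmp" || return 2
    rm -f -- "$target"
    if mv -- "$tmp" "$target" && [ "$(cat "$target")" = "new" ]; then
        echo "rename OK in $dir"
        rm -f -- "$target"
    else
        echo "rename FAILED in $dir; probe data left in $tmp" >&2
        return 1
    fi
}
```

Running `probe_rename /mnt/e` and `probe_rename "$HOME"` side by side would show whether the mounted drive behaves differently from the native filesystem for this sequence.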

rrthomas commented 2 years ago

gnulib has a rename() wrapper that recode is not currently using, but it doesn't have any code in it that references WSL. Some of the rename tests do mention WSL, but none of the cases tested seem to relate to this one.

catharsis71 commented 2 years ago

I filed a bug on WSL: https://github.com/microsoft/WSL/issues/8201

The error message on 3.7 is definitely more useful than what I was initially encountering on 3.6, because it actually shows the original and temporary filenames together. So on 3.7 at least, if the program output hasn't been lost, renaming the files back manually isn't a huge deal. The files I converted yesterday on 3.6, though, are probably a lost cause.
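If the 3.7 error output was saved to a file, the renaming can be scripted. The sketch below assumes each failure line has the shape seen in this thread, `...: rename (TEMP, ORIGINAL): No such file or directory`, and that neither path contains `, ` or `): `. The function name `restore_from_log` is hypothetical, and the script relies on `mv` succeeding where recode's rename did not, which manual renaming suggests but which is not guaranteed.

```shell
#!/bin/sh
# Hypothetical recovery helper: reads saved recode 3.7 error output and
# moves each temporary file back to the original name given in the message.
restore_from_log() {
    log=$1
    grep ': rename (' "$log" | while IFS= read -r line; do
        # Extract the two paths from "rename (TEMP, ORIGINAL): ..."
        tmp=$(printf '%s\n' "$line" | sed 's/^.*: rename (\(.*\), .*): .*$/\1/')
        orig=$(printf '%s\n' "$line" | sed 's/^.*: rename (.*, \(.*\)): .*$/\1/')
        # Only act when the temp file exists and nothing would be overwritten.
        if [ -e "$tmp" ] && [ ! -e "$orig" ]; then
            mv -- "$tmp" "$orig" && echo "restored $orig"
        fi
    done
}
```

The log could be captured by appending `2> recode-errors.log` to the recode command and then running `restore_from_log recode-errors.log`.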

rrthomas commented 2 years ago

Sorry I can't help more, and I hope someone at MS or with better knowledge of WSL can work out what's going wrong here. There may well be a fix or workaround even if it's not a recode bug.