Closed assarbad closed 2 years ago
The following .tar file contains the file names that cause the error renamer-error.tar.gz.
This should hopefully ensure that the file names appear on the target system the same as on mine. All files have been truncated to zero size for convenience.
The issue appears to be around the second code point in the file name starting with Z
.
After unpacking this .tar file on a n ext4 and btrfs partition, respectively, I am seeing the same issue with renamer, so it's not specific to NTFS or the ufsd (commercial) driver from Paragon.
For reference, this is the renaming I am attempting:
$ xxd -g 1 before-renaming.txt
00000000: 44 69 65 20 49 6e 76 65 73 74 6d 65 6e 74 62 61 Die Investmentba
00000010: 6e 6b 20 42 61 72 69 6e 67 73 20 42 61 6e 6b 20 nk Barings Bank
00000020: 5b 31 37 31 30 31 38 5f 62 61 72 69 6e 67 73 5f [171018_barings_
00000030: 69 6e 66 5d 2e 6d 70 34 0a 44 69 65 20 49 6e 76 inf].mp4.Die Inv
00000040: 65 73 74 6d 65 6e 74 62 61 6e 6b 20 4c 65 68 6d estmentbank Lehm
00000050: 61 6e 20 42 72 6f 74 68 65 72 73 20 5b 31 37 31 an Brothers [171
00000060: 30 31 38 5f 6c 65 68 6d 61 6e 5f 69 6e 66 5d 2e 018_lehman_inf].
00000070: 6d 70 34 0a 5a c3 b6 6c 69 62 61 74 20 2d 20 44 mp4.Z..libat - D
00000080: 65 72 20 6b 61 74 68 6f 6c 69 73 63 68 65 20 4c er katholische L
00000090: 65 69 64 65 6e 73 77 65 67 20 5b 31 30 38 30 34 eidensweg [10804
000000a0: 36 34 35 5d 2e 6d 70 34 0a 645].mp4.
$ xxd -g 1 after-renaming.txt
00000000: 44 69 65 20 49 6e 76 65 73 74 6d 65 6e 74 62 61 Die Investmentba
00000010: 6e 6b 20 42 61 72 69 6e 67 73 20 42 61 6e 6b 20 nk Barings Bank
00000020: 5b 31 37 31 30 31 38 5f 62 61 72 69 6e 67 73 5f [171018_barings_
00000030: 69 6e 66 5d 2e 6d 70 34 0a 44 69 65 20 49 6e 76 inf].mp4.Die Inv
00000040: 65 73 74 6d 65 6e 74 62 61 6e 6b 20 4c 65 68 6d estmentbank Lehm
00000050: 61 6e 20 42 72 6f 74 68 65 72 73 20 5b 31 37 31 an Brothers [171
00000060: 30 31 38 5f 6c 65 68 6d 61 6e 5f 69 6e 66 5d 2e 018_lehman_inf].
00000070: 6d 70 34 0a 5a c3 b6 6c 69 62 61 74 20 2d 20 44 mp4.Z..libat - D
00000080: 65 72 20 6b 61 74 68 6f 6c 69 73 63 68 65 20 4c er katholische L
00000090: 65 69 64 65 6e 73 77 65 67 2e 6d 70 34 0a eidensweg.mp4.
The respective part (Zölibat
) seems not to change at all, yet it's the part renamer
complains about:
5a c3 b6 6c 69 62 61 74
5a c3 b6 6c 69 62 61 74
As far as I could ascertain the c3 b6
translates fine to ö
(in HTML ö
).
Found it. rust-gdb
was a real help.
Okay, so the problem was introduced by #48. I am still trying to reason about it, but the author - @mtimkovich - is probably better equipped to help out here. It looks to me as if the code makes undue assumptions about what the (renamed) string ought to look like.
master
(currently at 56055f4271b2571e6928109dd9c4bfd0fd7c2526)Cargo.toml
(especially [profile.dev]
is relevant)cargo build
renamer-error.tgz
env RUST_BACKTRACE=1 rust-gdb -ex "b Rename::new" -ex run --args /path/to/target/debug/renamer *
n
(for next
) when back at the GDB prompt (note: after issuing a command once, you can hit Enter and it will get repeated, so instead of n
plus Enter again you can simply hit Enter to run next
again)Cargo.toml
# xref: https://bitshifter.github.io/rr+rust/#11
# The development profile, used for `cargo build`
[profile.dev]
opt-level = 0 # Controls the --opt-level the compiler builds with
debug = true # Controls whether the compiler passes `-g`
# The release profile, used for `cargo build --release`
[profile.release]
opt-level = 3
debug = false
I am cutting out some redundant text output for brevity. Replaced lines have been substituted by a [...]
:
$ env RUST_BACKTRACE=1 rust-gdb --args ../target/debug/renamer *
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
[...]
Reading symbols from ../target/debug/renamer...
[...]
Breakpoint 1 at 0x4f908: file src/main.rs, line 55.
[...]
Starting program: /home/oliver/pipe-rename/target/debug/renamer Die\ Investmentbank\ Barings\ Bank\ \[171018_barings_inf\].mp4 Die\ Investmentbank\ Lehman\ Brothers\ \[171018_lehman_inf\].mp4
Zölibat\ -\ Der\ katholische\ Leidensweg\ \[10804645\].mp4
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after vfork from child process 34368]
Breakpoint 1, renamer::Rename::new (original="Zölibat - Der katholische Leidensweg [10804645].mp4", new="Zölibat - Der katholische Leidensweg.mp4") at src/main.rs:55
55 let mut new = new.to_string();
(gdb) n
56 if let Ok(home) = env::var("HOME") {
(gdb)
57 if &new[..2] == "~/" {
(gdb)
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside 'ö' (bytes 1..3) of `Zölibat - Der katholische Leidensweg.mp4`', library/core/src/str/mod.rs:127:5
stack backtrace:
0: rust_begin_unwind
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/panicking.rs:142:14
2: core::str::slice_error_fail_rt
3: core::ops::function::FnOnce::call_once
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/ops/function.rs:248:5
4: core::intrinsics::const_eval_select
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/intrinsics.rs:2696:5
5: core::str::slice_error_fail
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/str/mod.rs:86:9
6: core::str::traits::<impl core::slice::index::SliceIndex<str> for core::ops::range::RangeTo<usize>>::index
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/str/traits.rs:294:21
7: core::str::traits::<impl core::ops::index::Index<I> for str>::index
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/str/traits.rs:65:9
8: <alloc::string::String as core::ops::index::Index<core::ops::range::RangeTo<usize>>>::index
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/alloc/src/string.rs:2324:10
9: renamer::Rename::new
at /home/oliver/pipe-rename/src/main.rs:57:17
10: renamer::find_renames::{{closure}}
at /home/oliver/pipe-rename/src/main.rs:147:22
11: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/ops/function.rs:290:13
12: core::iter::traits::iterator::Iterator::find_map::check::{{closure}}
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:2732:32
13: core::iter::traits::iterator::Iterator::try_fold
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:2238:21
14: core::iter::traits::iterator::Iterator::find_map
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:2738:9
15: <core::iter::adapters::filter_map::FilterMap<I,F> as core::iter::traits::iterator::Iterator>::next
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/adapters/filter_map.rs:61:9
16: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
17: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/alloc/src/vec/spec_from_iter.rs:33:9
18: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/alloc/src/vec/mod.rs:2648:9
19: core::iter::traits::iterator::Iterator::collect
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:1836:9
20: renamer::find_renames
at /home/oliver/pipe-rename/src/main.rs:140:27
21: renamer::main
at /home/oliver/pipe-rename/src/main.rs:351:28
22: core::ops::function::FnOnce::call_once
at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[Inferior 1 (process 34363) exited with code 0145]
What this shows us is that this is the problematic line.
In particular the issue seems to be that the slice created is a slice of bytes and therefore attempts to cut a Unicode codepoint right in its middle, which makes the underlying code barf.
example.tar.gz (Rust Playgound link%3B%0A%7D%0A%0Afn%20main()%20%7B%0A%20%20%20%20let%20new%3A%20%26str%20%3D%20%22Z%C3%B6libat%22%3B%20%2F%2F%20input%20as%20per%20line%2053%0A%20%20%20%20print_type_of(%26%22new%22%2C%20%26new)%3B%0A%20%20%20%20let%20new%20%3D%20new.to_string()%3B%20%2F%2F%20corresponds%20to%20line%2055%20(just%20without%20mut)%0A%20%20%20%20print_type_of(%26%22mut%20new%22%2C%20%26new)%3B%0A%20%20%20%20print_type_of(%26%22'~%2F'%22%2C%20%26%22~%2F%22)%3B%0A%20%20%20%20let%20first_two%20%3D%20%26new%5B..2%5D%3B%20%2F%2F%20the%20problematic%20part%20of%20line%2057%20(!!!)%0A%20%20%20%20print_type_of(%26%22first_two%22%2C%20%26first_two)%3B%0A%20%20%20%20println!(%22%7B%7D%22%2C%20new)%3B%0A%7D%0A)) shows the issue a little better. Perhaps String::from_utf8()
could be used to make the types in that condition on line 57 more compatible without having to care too much about what exactly is in the string at runtime?!
fn print_type_of<T>(name: &str, _: &T) {
println!("{}: {}", name, std::any::type_name::<T>());
}
fn main() {
let new: &str = "Zölibat"; // input as per line 53
print_type_of(&"new", &new);
let new = new.to_string(); // corresponds to line 55 (just without mut)
print_type_of(&"mut new", &new);
print_type_of(&"'~/'", &"~/");
let first_two = &new[..2]; // the problematic part of line 57 (!!!)
print_type_of(&"first_two", &first_two);
println!("{}", new);
}
Debug with rust-gdb -ex 'b main.rs:11' -ex run --args ./target/debug/teststr
This answer appears to explain it even better and more succinctly than I ever could.
The answer appears to be to use if new.starts_with("~/") {
instead of that slice of bytes.
Rust Playground link%3B%0A%7D%0A%0Afn%20main()%20%7B%0A%20%20%20%20let%20new%3A%20%26str%20%3D%20%22Z%C3%B6libat%22%3B%20%2F%2F%20input%20as%20per%20line%2053%0A%20%20%20%20print_type_of(%26%22new%22%2C%20%26new)%3B%0A%20%20%20%20let%20mut%20new%20%3D%20new.to_string()%3B%20%2F%2F%20corresponds%20to%20line%2055%0A%20%20%20%20print_type_of(%26%22mut%20new%22%2C%20%26new)%3B%0A%20%20%20%20print_type_of(%26%22'~%2F'%22%2C%20%26%22~%2F%22)%3B%0A%20%20%20%20if%20new.starts_with(%22~%2F%22)%20%7B%0A%20%20%20%20%20%20%20%20new%20%3D%20new.replacen(%22~%22%2C%20%22whatever%22%2C%201)%3B%0A%20%20%20%20%7D%0A%20%20%20%20println!(%22%7B%7D%22%2C%20new)%0A%7D%0A)
I think the tests could use some more wacky codepoints, too. Be it German umlauts, Icelandic characters (Þ,Ð ...), Cyrillic or Chinese characters 😉 ... just thinking out loud here.
Good idea with getting weirder with the test cases.
When on Ubuntu 20.04 with UTF-8 as code page I am seeing the following:
For the issue to surface I need to:
I am still collecting more information on this on my end, but I suspect some sort of normalization issue. This happens on an NTFS drive (which may play a role) with the commercial version of the Paragon NTFS driver (may also play a role). Could also have to do with NTFS as storage (which uses UTF-16 internally and may use other Unicode normalization rules).
I'll have to try to narrow this down further, but wanted to already bring this to your attention to the extent to which I have gathered information.
What would be great to see, however, would be the real bytes underneath. I mean I trust Rust to figure out if we end up in the middle of a surrogate or so, but since environmental factors like (potentially) normalization on part of the editor, the file system etc could play a role, it would be great to see input and output as raw bytes in case of this sort of panic.