kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0
527 stars 239 forks source link

send/receive fails with "invalid tlv in cmd tlv type = 816" #770

Closed dehnhard closed 3 weeks ago

dehnhard commented 3 months ago

Hi,

I setup a btrbk remote backup that transfers btrfs subvolumes. Source is a Debian 12 amd64 machine targeting a Debian 12 armel device (An older Zyxel NSA325v2 on which the proprietary OS is replaced with a Debian installation). btrfs-progs version is 6.2-1 on both sides (current in the stable bookworm repo).

The btrbk log shows the called send and receive commands and the resulting error:

ERROR: Failed to send/receive subvolume: /home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102  -> dbackup.lan:/home/dnotebook/testbackup/testbackup.20240330T1102
ERROR: ... Command execution failed (exitcode=1)
ERROR: ... sh: btrfs send '/home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102' | ssh dnotebook@dbackup.lan 'sudo -n btrfs receive '\''/home/dnotebook/testbackup/'\'''
ERROR: ... invalid tlv in cmd tlv_type = 816

Any ideas?

Forza-tng commented 3 months ago

Just of curiosity, did you recreate the filesystem after installing Debian?

dehnhard commented 3 months ago

Just of curiosity, did you recreate the filesystem after installing Debian?

No, I created the filesystem (a btrfs RAID1 with two harddisks) in the current Debian installation. Nothing further.

Another notice: The armel device also has a newer kernel with possible different config, but the btrfs module is loaded and everything else works so far.

Forza-tng commented 2 months ago

@dehnhard I believe the Marwell SoC is an ARM64, and Arm in general isn't as widely tested as AMD64.

It might be good to test newer btrfs-progs and kernel if possible. Latest btrfs-progs is v6.8. Since Debian doesn't have it, you can use the static builds as they do not need installation and can be used on any Linux system.

amd64: https://github.com/kdave/btrfs-progs/releases arm64: https://mirrors.tnonline.net/btrfs/btrfs-progs/arm64/

The arm64 build is provided by me as Github unfortunately doesn't have them yet. Build instructions if you want to compile yourself https://wiki.tnonline.net/w/Btrfs/Statically_built_btrfs-progs

But first, let's check if the error is on the sending or the receiving side.

Is there an error on the sending side? You can easily check if sending works properly using

btrfs send '/home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102' > /dev/null

If all is ok it should look like

# btrfs -vv send home.20240406T1301/ > /dev/null
Protocol version requested: 1 (supported 3)
At subvol home.20240406T1301/
BTRFS_IOC_SEND returned 0

Now check if there is an error on the receiving side. Can you check its dmesg? If dmesg has no errors you can test receiving from a local file.

  1. btrfs send '/home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102' > /path/to/file
  2. copy 'file' to the remote
  3. on the remote btrfs receive /home/dnotebook/testbackup/ < 'file'
dehnhard commented 2 months ago

@Forza-tng thanks for making it easy with those good instructions!

I believe the Marwell SoC is an ARM64, and Arm in general isn't as widely tested as AMD64.

No, it's called Kirkwood (see Info from OpenWRT here), I believe the official Debian platform is called armel. The Kernel is coming from here, a great guy keeps providing new kernels for these old machines, which are capable enough for tasks like NAS for backups. uname -a says Linux dbackup 6.7.5-kirkwood-tld-1 #1 PREEMPT Sat Feb 17 16:27:56 PST 2024 armv5tel GNU/Linux

It might be good to test newer btrfs-progs and kernel if possible.

I can certainly test new builds and would appreciate static built binaries for that. I'd prefer to avoid compiling and can't do a reinstall on that machine.

But first, let's check if the error is on the sending or the receiving side.

  1. Sending looks good.
    # sudo btrfs -vv send '/home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102' > /dev/null
    Protocol version requested: 1 (supported 2)
    At subvol /home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102
    BTRFS_IOC_SEND returned 0
  2. dmesg shows nothing
  3. Retrieve fails with that error
    # btrfs -vv receive /home/dnotebook/testbackup/ < /home/dnotebook/SENT
    At subvol testbackup.20240330T1102
    receiving subvol testbackup.20240330T1102 uuid=3ea2d27c-bacb-2141-9a98-6d3620a2642e, stransid=165679
    chown  - uid=1000, gid=1000
    chmod  - mode=0755
    utimes 
    ERROR: invalid tlv in cmd tlv_type = 816

    The receive command creates an empty subvolume with a wrong user, but correct uid (different on both ids). I guess thats expected? It's only relevant to the test setup. The real backup is uid 0 on both machines.

Forza-tng commented 2 months ago

@Forza-tng thanks for making it easy with those good instructions!

I believe the Marwell SoC is an ARM64, and Arm in general isn't as widely tested as AMD64.

No, it's called Kirkwood (see Info from OpenWRT here), I believe the official Debian platform is called armel. The Kernel is coming from here, a great guy keeps providing new kernels for these old machines, which are capable enough for tasks like NAS for backups. uname -a says Linux dbackup 6.7.5-kirkwood-tld-1 #1 PREEMPT Sat Feb 17 16:27:56 PST 2024 armv5tel GNU/Linux

Thanks for explaining. According to https://www.debian.org/ports/arm/index.en.html armel is an 32bit arm architecture.

It might be good to test newer btrfs-progs and kernel if possible.

I can certainly test new builds and would appreciate static built binaries for that. I'd prefer to avoid compiling and can't do a reinstall on that machine.

I don't have a 32bit buildhost for arm, so at the moment i cant make a static build.

But first, let's check if the error is on the sending or the receiving side.

  1. Sending looks good.
# sudo btrfs -vv send '/home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102' > /dev/null
Protocol version requested: 1 (supported 2)
At subvol /home/daniel/Experimente/backup/btrbk_snapshots/testbackup.20240330T1102
BTRFS_IOC_SEND returned 0

This is good.

  1. dmesg shows nothing
  2. Retrieve fails with that error
# btrfs -vv receive /home/dnotebook/testbackup/ < /home/dnotebook/SENT
At subvol testbackup.20240330T1102
receiving subvol testbackup.20240330T1102 uuid=3ea2d27c-bacb-2141-9a98-6d3620a2642e, stransid=165679
chown  - uid=1000, gid=1000
chmod  - mode=0755
utimes 
ERROR: invalid tlv in cmd tlv_type = 816

The receive command creates an empty subvolume with a wrong user, but correct uid (different on both ids). I guess thats expected? It's only relevant to the test setup. The real backup is uid 0 on both machines.

Did you try the receive as root user?

You could try using send protocol 2:

btrfs send --proto 2 > file
btrfs receive --proto 2 < file

If this is not working perhaps you could ask in #btrfs irc channel or take it to the maling list to get some input from the developers.

dehnhard commented 2 months ago

Did you try the receive as root user?

Yes, it fails with the same message. Also it works with local source and target as root and with an without --proto 2. So it looks like something in the data from the source is unknown to the receiving side.

What about looking at the semantics? As far as I have been able to look, there is a variable called tlv_type in the btrfs-progs code. What does tlv_type 816 mean and why might it be unknown for the receiver?

For compiling a static binary I'll have to see when I can put some time aside to try. I hope the instructions in the README/INSTALL docs are somewhat straightforward.

kdave commented 2 months ago

ERROR: ... invalid tlv in cmd tlv_type = 816

The command types and attributes are defined at https://github.com/kdave/btrfs-progs/blob/master/kernel-shared/send.h#L128, the maximum for attributes is value 35. The reported error is way off, so it's either a corruption in the stream, or the Zyxel NAS is using a custom stream extension.

It's known that Synology has such extension and the stream is not compatible with upstream, most likely Zyxel did something like that too.

It would be possible to work around that, but ideal solution is to poke Zyxel to send the extensions upstream.

kdave commented 2 months ago

I could not find any downloadable sources on Zyxel sites (forums, community, general search). If it exists please post the link, linux kernel and btrfs-progs. We can add it as the protocol extensions are possible.

Forza-tng commented 2 months ago

It's known that some nas vendors have a customised btrfs that isn't fully compatible. It's why I asked if the filesystem was created after debian was installed. It unfortunately seems it wasn't. @dehnhard can you confirm?

dehnhard commented 2 months ago

@Forza-tng Maybe I misunderstood what you meant in your first question with "recreate"? The btrfs filesystem was created after the Debian installation using the pure Debian binaries on new and empty harddisks. In fact I did nothing with the original NAS firmware than directly replacing it with Debian.

@kdave Thanks for the explanation. Something like a custom stream extension sounds like a possible culprit. Is my understanding correct, that it's like a special hardware mode similar to a GPU or Crypto extension which would need some specific commands / driver code to use it correctly?

I'd be happy to poke around for driver code or protocol specifications, but please give me some details about what to ask for exactly. Would such extensions exist in userland binaries, the Linux kernel, the btrfs kernel module or even in bootloader code (which is U-Boot, also replaced with a newer version while installing Debian)?

I think it's also possible, that Zyxel never implemented any extension but bought them with the SoC from Marvell. In that case it would probably be better to poke the latter?

Forza-tng commented 2 months ago

@dehnhard if you use plain Debian tools, then it couldn't be anything to do with the SoC, unless the tools and kernel from Debian are custom Zyxel made?

@kdave is it tested to work to do send|receive from a 64bit system to a 32bit one?

rdrnau commented 1 month ago

I am experiencing the same issue in my custom Buildroot installation on an older Linksys DNS-320 with a Marvell ARMv5 Xscale processor, which is the same model used in the Zyxel NSA325v2.

The backup process works correctly when using the Buildroot-supported Btrfs-progs version 5.16. But when attempting to use any newer Btrfs-progs versions, such as 6.6.2, the reported issue appears.

dehnhard commented 1 month ago

I repeat, there is nothing Zyxel made involved in my installation. I imagine that any possible modification which could affect an SSH stream must be very low level. But @rdrnau s observation, that an older version did work for him points to a possible regression in btrfs-progs, doesn't it?

Forza-tng commented 1 month ago

I think we can rule out custom modifications by Zyxel, etc.

Shame on me 🙄, but I dumped the send.h together with the he error code from the original post into ChatGPT...

The suggestion is that there is an alignment issue in how the stream is serialized on 64bit send side compared to how it is handled on the 32bit side:

Example: Sending side: tlv_type =16

| tlv_type (2 bytes) | tlv_len (2 bytes) |
|        0x0010      |       0x0002      |

Could become on the receiving side:

| Misaligned Read (4 bytes)                |
| 0x10 0x00 0x02 0x00 (0x0002 0x0000) => 816 |

There also seems to be a mismatch on the tlv_len datatype.

send.h

struct btrfs_tlv_header {
    __le16 tlv_type;
    /* len excluding the header */
    __le16 tlv_len;
} __attribute__ ((__packed__));

send-stream.c

struct btrfs_send_attribute {
    u16 tlv_type;
    /*
     * Note: in btrfs_tlv_header, this is __le16, but we need 32 bits for
     * attributes with file data as of version 2 of the send stream format.
     */
    u32 tlv_len;
    char *data;
};
kdave commented 1 month ago

@kdave is it tested to work to do send|receive from a 64bit system to a 32bit one?

This should work but such combinations are rare and not tested, typically it would be the same 32/32 or 64/64. The observation with the type mismatches looks plausible. Last change was to the v2 protocol update, something could have gone wrong there.

kdave commented 1 month ago

Most likely cf269aa47bce1fb9902738eace0afa7bca8f6d1b + aa1ca3789e9ba8acbfe6bd31aacef5d1f38fa942 added in v5.19, though I still don't see how exactly it could happen.

dehnhard commented 1 month ago

aa1ca37

The direct cast to __le16 instead of first casting to btrfs_tlv_header looks suspicious to me. Might not work in this case because of 32 vs 64bit or it might be an endian issue. AFAIK 32bit ARM CPUs normally are little-endian but can be switched to big-endian with the twist that they use a different big-endian mode than modern ARM64 (BE-32 instead of BE-8). Don't know how C compilers address this though.

kdave commented 1 month ago

That might be it. From the commit:

tlv_type = le16_to_cpu(*(__le16 *)data);

this is (now) obviously wrong. The leXX_to_cpu marcros take the value and swap bytes if needed, but the value here is read by compiler in the host order taht may not be LE. The right thing is get_unaligned_le16, that takes the raw bytes and does the conversion.

kdave commented 1 month ago

Can you please test this (also availabe in branch ci/fix-receive-be):

--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -183,7 +183,7 @@ static int read_cmd(struct btrfs_send_stream *sctx)
                        ret = -EINVAL;
                        goto out;
                }
-               tlv_type = le16_to_cpu(*(__le16 *)data);
+               tlv_type = get_unaligned_le16(data);

                if (tlv_type == 0 || tlv_type > __BTRFS_SEND_A_MAX) {
                        error("invalid tlv in cmd tlv_type = %hu", tlv_type);
@@ -204,7 +204,7 @@ static int read_cmd(struct btrfs_send_stream *sctx)
                                ret = -EINVAL;
                                goto out;
                        }
-                       send_attr->tlv_len = le16_to_cpu(*(__le16 *)data);
+                       send_attr->tlv_len = get_unaligned_le16(data);
                        pos += sizeof(__le16);
                        data += sizeof(__le16);
                }
kdave commented 1 month ago

I still am not sure if I understand what's going on. If this was a simple endianity bug then the value 816 would give some hint. It's 0x330, so byteswap 0x3003 = 12291, and that's not a valid type either. The error appears after utime stream command, which is type 20 = 0x14. The v2 extension to increase length to u32 is only for data. Utime, chmod, chown are all in the v1 of protocol so either the bug has always been there and LE->BE never worked, or there's another 64/32 bit issue. Could be also partially caused on the kernel side, we've had issues with the 64/32 bit compatibility.

dehnhard commented 1 month ago

Can you please test this (also availabe in branch ci/fix-receive-be):

If no one else does it first, I can do that. No promises though when I'll find the time between private obligations and other hobbies.

OTOH if anybody is interested in paying a freelancer I can offer to analyse the 64/32 bit and data type compatibility situation that you are facing here. I have done similar work before.

rdrnau commented 1 month ago

It was easy to reproduce the bug caused by reading a 16-bit word from an uneven address, resulting in byte flipping on the ARMv5 and causing an error. I have compiled the 'ci/fix-receive-be' branch and have made preliminary tests. The error message 'ERROR: invalid tlv in cmd tlv_type = 816' no longer appears when using the 'ci/fix-receive-be' branch with both protocols.

I compiled a static binary using buildroot cross-compiling and am currently creating a test backup for later comparison. Do you recommend running additional tests ?

rdrnau commented 1 month ago

A 360gb backup was perfect using patched branch. But how about this: In libbtrfs/send.h the following struct is defined

struct btrfs_cmd_header { / len excluding the header / le32 len; le16 cmd; / crc including the header with zero crc field / le32 crc; } attribute ((packed__));

In send-stream.c line 143, there is the line 'crc = le32_to_cpu(sctx->cmd_hdr->crc);'

The crc field starts at offset 6, which is not a multiple of 4. This means it is not aligned to a 4-byte boundary. Processors like ARMv5 require 32-bit integers to be 4-byte aligned. Should the line be 'crc = get_unaligned_le32(sctx->cmd_hdr->crc);'

rdrnau commented 1 month ago

There is more. line 144 'sctx->cmd_hdr->crc = 0;' should be 'put_unaligned_le32(0, sctx->cmd_hdr->crc)'

in common/send-stream.c line 163 'crc = le32_to_cpu(cmd_hdr->crc);', line 165 'cmd_hdr->crc = 0;'

kdave commented 1 month ago

Thank you very much for testing. Agreed about the alignment and that it's basically everywhere, alle the le32_to_cpu that's indirectly from a pointer cast to btrfs_cmd_header from raw buffer.

kdave commented 1 month ago

AFAIK the strict alignment architectures can work in several modes, where an unaligned read is a hard error, or an exception (fixup mode?) that can be handled by kernel that emulates the read/write instructions.

rdrnau commented 1 month ago

AFAIK the strict alignment architectures can work in several modes, where an unaligned read is a hard error, or an exception (fixup mode?)

An unaligned access mode was introduced in ARMv6, but it is commonly not used. In ARMv7, the mode was improved. This issue may concern Rapsberry Pi 1 & Zero users, that are based on ARMv6.

kdave commented 1 month ago

I've pushed update to the branch ci/fix-receive-be, also with fix to libbtrfs (still incomplete, version number needs to be bumped too).

rdrnau commented 1 month ago

Integers in btrfs_dir_item are byte aligned since 'struct btrfs_disk_key' it contains is 17 bytes long. libbtrfsutil/subvolume.c: line 610: name_len = le16_to_cpu(dir->name_len); - should it be ... get_unaligned_le16(dir->name_len);

cmds/rescue-chunk-recover.c: ??? u8 expected_csum[BTRFS_CSUM_SIZE]; ... put_unaligned_le32(tree_csum, expected_csum); - I guess it works with crc32c

image/image-restore.c: line 1025: item = &cluster->items[i]; bufsize = le32_to_cpu(item->size); - should it be ... get_unaligned_le32(item->size); item_bytenr = le64_to_cpu(item->bytenr); - should it be ... get_unaligned_le64(item->bytenr);

line 1131, 1133, 1139, 1145, 1151, 1168, 1314, 1315

Both 'item->size' and 'item->bytenr' are byte aligned, evident by these struct's: struct meta_cluster_header { le64 magic; le64 bytenr; le32 nritems; u8 compress; } attribute ((packed__));

/ cluster header + index items + buffers / struct meta_cluster { struct meta_cluster_header header; struct meta_cluster_item items[]; } attribute ((packed));

line 1201: mdres->devid = le64_to_cpu(super->dev_item.devid); - dev_item is at an uneven byte address in struct btrfs_super_block line 1286

Give it a big eyeball :)

rdrnau commented 1 month ago

The performance of Btrfs can be significantly improved on hardware that requires software fallback to handle word alignment corrections. Specifically, when get_unaligned_le64 is called since it forces software handling, which on ARMv5 compiles into a sequence of 16 instructions instead of a single load instruction. By changing the stack accessing to unaligned in a patch you made there is a speed regression on the armel ARMv5 and similar platforms by making every read significantly slower. I patched my kernel years ago for speeding it up. And it works.

I propose the following patch for Btrfs-progs. Additionally, I recommend implementing a similar patch for the Btrfs filesystem itself. This optimization will significantly enhance the performance for users of ARMv5 based NAS servers, possibly Raspberry Pi 1 and other similar hardware platforms.

--- a/libbtrfs/ctree.h    2024-05-24 01:12:51.506765058 +0100
+++ b/libbtrfs/ctree.h   2024-05-26 16:35:52.476518069 +0100
@@ -1615,7 +1615,16 @@
 static inline u##bits btrfs_##name(const struct extent_buffer *eb)     \
 {                                                                      \
        const struct btrfs_header *h = (struct btrfs_header *)eb->data; \
-       return le##bits##_to_cpu(h->member);                            \
+       switch (offsetof(type, member) & 7)                             \
+       {                                                               \
+       case 0:                                                         \
+               return le##bits##_to_cpu(h->member);                    \
+       case 4:                                                         \
+               if (bits <= 32) return le##bits##_to_cpu(h->member);    \
+               return get_unaligned_le##bits(&h->member);              \
+       default:                                                        \
+               return get_unaligned_le##bits(&h->member);              \
+       }                                                               \
 }                                                                      \
 static inline void btrfs_set_##name(struct extent_buffer *eb,          \
                                    u##bits val)                        \
@@ -1643,11 +1652,27 @@
 #define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits)             \
 static inline u##bits btrfs_##name(const type *s)                      \
 {                                                                      \
-       return le##bits##_to_cpu(s->member);                            \
+       switch (offsetof(type, member) & 7)                             \
+       {                                                               \
+               case 0:                                                 \
+                       return le##bits##_to_cpu(s->member);            \
+               case 4:                                                 \
+                       if (bits <= 32)                                 \
+                               return le##bits##_to_cpu(s->member);    \
+                       return get_unaligned_le##bits(&s->member);      \
+               default:                                                \
+                   return get_unaligned_le##bits(&s->member);          \
+       }                                                               \
 }                                                                      \
 static inline void btrfs_set_##name(type *s, u##bits val)              \
 {                                                                      \
-       s->member = cpu_to_le##bits(val);                               \
+       switch (offsetof(type, member) & 7)                             \
+       {                                                               \
+               case 0:                                                 \
+                       s->member = cpu_to_le##bits(val);               \
+                       break;                                          \
+               default:                                                \
+                       put_unaligned_le##bits(val, &s->member);        \
+       }                                                               \
 }

 BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
kdave commented 1 month ago

I'm afraid the above won't work, the offsetof(type, member) value is computed at compile time (the offset of the member does not change), however the structure is in general mapped over a raw buffer where the bytes representing the structure are not properly aligned.

kdave commented 1 month ago

I think switch ((s + offsetof(type, member)) & 7 could work, but I currently don't have a way to verify that it's correct (assembly and that test's work). I understand the workardound with the stack value is making performance worse, that should be fixed. The access to unaligned method is what kernel does (https://elixir.bootlin.com/linux/latest/source/include/asm-generic/unaligned.h) ie. forcing a packed structure and compiler understands it's potentially unaligned and deals with that.

If this preserves the performance on arm5 I'd like to stick to it (both libs).

kdave commented 1 month ago

I have a rasperrypi 1 set up now so I can do some tests.

kdave commented 1 month ago

The build is sluggish (CPU overclocked to 1000MHz but still), it may take an hour before doing one test round so I'll give it a few tries but RPi 1 does not seem to be worth the time.

Forza-tng commented 1 month ago

The build is sluggish (CPU overclocked to 1000MHz but still), it may take an hour before doing one test round so I'll give it a few tries but RPi 1 does not seem to be worth the time.

Perhaps qemu emulation is faster. Cannot imagine that an rpi is going anywhere fast with compiling :/

rdrnau commented 1 month ago

I doubt if the patch will be effective since packed structs enforce 'unaligned safe reads' for all their members, regardless of alignment. Perhaps modifying the patch in this manner would be more effective:

#define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits)          \
static inline u##bits btrfs_##name(const type *s)                   \
{                                                                   \
    const u8 *sm = __builtin_assume_aligned(&s->member, bits/8);    \
    switch (offsetof(type, member) & 7) {                           \
    case 0:                                                         \
        return le##bits##_to_cpu(*(const u##bits *)sm);             \
rdrnau commented 1 month ago

I have revised the patch and measured the difference in the generated binaries by counting the number of load byte instructions. The difference was minimal. When I did a similar patch for the same macros in the Linux kernel, the size of the Btrfs module was reduced by about 2% in size. Nonetheless, here is the modified patch:

diff -uNr btrfs-progs/libbtrfs/ctree.h btrfs-progs-patched/libbtrfs/ctree.h
--- btrfs-progs/libbtrfs/ctree.h    2024-05-28 23:00:18.031368257 +0100
+++ btrfs-progs-patched/libbtrfs/ctree.h    2024-05-29 00:21:26.758389956 +0100
@@ -1614,14 +1614,33 @@
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)        \
 static inline u##bits btrfs_##name(const struct extent_buffer *eb) \
 {                                  \
-   const struct btrfs_header *h = (struct btrfs_header *)eb->data; \
-   return get_unaligned_le##bits(&h->member);          \
+   const struct btrfs_header *h =                  \
+       (const struct btrfs_header *)eb->data;          \
+   const u8 *hm = (const u8 *)__builtin_assume_aligned(&h->member, \
+       bits/8);                        \
+   switch (offsetof(type, member) & 7) {               \
+   case 0:                             \
+       return le##bits##_to_cpu(*(const u##bits *)hm);     \
+   case 4:                             \
+       if (bits <= 32)                     \
+           return le##bits##_to_cpu(*(const u##bits *)hm); \
+       return get_unaligned_le##bits(hm);          \
+   default:                            \
+       return get_unaligned_le##bits(hm);          \
+   }                               \
 }                                  \
 static inline void btrfs_set_##name(struct extent_buffer *eb,      \
                    u##bits val)            \
 {                                  \
    struct btrfs_header *h = (struct btrfs_header *)eb->data;   \
-   h->member = cpu_to_le##bits(val);               \
+   u8 *hm = __builtin_assume_aligned(&h->member, bits/8);      \
+   switch (offsetof(type, member) & 7) {               \
+   case 0:                             \
+       *(u##bits*)hm = cpu_to_le##bits(val);           \
+       break;                          \
+   default:                            \
+       put_unaligned_le##bits(val, hm);            \
+   }                               \
 }

 #define BTRFS_SETGET_FUNCS(name, type, member, bits)           \
@@ -1643,11 +1662,28 @@
 #define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits)     \
 static inline u##bits btrfs_##name(const type *s)          \
 {                                  \
-   return get_unaligned_le##bits(&s->member);          \
+   const u8 *sm = __builtin_assume_aligned(&s->member, bits/8);    \
+        switch (offsetof(type, member) & 7) {              \
+   case 0:                             \
+       return le##bits##_to_cpu(*(const u##bits *)sm);     \
+   case 4:                             \
+       if (bits <= 32)                     \
+           return le##bits##_to_cpu(*(const u##bits *)sm); \
+       return get_unaligned_le##bits(&s->member);      \
+   default:                            \
+       return get_unaligned_le##bits(&s->member);      \
+   }                               \
 }                                  \
 static inline void btrfs_set_##name(type *s, u##bits val)      \
 {                                  \
-   s->member = cpu_to_le##bits(val);               \
+   u8 *sm = __builtin_assume_aligned(&s->member, bits/8);      \
+   switch (offsetof(type, member) & 7) {                           \
+   case 0:                             \
+       *(u##bits *)sm = cpu_to_le##bits(val);          \
+       break;                          \
+   default:                            \
+       put_unaligned_le##bits(val, &s->member);        \
+   }                               \
 }

 BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
kdave commented 1 month ago

The counter example I had in mind (and haven't verified regarding the last diff you sent) is like this:

#include <stdint.h>                                                                                                                                                                                                  

typedef uint64_t u64;
typedef uint32_t u32;

struct btrfs_struct {
        u32 x;
        u64 ull;
} __attribute__((__packed__));

// Assume BTRFS_SETGET_FUNCS-like helper, and that we have the
// get_unaligned_leXX helpers
#define read_member(name, type, member, bits)           \
static inline u##bits btrfs_##name(const type *s) {     \
        return get_unaligned_le##bits(&s->member);      \
}

read_member(struct_ull, struct btrfs_struct, ull, 64);
read_member(struct_x,   struct btrfs_struct,   x, 32);

int main() {
        char buffer[4096];

        fill_with_data(buffer);
        for (int i = 0; i < 128; i++) {
                u64 tmp;

                // This must work for any offset into the input buffer
                tmp = btrfs_struct_ull(buffer + i);
        }
        return 0;
}

Ie. there's a buffer where structs (that are packed, no natural alignment guaranteed) are put one after another without any space or alignment gaps. If the struct does not start aligned, the member is not aligned either, even if offsetof(type, member) would be aligned.

This is not about compile time alignment checks, at runtime we get some data and have to be sure the u64 is read properly without prior assumptions about the alignment.

If you measured 2% reduction of size this probably means that some of the checks that verify that the buffer is or is not aligned before reading eg. u64 have been dropped. IOW, this can lead to an unaligned read, when eg. a u64 is at offset 1 from the buffer.

kdave commented 1 month ago

Or maybe an even simpler example, without struct packing etc: struct btrfs_ull { u64 ull; }; and then reading ull must work when reading it from any offset of buffer. The test offsetof(btrfs_ull, ull) would be always 0, assuming proper alignment, but reading it from offsets 1..7 from buffer will be unaligned.

kdave commented 1 month ago

I'm able to catch some alignment problems with -fsanitize=alignment. I've changed the unaligned access from temporary data to the packed struct cast, kernel is using it and I believe we can't do much better (namely to get a compiler-only based solution).

kdave commented 1 month ago

Regarding problems pointed out in https://github.com/kdave/btrfs-progs/issues/770#issuecomment-2131286964 they're not forgotten.

kdave commented 1 month ago

Tests now pass the sanitizer+alignment test, this is still reported only for code that run during the tests but the coverage is sufficient (send/receive, image, search ioctl).

rdrnau commented 1 month ago

@kdave, I am looking at your correcting "cmds/inspect.c" and I might be looking at a contradiction. Example: chunk->sub_stripes = get_unaligned_le16(&item->sub_stripes); If 'item->sub_stripes' is unaligned, then 'chunk->sub_stripes' is also unaligned. In case they are both unaligned, the code should be. put_unaligned_16(get_unaligned_le16(&item->sub_stripes), &chunk->sub_stripes); But they are not unaligned, I have not seen any evidence of this from looking at the structs that chunk and item contain. Because at the start the function, this struct btrfs_tree_search_args args; of 64kb is initialized. Then the 'args' is by an ioctl filled with data from an file descriptor. Meaning, the content from file becomes aligned to this initialized struct. And item, being a struct within that 'args' is aligned just like chunk which is aligned by a memory allocation.

It is more suitable to use direct assignment for readability and code efficiency. Could you please verify if my understanding is correct, or if there's another reason for using unaligned access functions here?

kdave commented 4 weeks ago

If you mean code (around line 1300):

                chunk = &(*chunks)[*num_chunks];
                chunk->offset = sh.offset;
                chunk->length = get_unaligned_le64(&item->length);
                chunk->stripe_len = get_unaligned_le64(&item->stripe_len);
                chunk->type = get_unaligned_le64(&item->type);
                chunk->num_stripes = get_unaligned_le16(&item->num_stripes);
                chunk->sub_stripes = get_unaligned_le16(&item->sub_stripes);

Then it's not unaligned write to chunk-> members. It's an array of struct chunk allocated a few lines above (realloc()) and can be assumed to be aligned. The item comes from the search buffer and is considered unaligned. Even if args is on stack and thus aligned, what is in args->data is a sequence of struct btrfs_ioctl_search_header + some btrfs item (packed sctuctures in general).

It is possible that some combination of stack arguments and search buffer is aligned, but it's not easy to reason about and is not obvious from the code why it's safe. If this gets copied as a safe pattern then we'd have more unaligned access spread.