llvm / llvm-project


Missing optimization opportunity: non-power-of-two integer loading with fewer movs #40561

Open · Kojoley opened 5 years ago

Kojoley commented 5 years ago
Bugzilla Link: 41216
Version: trunk
OS: Windows NT
CC: @RKSimon, @zygoloid, @rotateright

Extended Description

Currently, loads of non-power-of-two-sized integers are done byte-by-byte:

#include <cstdint>

// could be two loads instead of three: (u16 << 8) | u8
std::uint32_t foo_24(unsigned char const* p)
{
    return static_cast<std::uint32_t>(p[0])
        | (static_cast<std::uint32_t>(p[1]) << 8)
        | (static_cast<std::uint32_t>(p[2]) << 16)
        ;
}

// could be two loads instead of five: (u32 << 8) | u8
std::uint64_t foo_40(unsigned char const* p)
{
    return static_cast<std::uint64_t>(p[0])
        | (static_cast<std::uint64_t>(p[1]) << 8)
        | (static_cast<std::uint64_t>(p[2]) << 16)
        | (static_cast<std::uint64_t>(p[3]) << 24)
        | (static_cast<std::uint64_t>(p[4]) << 32)
        ;
}

// could be two loads instead of six: (u32 << 16) | u16
std::uint64_t foo_48(unsigned char const* p)
{
    return static_cast<std::uint64_t>(p[0])
        | (static_cast<std::uint64_t>(p[1]) << 8)
        | (static_cast<std::uint64_t>(p[2]) << 16)
        | (static_cast<std::uint64_t>(p[3]) << 24)
        | (static_cast<std::uint64_t>(p[4]) << 32)
        | (static_cast<std::uint64_t>(p[5]) << 40)
        ;
}

// could be three loads instead of seven: (u32 << 24) | (u16 << 8) | u8
std::uint64_t foo_56(unsigned char const* p)
{
    return static_cast<std::uint64_t>(p[0])
        | (static_cast<std::uint64_t>(p[1]) << 8)
        | (static_cast<std::uint64_t>(p[2]) << 16)
        | (static_cast<std::uint64_t>(p[3]) << 24)
        | (static_cast<std::uint64_t>(p[4]) << 32)
        | (static_cast<std::uint64_t>(p[5]) << 40)
        | (static_cast<std::uint64_t>(p[6]) << 48)
        ;
}

https://godbolt.org/z/Re7dWL
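
For reference, here is a sketch of the hand-combined form the report is asking the compiler to produce for foo_24. The name foo_24_combined and the use of memcpy are illustrative, not from the original report, and it assumes a little-endian target, which is what the shift pattern above encodes:

#include <cstdint>
#include <cstring>

// One u16 load plus one u8 load instead of three byte loads:
// result = ((u16 at p+1) << 8) | (u8 at p).
std::uint32_t foo_24_combined(unsigned char const* p)
{
    std::uint8_t  lo;
    std::uint16_t hi;
    std::memcpy(&lo, p, sizeof lo);      // byte 0
    std::memcpy(&hi, p + 1, sizeof hi);  // bytes 1-2 (unaligned-safe)
    return static_cast<std::uint32_t>(lo)
        | (static_cast<std::uint32_t>(hi) << 8);
}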

GCC produces better code (though it currently optimizes only 32-bit loads; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89809).

RKSimon commented 4 years ago

Since 10.0, foo_24 and foo_40 now optimally combine their loads. foo_48 and foo_56 still combine only the lower i32 loads; the remaining upper bytes are still loaded separately.
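
For the still-unhandled foo_56 case, the target pattern matches the comment in the original source: three loads, a u8 at p, a u16 at p+1 shifted by 8, and a u32 at p+3 shifted by 24. The sketch below (the name foo_56_combined, the use of memcpy, and the little-endian assumption are mine, not part of the comment) spells that out:

#include <cstdint>
#include <cstring>

// Three loads instead of seven byte loads (little-endian layout):
// bits 0-7 from p[0], bits 8-23 from p[1..2], bits 24-55 from p[3..6].
std::uint64_t foo_56_combined(unsigned char const* p)
{
    std::uint8_t  b0;
    std::uint16_t w1;
    std::uint32_t d3;
    std::memcpy(&b0, p,     sizeof b0);
    std::memcpy(&w1, p + 1, sizeof w1);
    std::memcpy(&d3, p + 3, sizeof d3);
    return static_cast<std::uint64_t>(b0)
        | (static_cast<std::uint64_t>(w1) << 8)
        | (static_cast<std::uint64_t>(d3) << 24);
}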

RKSimon commented 5 years ago

DAGCombiner::MatchLoadCombine should probably be able to handle partial loads like these.