milkytracker / MilkyTracker

An FT2 compatible music tracker
http://milkytracker.github.io/
Other
1.69k stars 162 forks source link

powerpc: don't use asm optimisation with clang #210

Closed julianaito closed 4 years ago

julianaito commented 4 years ago

OpenBSD and FreeBSD use clang to build MilkyTracker, and this compiler does not like the asm optimisations used for ppc32/64.

On OpenBSD/macppc (but it seems to be the same on FreeBSD/powerpc), this generates these errors:

conv.c:78:51: error: invalid operand for instruction
DEFINE_CLIPCONVERT_POWERPC(s8,f32, -128.0, 127.0, LFSUX, LBZ_STBUX)
                                                  ^
conv.c:65:16: note: expanded from macro 'LFSUX'
#define LFSUX   "1:     lfsux f0,%1,%6          \n"
                 ^
<inline asm>:4:10: note: instantiated into assembly here
1:      lfsux f0,5,6            
              ^
conv.c:78:1: error: invalid operand for instruction
DEFINE_CLIPCONVERT_POWERPC(s8,f32, -128.0, 127.0, LFSUX, LBZ_STBUX)
^
conv.c:48:4: note: expanded from macro 'DEFINE_CLIPCONVERT_POWERPC'
                "       fsub f1,f0,%3           \n"                     \
                 ^
<inline asm>:5:7: note: instantiated into assembly here
        fsub f1,f0,2        
(and more)    

Using -fno-integrated-as did not solve the issue, it emits relocations errors. I can't test every platform using clang on ppc32/64 so maybe it would need further OS-filtering, but i wanted more opinions before eventually polishing that PR, or using some different code if my proposal is considered a bad idea :)

CC'ing @pkubaj if he's interested.

pkubaj commented 4 years ago

Building with GCC9 also fails (FreeBSD 12.1-RELEASE on powerpc64, base compiler is GCC 4.2.1):

/usr/local/bin/g++9   -I/wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui/osinterface/posix -I/usr/local/include/SDL2 -I/wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui -I/wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui/osinterface -O2 -pipe  -fstack-protector-strong -Wl,-rpath=/usr/local/lib/gcc9  -Wl,-rpath=/usr/local/lib/gcc9 -O2 -pipe  -fstack-protector-strong -Wl,-rpath=/usr/local/lib/gcc9  -Wl,-rpath=/usr/local/lib/gcc9 -MD -MT src/ppui/CMakeFiles/ppui.dir/Graphics_15BIT.cpp.o -MF src/ppui/CMakeFiles/ppui.dir/Graphics_15BIT.cpp.o.d -o src/ppui/CMakeFiles/ppui.dir/Graphics_15BIT.cpp.o -c /wrkdirs/usr/ports/audio/milkytracker/work/MilkyTracker-1.02.00/src/ppui/Graphics_15BIT.cpp
{standard input}: Assembler messages:
{standard input}:1802: Error: unsupported relocation against r9
{standard input}:1803: Error: unsupported relocation against r10
{standard input}:1803: Error: unsupported relocation against r5
{standard input}:1805: Error: unsupported relocation against r10
{standard input}:1805: Error: unsupported relocation against r9
{standard input}:1809: Error: unsupported relocation against r4
{standard input}:1809: Error: unsupported relocation against r3
{standard input}:1810: Error: unsupported relocation against r4
{standard input}:1810: Error: unsupported relocation against r3
{standard input}:1811: Error: unsupported relocation against r4
{standard input}:1811: Error: unsupported relocation against r3
{standard input}:1812: Error: unsupported relocation against r4
{standard input}:1812: Error: unsupported relocation against r3
{standard input}:1813: Error: unsupported relocation against r10
{standard input}:1813: Error: unsupported relocation against r10
{standard input}:1814: Error: unsupported relocation against r3
{standard input}:1814: Error: unsupported relocation against r3
{standard input}:1815: Error: unsupported relocation against r10
{standard input}:1815: Error: unsupported relocation against r9
{standard input}:1817: Error: unsupported relocation against r11
{standard input}:1817: Error: unsupported relocation against r5
{standard input}:1819: Error: unsupported relocation against r11
{standard input}:1819: Error: unsupported relocation against r9
{standard input}:1822: Error: unsupported relocation against r4
{standard input}:1822: Error: unsupported relocation against r3
{standard input}:1823: Error: unsupported relocation against r11
{standard input}:1823: Error: unsupported relocation against r11
{standard input}:1824: Error: unsupported relocation against r3
{standard input}:1824: Error: unsupported relocation against r3
{standard input}:1825: Error: unsupported relocation against r11
{standard input}:1825: Error: unsupported relocation against r9
julianaito commented 4 years ago

Building with GCC9 also fails (FreeBSD 12.1-RELEASE on powerpc64, base compiler is GCC 4.2.1):

Oh, it used to build with GCC-8.3 on OpenBSD/macppc.

It looks like we're going straight into an #ifdef hell the way i'm proposing this :|

kernigh commented 4 years ago

@julianaito: GCC-8.3 on OpenBSD doesn't define __ppc__, so it would have skipped the asm and used the C alternative. Clang does define __ppc__.

The asm is in the wrong syntax. li r9, 0 is Darwin syntax, but BSD and Linux use ELF syntax: li %r9, 0 or li 9, 0. PowerPC Darwin (Mac OS X) became obsolete in 2009, so we might want to delete the asm and just use the C code on PowerPC. If so, we might also delete __attribute__((noinline)).

Deltafire commented 4 years ago

I'm in favour of removing these asm blocks and relying on the compiler to generate any optimisations. Could someone test that there are no performance bottlenecks if we do so?

kernigh commented 4 years ago

@Deltafire: I made a simple benchmark of fill_dword() and fill_dword_vertical() for my old G3 (PowerPC 750 at 400 MHz) running OpenBSD/macppc. The C code looks about as fast as the asm code.

I needed to edit the asm code. I added some % signs for ELF syntax. Then the asm crashed because clang optimized away the function arguments. I moved the asm code to a label, and constrained bl label to require the arguments. I also tried to realign the loops in fill_dword() by deleting the first 2 nops.

gcc 8.3.0 -O2 with len = 100, pitch = 32:

$ ./timefill-n 100 32 
USE_ASM? no
fill_dword 1: 0.000007931 s
fill_dword 2: 0.000002884 s
fill_dword 3: 0.000002884 s
fdvertical 4: 0.000002443 s
fdvertical 5: 0.000002564 s
fdvertical 6: 0.000002723 s
buff is ok
$ ./timefill-y 100 32 
USE_ASM? yes
fill_dword 1: 0.000007450 s
fill_dword 2: 0.000003245 s
fill_dword 3: 0.000003005 s
fdvertical 4: 0.000002564 s
fdvertical 5: 0.000002443 s
fdvertical 6: 0.000002563 s
buff is ok

clang 8.0.1 -O2 with len = 100, pitch 32:

$ ./timefill-n 100 32 
USE_ASM? no
fill_dword 1: 0.000005807 s
fill_dword 2: 0.000005928 s
fill_dword 3: 0.000003886 s
fdvertical 4: 0.000002444 s
fdvertical 5: 0.000002524 s
fdvertical 6: 0.000002443 s
buff is ok
$ ./timefill-y 100 32 
USE_ASM? yes
fill_dword 1: 0.000007009 s
fill_dword 2: 0.000002924 s
fill_dword 3: 0.000002924 s
fdvertical 4: 0.000002604 s
fdvertical 5: 0.000002564 s
fdvertical 6: 0.000002444 s
buff is ok

The compilers and the asm use different optimizations. Both gcc and clang use mtctr/bdnz loops, and clang also uses stwu, but the asm never uses bdnz nor stwu. clang also unrolls fill_dword() to prepend a 3rd loop (of 8 stores). The asm aligns the loops to 16 bytes, but gcc and clang don't align the loops.

I removed __attribute__((noinline)) from the C code, but I didn't inline the asm code. gcc inlined fill_dword_vertical(). clang inlined both fill_dword() and fill_dword_vertical().

Deltafire commented 4 years ago

I removed the PPC assembly code and replaced with the library function memset(), which should already by optimised for the target platforms. See commit ba5699e.

kernigh commented 4 years ago

@Deltafire, does memset() really work? As I understood it, fill_dword() fills in 32-bit values, but memset() fills in 8-bit values.