steve-chamberlin / fc8-compression

A simple LZ77-based compression with fast decompression on 68K and other legacy hardware
37 stars 4 forks source link

M68K compatible #1

Open leuat opened 4 years ago

leuat commented 4 years ago

Hi there! I'm the creator of TRSE (www.turborascal.com), and I'm currently in the process of adding automatic packing + extraction of M68K resources (for the Amiga + Atari ST 520). I'm trying to get your decompression code to work on the M68K, but I'm having some problems, and was wondering if you would be able to help me

if we mange to get this work, this will be the de-facto way of compressing resources in TRSE, and you'll naturally be credited in the TRSE credits

I've compared the (compressed) files to the compressed output from fc8, and since the files are identical - the problem must reside with the decompression method. I've tried changing it to M68000 compatibility, but can't make it work. I'm suspecting that by (slow) byte copying method is incorrect, or perhaps the " lea _FC8_LENGTH_DECODE_LUT(pc),a5" that I'm not really getting. Any help would be very .. helpful!

Btw the only major things that I've changed are marked with "CHANGES HERE"

Nicolaas

Here's what I've tried to modify :

procedure Decompress(in, out : global pointer of integer; outsize : global long);
begin
;//#pragma parameter __D0 fc8_decode(__A0, __A1, __D1)
asm("
;asm unsigned short fc8_decode(unsigned char* in, unsigned char* out, unsigned long outsize)
;   machine 68020
FC8_HEADER_SIZE EQU 8
FC8_DECODED_SIZE_OFFSET  EQU 4
    movem.l d1-d7/a0-a6,-(sp)
    move.l compression_in,a0    
    move.l compression_out,a1   
    move.l compression_outsize,d1   
    move.l #32000,d1    

    bra     _Init_Decode

;// lookup table for decoding the copy length-1 parameter
_FC8_LENGTH_DECODE_LUT:
    dc.b    2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
    dc.b    18,19,20,21,22,23,24,25,26,27,28,34,47,71,127,255

;//-------------------------------------------------------------------------------
;// _GetUINT32 - alignment independent reader for 32-bit integers
;// a0 = in
;// d6 = offset
;// d7 = result
;//-------------------------------------------------------------------------------
_GetUINT32:
    move.b  (a0,d6.w),d7
    asl.w   #8,d7
    move.b  1(a0,d6.w),d7
    swap    d7
    move.b  2(a0,d6.w),d7
    asl.w   #8,d7
    move.b  3(a0,d6.w),d7
    rts

_Init_Decode:
    ;// check magic ID
    cmp.b   #'F',(a0)
    bne _fail
    cmp.b   #'C',1(a0)
    bne _fail
    cmp.b   #'8',2(a0)
    bne _fail
    cmp.b   #'_',3(a0)
    bne _fail

    ;// check decoded size - enough space in the output buffer?
    moveq   #FC8_DECODED_SIZE_OFFSET,d6
    bsr.s   _GetUINT32
    cmp.l   d7,d1               
    bhi _fail

    ;// advance a0 to start of compressed data
    lea     FC8_HEADER_SIZE(a0),a0

    ;// a5 = base of length decode lookup table
    lea     _FC8_LENGTH_DECODE_LUT(pc),a5
    ;lea        -110(pc),a5 ;// fix PC offset manually after assembly

    ;// helpful constants
    move.l  #$0000001F,d1
    move.l  #$00000007,d2
    move.l  #$00000001,d3
    move.l  #$0000003F,d4       ;// Main decompression loop
_mainloop:
    move.b  (a0)+,d6            ;// d6 = next token
    bmi.s   _BR1_BR2            ;// BR1 and BR2 tokens have the high bit set

    btst    #6,d6               ;// check bit 6 to see if this is a BR0 or LIT token
    bne.s   _BR0

;// LIT 00aaaaaa - copy literal string of aaaaaa+1 bytes from input to output   
_LIT:                           
    move.l  a0,a4               ;// a4 = source ptr for copy
    and.w   d4,d6               ;// AND with 0x3F, d6 = length-1 word for copy
    lea     1(a0,d6.w),a0       ;// advance a0 to the end of the literal string
    bra.s   _copyLoop

;// BR0 01baaaaa - copy b+3 bytes from output backref aaaaa to output
_BR0:                               
    move.b  d6,d5
    and.l   d1,d5               ;// AND with 0x1F, d5 = offset for copy = (long)(t0 & 0x1F)
    beq     _done               ;// BR0 with 0 offset means EOF

    move.l  a1,a4
    sub.l   d5,a4               ;// a4 = source ptr for copy

    move.b  (a4)+,(a1)+     ;// copy 3 bytes, can't use move.w. or move.l because src and dest may overlap
    move.b  (a4)+,(a1)+
    move.b  (a4)+,(a1)+
    btst    #5,d6               ;// check b bit
    beq.s   _mainloop
    move.b  (a4)+,(a1)+     ;// copy 1 more byte
    bra.s   _mainloop

_BR1_BR2:
    btst    #6,d6               ;// check bit 6 to see if this is a BR1 or BR2 token
    bne.s   _BR2

;// BR1 10bbbaaa'aaaaaaaa - copy bbb+3 bytes from output backref aaa'aaaaaaaa to output
_BR1:                           
    move.b  d6,d5
    and.l   d2,d5               ;// AND with 0x07 
    lsl.l   #8,d5
    move.b  (a0)+,d5            ;// d5 = offset for copy = ((long)(t0 & 0x07) << 8) | t1
    lsr.b   #3,d6
    and.w   d2,d6               ;// AND with 0x07
    addq.w  #2,d6               ;// d6 = length-1 word for copy = ((word)(t0 >> 3) & 0x7) + 1
    bra.s   _copyBackref

;// BR2 11bbbbba'aaaaaaaa'aaaaaaaa - copy lookup_table[bbbbb] bytes from output backref a'aaaaaaaa'aaaaaaaa to output
_BR2:   
    move.b  d6,d5
    and.w   d3,d5               ;// AND with 0x01
    swap    d5
    move.w  (a0)+,d5            ;// d5 = offset for copy = ((long)(t0 & 0x01) << 16) | (t1 << 8) | t2
    lsr.b   #1,d6
    and.w   d1,d6               ;// AND with 0x1F
    move.b  (a5,d6.w),d6        ;// d6 = length-1 word for copy = ((word)(t0 >> 3) & 0x7) + 1
    ;// fall through to _copyBackref

    ;// copy data from a previous location in the output buffer
    ;// d5 = offset from current buffer position    
_copyBackref:
    move.l  a1,a4
    sub.l   d5,a4               ;// a4 = source ptr for copy
    cmpi.l  #4,d5   
    blt.s   _nearCopy           ;// must copy byte-by-byte if offset < 4, to avoid overlapping long copies

    ;// Partially unrolled block copy. Requires 68020 or better.
    ;// Uses move.l and move.w where possible, even though both source and dest may be unaligned.
    ;// It's still faster than multiple move.b instructions
    ;// d6 = length-1
_copyLoop:  
    cmpi.w  #16,d6
    bge.s   _copy17orMore   
    jmp _copyBytes_leuat

; CHANGES HERE  

_copyBytes_leuat:
    cmp.w #0,d6
    beq.s _mainloop

;     subi.w #1,d6 ;; Copy 1 byte less

_topp_leuat:
  move.b (a4)+,(a1)+
  dbf d6,_topp_leuat
  bra       _mainloop

; CHANGES HERE

_copy17orMore:
        move.l  (a4)+,(a1)+
        move.l  (a4)+,(a1)+
        move.l  (a4)+,(a1)+
        move.l  (a4)+,(a1)+
        subi.w  #16,d6
        cmpi.w  #16,d6
        bge.s   _copy17orMore   
        jmp _copyBytes_leuat

_nearCopy:  
        cmpi.l  #1,d5
        beq.s   _copyRLE    
_nearLoop:
        move.b  (a4)+,(a1)+
        dbf     d6,_nearLoop
        bra     _mainloop

_copyRLE:
        ;// assumes length is at least 3
        move.b  (a4),(a1)+      ;// copy first byte
        btst    #0,d6
        beq.s   _doRLE          ;// branch if copy length is odd (because d6 is length-1)
        move.b  (a4),(a1)+      ;// copy second byte
_doRLE:
        subq.w  #2,d6
        lsr.w   #1,d6           ;// length = (length-2) / 2
        move.w  (a4),d5 
_rleLoop:
        move.w  d5,(a1)+
        dbf     d6,_rleLoop
        bra     _mainloop

_fail:
    moveq.w #0,d0               ;// result = 0
    bra     _exit   

_done:
    moveq.w #1,d0               ;// result = 1

_exit:  
    movem.l (sp)+,d1-d7/a0-a6
    ");
end;
steve-chamberlin commented 4 years ago

The details of this code have faded from my mind since four years ago. Since this is just vanilla 68K code, I'd suggest running your modified version in an emulator like Easy68K, and stepping through it while it performs the decompression to find the point where the output is wrong. From a quick glance, I think you need to modify the code at _copy17orMore to use move.b instead of move.l, because the address might be unaligned and I don't think the 68000 supports unaligned memory access. There may also be some unaligned memory access happening in _doRLE and _rleLoop. For best performance on the 68000, you probably need to check if the source and destination addresses are aligned, copy a few bytes individually with move.b if necessary, and then copy the rest with aligned move.w or move.l copies.

steve-chamberlin commented 4 years ago

The lea _FC8_LENGTH_DECODE_LUT(pc),a5 is intended to set a5 to point to the beginning of _FC8_LENGTH_DECODE_LUT table, in a position-independent way relative to the PC. I seem to recall I had some difficulty getting this to work right, so I inspected the assembled code or ran it in an emulator, and then manually set the -110 offset to make it work. If you've modified any code that's before that lea statement, then you'll probably need to change that -110 to another number.

leuat commented 4 years ago

Thanks for the feedback! If I don't get it working, I'll probably just re-implement the whole thing in TRSE (kinda slow!) and then convert each part to assembly. I'm guessing that the crashes is due to the alignment, so I'll start there.

steve-chamberlin commented 4 years ago

As a starting point to get it working, I think you could use _nearLoop for all the copying, and delete _copyLoop and its related code as well as deleting _copyRLE, _doRLE, and _rleLoop. nearLoop is the most basic form of copying, and everything else is just intended to optimize specific cases. But I'm now noticing there might also be unaligned access in a few other places, like move.w (a0)+,d5 in BR2.

leuat commented 4 years ago

hehe I almost got it working, but not quite yet... check out this video: https://www.irio.co.uk/div/almost.mp4 , but the image still has artifacts. Yeah I don't really care about speed right now (usually, decompressing a single resource - like an image - won't have that much penalty in terms of speed)

leuat commented 4 years ago

ok I'll try to remove those other copying mechanisms

leuat commented 4 years ago

got it! Works like a charm! Check out https://www.irio.co.uk/div/almost.mp4 (when the first effect transitions, it reverts to the raw uncompressed data and scrambles) - but - thanks a bunch! Slow as %(#/%, but who cares.

I've added support in TRSE for an automatic "compression" flag, so the user will be able to include any resource as such: .. myData : incbin("myImage.bin") compressed; // this will automatically compress +include the data

and within each M68K system, I've added a "compression" library so you can simply do "Compression::Decompress(myData, uncompressed_storage);"

Since I'm using (a modified version) of both your C++ compression / m68K decompression code, I'll add you to the list of credits of TRSE (again, www.turborascal.com). Also, I'll retain the header you have in the .h file. Let me know if I should add anything else

regards

Nicolaas

steve-chamberlin commented 4 years ago

Great, glad to hear it works. Have you read this: https://www.bigmessowires.com/2016/05/06/fc8-faster-68k-decompression/

FC8 is a balance between compression density and decompression speed. If decompression speed isn't very important for your purposes, then tighter compression is definitely possible. See the "tradeoffs" section of the linked blog post.

steve-chamberlin commented 4 years ago

Could you also submit your 68000 changes as a patch or a separate file? I'm sure others would benefit.

leuat commented 4 years ago

yup well I'll have to rewrite the library first, add some comments and make it more accessible. Also, since it is written as a library within TRSE, there will be some hybrid stuff - Pascal mixed with assembler. For for the most part, it should be simple to copy-paste. And yes, most of the decompression will be of images and demo effect data - stuff that can be done while the music is playing + the raster irq can do some simple effects, so no problem if the load time is 2s instead of 1 for an image. Would therefore love to have higher compression rates, especially since I'd like to keep things in RAM (as opposed to loading resources from disk).

leuat commented 4 years ago

The M68K decompression library (for the Atari ST) in TRSE can be found here : https://github.com/leuat/TRSE/blob/master/Publish/tutorials/ATARI520ST/tru/compression.tru