kervinck closed this issue 4 years ago.
Draft and untested proposal:
#-----------------------------------------------------------------------
# Extension SYS_SetMemory_v2_54
#-----------------------------------------------------------------------
# SYS function for setting 1..256 bytes
#
# sysArgs[0] Copy count (destructive)
# sysArgs[1] Copy value
# sysArgs[2:3] Destination address (destructive)
#
# Sets up to 8 bytes per invocation before restarting itself through vCPU.
# Doesn't cross a page boundary: the low address byte wraps within the page.
# Can run 3 times per 148-cycle time slice.
# Combined, that gives a 3x speedup over ROMv4 and earlier.
label('SYS_SetMemory_v2_54')
ld([sysArgs+0]) #15
bra('sys_SetMemory#18') #16
ld([sysArgs+2],X) #17
# SYS_SetMemory_v2_54 implementation
label('sys_SetMemory#18')
ld([sysArgs+3],Y) #18
ble('.sysSm#21') #19
suba(8) #20
bge('.sysSm#23') #21
st([sysArgs+0]) #22
anda(4) #23 4 pixels
beq('.sysSm#26') #24 Skip when bit 2 of the count is clear
ld([sysArgs+1]) #25
st([Y,Xpp]) #26
st([Y,Xpp]) #27
st([Y,Xpp]) #28
bra('.sysSm#31') #29
st([Y,Xpp]) #30
label('.sysSm#26')
wait(31-26) #26
label('.sysSm#31')
ld([sysArgs+0]) #31
anda(2) #32 2 pixels
beq('.sysSm#35') #33 Skip when bit 1 of the count is clear
ld([sysArgs+1]) #34
st([Y,Xpp]) #35
bra('.sysSm#38') #36
st([Y,Xpp]) #37
label('.sysSm#35')
wait(38-35) #35
label('.sysSm#38')
ld([sysArgs+0]) #38 1 pixel
anda(1) #39
beq(pc()+3) #40 Bit 0 clear: take the ld([Y,X]) no-op path
bra(pc()+3) #41
ld([sysArgs+1]) #42
ld([Y,X]) #42(!)
st([Y,X]) #43
ld(hi('NEXTY'),Y) #44
jmp(Y,'NEXTY') #45
ld(-48/2) #46
label('.sysSm#21')
nop() #21 AC already holds count-8 from the ble delay slot
st([sysArgs+0]) #22
label('.sysSm#23')
ld([sysArgs+1]) #23 8 pixels
st([Y,Xpp]) #24
st([Y,Xpp]) #25
st([Y,Xpp]) #26
st([Y,Xpp]) #27
st([Y,Xpp]) #28
st([Y,Xpp]) #29
st([Y,Xpp]) #30
st([Y,Xpp]) #31
ld(8) #32
adda([sysArgs+2]) #33
st([sysArgs+2]) #34
ld([sysArgs+0]) #35
beq(pc()+3) #36 Count exhausted: skip the self-restart
bra(pc()+3) #37
ld(-2) #38
ld(0) #38(!)
adda([vPC]) #39
st([vPC]) #40
ld(hi('REENTER'),Y) #41
jmp(Y,'REENTER') #42
ld(-46/2) #43
Some details in the draft above are off, but the concept works and is now integrated: https://github.com/kervinck/gigatron-rom/commit/3fd015886ecaa6adc91c9195ccf639dc48a7fb67
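To make the count handling concrete, here is a behavioural Python model of the concept (not cycle-accurate and not the ROM code). Each dispatch either writes 8 bytes and asks to restart, or, when 1..7 bytes remain, writes them as 4 + 2 + 1 according to the count bits and returns. Assumptions: RAM is a 64 KiB `bytearray`, a count of 0 means 256 (as in "1..256 bytes"), and the helper names `sys_set_memory_dispatch` and `set_memory` are hypothetical.

```python
# Behavioural sketch of the SYS_SetMemory_v2_54 idea; hypothetical helper names.
# Only the low destination byte advances, so writes wrap within one page.

def sys_set_memory_dispatch(ram, args):
    """One SYS invocation. args = [count, value, dest_lo, dest_hi], mutated
    in place like sysArgs[0..3]. Returns True if the call would restart itself."""
    count, value, lo, hi = args
    if 1 <= count <= 7:
        # Tail: write 4, 2 and/or 1 bytes depending on the count bits.
        x = lo
        for bit in (4, 2, 1):
            if count & bit:
                for _ in range(bit):
                    ram[(hi << 8) | x] = value
                    x = (x + 1) & 0xFF  # wraps within the page
        args[0] = (count - 8) & 0xFF    # count is destroyed, as documented
        return False
    # Fast lane: count is 0 (meaning 256) or >= 8, so write 8 bytes.
    for i in range(8):
        ram[(hi << 8) | ((lo + i) & 0xFF)] = value
    args[0] = (count - 8) & 0xFF
    args[2] = (lo + 8) & 0xFF           # advance the destination low byte
    return args[0] != 0                 # restart only while bytes remain


def set_memory(ram, count, value, dest):
    """Repeat dispatches until done; returns the number of dispatches used."""
    args = [count & 0xFF, value & 0xFF, dest & 0xFF, (dest >> 8) & 0xFF]
    dispatches = 1
    while sys_set_memory_dispatch(ram, args):
        dispatches += 1
    return dispatches


if __name__ == "__main__":
    ram = bytearray(65536)
    print(set_memory(ram, 13, 0x3F, 0x0800))  # 13 = 8 + 4 + 1, two dispatches
```

A 13-byte fill takes one 8-byte dispatch followed by one tail dispatch, and a count of 0 fills a whole page in 32 dispatches.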
Without breaking the interface, we can go to 8 bytes per dispatch (instead of 4) and 3 dispatches per 148-cycle time slice (instead of 2). That is 24 bytes per time slice instead of 8, a threefold speedup.
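The speedup arithmetic can be checked with a few lines of Python. The per-dispatch and per-slice figures come from the text; the assumption that every 148-cycle time slice is fully packed with dispatches is an idealisation, and the helper names are hypothetical.

```python
import math

# Figures quoted above; full packing of each time slice is an assumption.
OLD = {"bytes_per_dispatch": 4, "dispatches_per_slice": 2}  # ROMv4 and earlier
NEW = {"bytes_per_dispatch": 8, "dispatches_per_slice": 3}  # SYS_SetMemory_v2_54


def bytes_per_slice(cfg):
    """Bytes written per 148-cycle time slice."""
    return cfg["bytes_per_dispatch"] * cfg["dispatches_per_slice"]


def slices_needed(cfg, nbytes):
    """Time slices needed to fill nbytes, rounding partial dispatches up."""
    dispatches = math.ceil(nbytes / cfg["bytes_per_dispatch"])
    return math.ceil(dispatches / cfg["dispatches_per_slice"])


if __name__ == "__main__":
    print(bytes_per_slice(OLD), bytes_per_slice(NEW))       # 8 24
    print(bytes_per_slice(NEW) / bytes_per_slice(OLD))      # 3.0
    print(slices_needed(OLD, 256), slices_needed(NEW, 256)) # 32 11
```

For a full 256-byte page the idealised count drops from 32 time slices to 11; the ratio is slightly under 3 there only because 32 dispatches do not divide evenly into groups of 3.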