Closed Mr-Bajs closed 2 years ago
It mostly depends on how game devs use the TIC resources, don't know what we could do here. Also, please try to test the nightly build, it should run a bit faster https://nightly.link/nesbox/TIC-80/workflows/build/master. Thanks
I am having the same experience, running TIC-80 through retroarch. The biggest problem so far has been with sprites - in bunnymark I get roughly 75 before the music starts to break. My uneducated suspicion is that each sprite issues its own draw call which quickly tanks performance.
version: 0.90.1748
EDIT: Looks like it couldn't be draw calls since drawSprite() just blits the sprite directly into the framebuffer. Maybe the RG351M's CPU is weak enough for that to become a concern?
@nesbox I am using the retroarch tic80 core on the anbernic, but I don't see the core in the nightly releases, or instructions on how to build the libretro core for arm. How would I test the nightly build?
Maybe @RobLoach can answer the question :) He works with the retroarch build
Is that build through Lakka? How are you getting the tic-80 core? The hardware is pretty low, and 75 sprites is a pretty good amount 😋
I'm running Arkos. I got the tic80 core through retroarch. I feel like the hardware could do more than 75 sprites, but I could be wrong.
It is, but the level of indirection in our drawing code is... Let's take scale!=1 (bunny mark), which is the worst case I think: (this is only the C stack, and not counting any overhead in the language runtimes themselves)
This is likely easily 10-20x slower than what it would take to just move the pixels onto the raw hardware. But for all the flexibility we offer there is a reason for all (most?) of these abstraction layers.
Some things we could do to improve speed (like work with a byte buffer instead of nibbles) are impossible because of our "hardware" design and the fact that our VRAM is 4-bit. If someone was seriously interested in a project here there are probably wins to be had, but I'm not sure they would come easy or without adding further complexity.
Yes, I do realize the compiler likely optimizes some of this (esp the bottom).
@desttinghim If you change the bunny size to 1,1 instead of 2 or 3x how many sprites does it render per frame?
I think perhaps this is the fault of our games (and that we have no limits and don't set any expectations of what people should be doing)... I got curious and just dug into PICO-8 cost accounting: https://pico-8.fandom.com/wiki/CPU
For example you have "139,810 cycles per frame at 60 FPS"... and sprites cost 2 cycles per pixel... so an 8x8 sprite is 64 pixels, or 128 cycles... If we look at bunnymark, the bunnies are rendered at 4x, making each bunny 32x32 (1,024 pixels)... giving it a PICO-8 cycle cost of 2,048. If we just plug this into PICO-8 we'd find we could have 68 bunnies per frame at 100% CPU burn. (That's assuming zero cost for all the Lua code itself.)
> and 75 sprites is a pretty good amount
So I think based on PICO-8 at least that this statement isn't too far off. I can draw 700-800 bunnies on my laptop (Debug build) at 60 fps.
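The PICO-8 estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the two constants are the figures quoted in this thread from the PICO-8 wiki, the function names are made up for illustration, and (as stated above) the cost of the Lua code itself is ignored.

```python
# Figures quoted in this thread from the PICO-8 CPU wiki page.
CYCLES_PER_FRAME = 139_810  # cycle budget per frame at 60 FPS
CYCLES_PER_PIXEL = 2        # sprite drawing cost per pixel


def sprite_cost(width_px, height_px):
    """Cycle cost of drawing one sprite of the given pixel size."""
    return width_px * height_px * CYCLES_PER_PIXEL


def sprites_per_frame(width_px, height_px):
    """Whole sprites that fit in one frame's cycle budget."""
    return CYCLES_PER_FRAME // sprite_cost(width_px, height_px)


print(sprite_cost(8, 8))          # 128
print(sprite_cost(32, 32))        # 2048
print(sprites_per_frame(32, 32))  # 68
```

By the same accounting an 8x8 sprite would fit 1,092 times per frame, which is why the rendered size of each bunny matters so much for the count.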
> From maybe 10 games that I tried only Witch 'em Up and Cauliflower Power runs smooth.
I wonder if you could provide a bit more detail about which exact games - and could anyone else on the platform confirm the same slow results? What about playing panda, is it also slow? I guess I'm just wondering WHY they are so slow... and saying that 75 sprites per frame isn't "slow"... are they slow because they are using 200 sprites, or for some other reason?
> @desttinghim If you change the bunny size to 1,1 instead of 2 or 3x how many sprites does it render per frame?
Scale is 1, in the lua example at least, so I may be misunderstanding you. I did try changing the W and H from 4 to 1, which does let me get to around 450-500 sprites drawn at a time, but that does decrease the required iterations to 1/16 of the original benchmark.
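To compare the two measurements on the same footing, here is a rough sketch of the raw pixels touched per frame. It assumes 8x8 tiles and uses the sprite counts reported in this thread (about 75 at W=H=4, 450-500 at W=H=1); the function name is hypothetical.

```python
TILE = 8  # assumed tile edge in pixels


def pixels_per_frame(sprites, w_tiles, h_tiles):
    """Total pixels written per frame for a given sprite count and size."""
    return sprites * (w_tiles * TILE) * (h_tiles * TILE)


print(pixels_per_frame(75, 4, 4))   # 76800
print(pixels_per_frame(450, 1, 1))  # 28800
print(pixels_per_frame(500, 1, 1))  # 32000
```

This is arithmetic only, not a profile. It does suggest that the small-sprite run moves fewer total pixels per frame even at the higher sprite count, so some of the cost may be per sprite rather than per pixel.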
If we make the assumption that most people rarely read from the framebuffer, we could probably batch draw requests until the end of the frame (or read) and do all the updates at once. This would allow utilizing the GPU for drawing. It might even allow optimization on the CPU if we use the batching to render sprites in reverse order and use a depth buffer to prevent overdraw. This would be more overhead in general, but for games that heavily use sprites it should give better FPS. Though I'm mostly spitballing at the moment.
> If we make the assumption that most people rarely read from the framebuffer
That's a huge assumption and one I personally wouldn't be prepared to make. A lot of effects can be achieved by drawing to the buffer, then reading it back with peek/poke (at larger scale and doing specific transforms)... that requires the data to be in the buffer. If the buffer was "write only" there would be room for more optimizations because we wouldn't need to store it in 4-bit format.
Remember, this is all virtual. You're not even writing to the real frame buffer (because that frame buffer is 24/32-bit color for the actual hardware, etc)... you're just toggling nibbles of RAM in the TIC-80 address space - that's all any of the draw commands do. After TIC is finished, the entire buffer is translated to the actual 24/32-bit color frame buffer and then finally sent to the actual GPU. I'd imagine that part is crazy fast already.
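As an illustration of what "toggling nibbles" means, here is a minimal sketch of a 4-bit packed screen buffer and the end-of-frame expansion to full colors. It is not TIC-80's actual code: the function names are hypothetical, and the nibble order (even pixel in the low nibble) is an assumption made for the example.

```python
WIDTH, HEIGHT = 240, 136  # TIC-80 screen size in pixels


def make_vram():
    """Two 4-bit pixels per byte, so half a byte per pixel."""
    return bytearray(WIDTH * HEIGHT // 2)


def set_pixel(vram, x, y, color):
    index = y * WIDTH + x
    byte = index >> 1
    if index & 1:  # odd pixel: high nibble (assumed order)
        vram[byte] = (vram[byte] & 0x0F) | ((color & 0x0F) << 4)
    else:          # even pixel: low nibble
        vram[byte] = (vram[byte] & 0xF0) | (color & 0x0F)


def get_pixel(vram, x, y):
    index = y * WIDTH + x
    byte = vram[index >> 1]
    return (byte >> 4) if index & 1 else (byte & 0x0F)


def present(vram, palette):
    """Expand the 4-bit buffer to one palette color per pixel."""
    return [palette[get_pixel(vram, x, y)]
            for y in range(HEIGHT) for x in range(WIDTH)]


vram = make_vram()
set_pixel(vram, 0, 0, 5)
set_pixel(vram, 1, 0, 9)
print(hex(vram[0]))  # 0x95
```

Every pixel write is a read-modify-write on a shared byte, which is the extra work a plain byte-per-pixel buffer would avoid.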
> batch draw requests until the end of the frame (or read)
"until read" is a slightly more interesting thought.
> Though I'm mostly spitballing at the moment.
Really it's all spitballing until someone profiles where the actual hot paths are at or we have better benchmarks.
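For what it's worth, the "batch until the end of the frame (or read)" idea could look roughly like the sketch below. Everything here is hypothetical (class name, methods, a plain list standing in for the framebuffer); it only shows the control flow, where draws are queued and applied the first time something reads a pixel or the frame ends.

```python
class DeferredCanvas:
    """Queues draw commands and applies them lazily."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.pixels = [0] * (width * height)
        self.pending = []   # queued (x, y, w, h, color) rectangles
        self.flushes = 0    # how many times the queue was applied

    def draw_rect(self, x, y, w, h, color):
        self.pending.append((x, y, w, h, color))  # no pixel work yet

    def flush(self):
        if not self.pending:
            return
        for x, y, w, h, color in self.pending:
            x0, x1 = max(x, 0), min(x + w, self.width)
            y0, y1 = max(y, 0), min(y + h, self.height)
            if x1 <= x0:
                continue  # fully clipped horizontally
            for row in range(y0, y1):
                start = row * self.width
                self.pixels[start + x0:start + x1] = [color] * (x1 - x0)
        self.pending.clear()
        self.flushes += 1

    def peek(self, x, y):
        self.flush()  # a read forces all queued draws to land first
        return self.pixels[y * self.width + x]

    def end_frame(self):
        self.flush()


canvas = DeferredCanvas(8, 8)
canvas.draw_rect(0, 0, 2, 2, 3)
canvas.draw_rect(1, 1, 2, 2, 7)
print(canvas.peek(1, 1))  # 7
```

A cart that peeks after every draw would flush every time and gain nothing, which is the concern raised above about assuming reads are rare.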
I think just 'making it faster' might be the wrong goal - then people will just build more awesome and slower games (that's the software life cycle). I think we really should consider having a fixed speed so that carts perform the same everywhere. If someone is developing on their 20GHz M3 Mac the game should play at the same fast 60FPS as on a handheld... it shouldn't be 200 times faster. But that would require deciding what these limits are... So the question might not be "how many can we draw?" but rather "how many is reasonable?"
That said it's possible something is very broken on that device if none of the games run well...
Might be helpful if we have a consistent benchmark people could run across devices to get a TIC-80 score of some kind. :-)
(more than just bunnies since it's kind of subjective in some ways)
With the changes I made to bunnymark I got like 20000 sprites on my ryzen 3600 desktop
The benchmark I'm imagining would have some idea of what a device SHOULD be capable of... so you'd run it and it'd tell you if your device could run cartridges well or not (and within what margin). Of course that's impossible to know if we don't artificially limit the cartridges. :-)
I imagine you could write an (artificial) game right now that would work on your powerful desktop but not on my laptop...
I guess all I'm really saying is we need to know the # of bunnies (or some mixed metric)... so you run it and get 75 bunnies and we go "that's great!, 50 is the standard"... or "oh crap, something is wrong, you need 120 bunnies to run most games"... it doesn't matter if you can draw 100 or 100,000 if 50 was the gold standard. :)
Right now it seems we really don't know - just that that device seems slow.
> Maybe the RG351M's CPU is weak enough for that to become a concern?
I just looked at the libretro code and all we do is pass our RGB1555 buffer directly to libretro's screen render callback... so that should take like no time at all... so the slowdown you're seeing is definitely all CPU burn.
Ok, this is slightly off topic now but I've been looking into where I got my copy of the TIC-80 retroarch core and I thought I'd document it. @RobLoach this answers your question from before.
ArkOS configures retroarch to get its binaries from https://github.com/christianhaitian/retroarch-cores
That in turn appears to be built with https://github.com/christianhaitian/rk3326_core_builds
It looks like it uses a qemu container to do cross compilation. The latest commit for the tic80 core says it is based on 1ae3c5.
That's a libretro hash, but if it's the release as it says, then tracing that back to the origin 0.90.1723 tag gives:
commit 9c38a8063081605e7265069bf9c731c090f2e841 (tag: v0.90.1723, tag: 0.90.1723)
Author: nesbox <grigoruk@gmail.com>
Date: Fri Jul 23 13:26:46 2021 +0300
preparation for release
Then it's from July of last year.
Okay, I managed to get a cross-compiler working by installing the aarch64-linux-gnu-gcc package (Arch Linux) and using this toolchain file:
# Target platform: 64-bit ARM Linux
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)
# Cross-compilers provided by the aarch64-linux-gnu-gcc package
set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)
# Where the target's headers, libraries and tools live
set(CMAKE_FIND_ROOT_PATH /usr/aarch64-linux-gnu)
set(CMAKE_INCLUDE_PATH /usr/aarch64-linux-gnu/include)
set(CMAKE_LIBRARY_PATH /usr/aarch64-linux-gnu/lib)
set(CMAKE_PROGRAM_PATH /usr/aarch64-linux-gnu/bin)
# Use host programs, but search only the target root for libraries, headers and packages
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
Then (from build):
cmake -DBUILD_PLAYER=OFF -DBUILD_SOKOL=OFF -DBUILD_SDL=OFF -DBUILD_DEMO_CARTS=OFF -DBUILD_LIBRETRO=ON ..
And make.
I don't think there's a difference in performance. I've figured out how to display FPS in retroarch, so I'll start listing what I'm seeing here.
EDIT: It looks like Bunnymark starts dropping frames even sooner than I thought. Around 30 or so.
I wonder if you tried building with -Os vs -O2 vs -O3, whether that would make any difference? Can you confirm you're using the best/correct march flags for your CPU?
Also, pls don't forget to add the -DCMAKE_BUILD_TYPE=MinSizeRel or -DCMAKE_BUILD_TYPE=Release option to get the best performance.
Oh, I found us 15% more performance too - that should help with some of those numbers close to 60. PR forthcoming. ;-)
> which does let me get to around 450-500 sprites drawn at a time,
Ok (now that I'm thinking about it), that's slow I think (in an absolute sense)... meaning for whatever reason your hardware is performing quite slowly IMHO. I just opened up Panda to read the source... and it has parallax backgrounds... It only draws from the top of the mountains down, but that's 70% of the screen so you're [worst case] painting 17*30*1.8... ~900 sprites per frame or so (for the map/bg)...
I don't think essentially painting the background twice should make a game drop frames on slower hardware... so if you count two layer BG + foreground sprites I think 1,200 minimum would be a good bottom rung.
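The background estimate above works out as follows. This sketch assumes the 240x136 screen divided into 8x8 map cells (giving the 30 by 17 grid) and takes the 1.8 layer factor and the 1,200 target straight from the comment; the variable names are illustrative only.

```python
SCREEN_W, SCREEN_H, TILE = 240, 136, 8

cols, rows = SCREEN_W // TILE, SCREEN_H // TILE  # 30 x 17 map cells
tiles_per_layer = cols * rows                    # one full-screen layer
LAYER_FACTOR = 1.8                               # factor used in the comment above

background_tiles = round(tiles_per_layer * LAYER_FACTOR)
SUGGESTED_MINIMUM = 1200                         # proposed "bottom rung"

print(tiles_per_layer)                       # 510
print(background_tiles)                      # 918
print(SUGGESTED_MINIMUM - background_tiles)  # 282 left for foreground sprites
```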
Can you run benchmark.tic on your hardware and get the numbers for the various categories?
I haven't yet tried building in release mode, so this may be part of the problem. I don't think the Anbernic build is using the Release option (https://github.com/christianhaitian/rk3326_core_builds/blob/main/scripts/tic-80.sh#L45) so this may be the cause of the issue in the first place.
Yeah, makes you wonder if it should perhaps be default. :-)
> this may be the cause of the issue in the first place.
Likely. My release build hits 1200 bunnies (max my Zig can draw) without breaking a sweat when it was stuck at 700 in debug mode...
Submitted a change to Lakka to build in Release: https://bit.ly/3Fd0w5k .... Unsure if LibreELEC's build scripts do that by default, but :shrug:
This was definitely the issue. Getting a solid 60FPS for 8 Bit Panda and Supernova now! @joshgoebel where is this benchmark.tic file?
Oh, maybe you need to run demo or demos to get it?
Oh, I see it now.
> This was definitely the issue.
Awesome. Closing this issue out.
Just reporting that I built a new tic-80 core using the build PR from @joshgoebel and I tested using Updog. Performance is definitely much better now. New core has been added to the repo. Thanks for the PR.
Awesome... so glad we could help!
What's Updog? :wink:
Cool tic-80 game. I switch between it and Caulipower. https://tic80.com/play?cart=1397
I just found out about TIC-80, so I downloaded a few games from https://tic80.com and added them to the folder on the SD card. But most games just stutter terribly or run in slow motion.
From maybe 10 games that I tried only Witch 'em Up and Cauliflower Power runs smooth.
v. 0.90.1748