nesbox / TIC-80

TIC-80 is a fantasy computer for making, playing and sharing tiny games.
https://tic80.com
MIT License

Anbernic RG351M with ArkOS - most games lag terribly #1739

Closed Mr-Bajs closed 2 years ago

Mr-Bajs commented 2 years ago

I just found out about TIC-80, so I downloaded a few games from https://tic80.com and added them to the folder on the SD card. But most games just stutter terribly or run in slow motion.

From maybe 10 games that I tried only Witch 'em Up and Cauliflower Power runs smooth.

v. 0.90.1748

nesbox commented 2 years ago

It mostly depends on how game devs use the TIC resources, don't know what we could do here. Also, please try to test the nightly build, it should run a bit faster https://nightly.link/nesbox/TIC-80/workflows/build/master. Thanks

desttinghim commented 2 years ago

I am having the same experience, running TIC-80 through retroarch. The biggest problem so far has been with sprites - in bunnymark I get roughly 75 before the music starts to break. My uneducated suspicion is that each sprite issues its own draw call which quickly tanks performance.

version: 0.90.1748

EDIT: Looks like it couldn't be draw calls since drawSprite() just blits the sprite directly into the framebuffer. Maybe the RG351M's CPU is weak enough for that to become a concern?

desttinghim commented 2 years ago

@nesbox I am using the retroarch tic80 core on the anbernic, but I don't see the core in the nightly releases, or instructions on how to build the libretro core for arm. How would I test the nightly build?

nesbox commented 2 years ago

Maybe @RobLoach can answer the question :) He works with the retroarch build

RobLoach commented 2 years ago

Is that build through Lakka? How are you getting the tic-80 core? The hardware is pretty low, and 75 sprites is a pretty good amount 😋

desttinghim commented 2 years ago

I'm running Arkos. I got the tic80 core through retroarch. I feel like the hardware could do more than 75 sprites, but I could be wrong.

joshgoebel commented 2 years ago

It is, but the level of indirection in our drawing code is... Let's take scale!=1 (bunny mark), which is the worst case I think: (this is only the C stack, not counting any overhead in the language runtimes themselves)

This is likely easily 10-20x slower than what it would take to just move the pixels onto the raw hardware. But for all the flexibility we offer there is a reason for all (most?) of these abstraction layers.

Some things we could do to improve speed (like work with a byte buffer instead of nibbles) are impossible because of our "hardware" design and the fact that our VRAM is 4-bit. If someone was seriously interested in a project here there are probably wins to be had, but I'm not sure they would come easy or without adding further complexity.

Yes, I do realize the compiler likely optimizes some of this (esp the bottom).

joshgoebel commented 2 years ago

@desttinghim If you change the bunny size to 1,1 instead of 2 or 3x how many sprites does it render per frame?

joshgoebel commented 2 years ago

I think perhaps this is the fault of our games (and that we have no limits and don't set any expectations of what people should be doing)... I got curious and just dug into PICO-8 cost accounting: https://pico-8.fandom.com/wiki/CPU

For example you have "139,810 cycles per frame at 60 FPS"... and sprites cost per-pixel * 2... so an 8x8 sprite is 64 pixels, so 128 cycles.... so if we look at bunnymark, the bunnies are rendered at 4x, making each bunny 32x32... giving it a PICO-8 cycle cost of 2,048. If we just plug this into PICO-8 we'd find we could have 68 bunnies per frame at 100% CPU burn. (that's assuming zero cost for all the Lua code itself)
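The arithmetic above can be checked with a short Python sketch. The two constants are the figures quoted in this comment; the function names are hypothetical and only serve the illustration.

```python
# Figures quoted in the comment above (taken from the PICO-8 CPU wiki page).
CYCLES_PER_FRAME = 139_810  # PICO-8 cycle budget per frame at 60 FPS
CYCLES_PER_PIXEL = 2        # sprite cost per pixel, as quoted above


def sprite_cost(width: int, height: int) -> int:
    """Cycle cost of drawing one width x height sprite."""
    return width * height * CYCLES_PER_PIXEL


def sprites_per_frame(width: int, height: int) -> int:
    """Whole sprites that fit in one frame, assuming the Lua code is free."""
    return CYCLES_PER_FRAME // sprite_cost(width, height)


print(sprite_cost(8, 8))          # 128
print(sprite_cost(32, 32))        # 2048
print(sprites_per_frame(32, 32))  # 68
```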

> and 75 sprites is a pretty good amount

So I think based on PICO-8 at least that this statement isn't too far off. I can draw 700-800 bunnies on my laptop (Debug build) at 60 fps.

> From maybe 10 games that I tried only Witch 'em Up and Cauliflower Power runs smooth.

I wonder if you couldn't provide a bit more detail about which exact games - and could anyone else on the platform confirm the same slow results? What about playing panda, is it also slow? I guess I'm just wondering WHY they are so slow... and saying that 75 sprites per frame isn't "slow"... are they slow because they are using 200 sprites, or for some other reason?

desttinghim commented 2 years ago

> @desttinghim If you change the bunny size to 1,1 instead of 2 or 3x how many sprites does it render per frame?

Scale is 1, in the Lua example at least, so I may be misunderstanding you. I did try changing the W and H from 4 to 1, which does let me get to around 450-500 sprites drawn at a time, but that does decrease the required iterations to 1/16 of the original benchmark.

If we make the assumption that most people rarely read from the framebuffer, we could probably batch draw requests until the end of the frame (or read) and do all the updates at once. This would allow utilizing the GPU for drawing. It might even allow optimization on the CPU if we use the batching to render sprites in reverse order and use a depth buffer to prevent overdraw. This would be more overhead in general, but for games that heavily use sprites it should give better FPS. Though I'm mostly spitballing at the moment.
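As a rough illustration of the "batch until end of frame (or read)" idea, here is a minimal Python sketch. It is purely hypothetical: the class and method names are invented for this example and do not exist in TIC-80, and it only models the ordering logic (queue draws, flush on read), not GPU rendering or a depth buffer.

```python
# Hypothetical sketch only: none of these names exist in TIC-80.
WIDTH, HEIGHT = 240, 136  # TIC-80 screen size in pixels


class DeferredScreen:
    def __init__(self):
        self.pixels = [0] * (WIDTH * HEIGHT)  # one palette index per pixel
        self.pending = []                     # queued (x, y, w, h, color) draws

    def rect(self, x, y, w, h, color):
        """Queue a filled rectangle instead of drawing it right away."""
        self.pending.append((x, y, w, h, color))

    def flush(self):
        """Apply queued draws in submission order, then clear the queue."""
        for x, y, w, h, color in self.pending:
            for py in range(max(y, 0), min(y + h, HEIGHT)):
                for px in range(max(x, 0), min(x + w, WIDTH)):
                    self.pixels[py * WIDTH + px] = color
        self.pending.clear()

    def pix(self, x, y):
        """Reading a pixel forces a flush so the result is up to date."""
        self.flush()
        return self.pixels[y * WIDTH + x]

    def end_frame(self):
        """Flush whatever is still queued and hand back the finished frame."""
        self.flush()
        return self.pixels


screen = DeferredScreen()
screen.rect(0, 0, 8, 8, 3)
screen.rect(4, 4, 8, 8, 5)
print(len(screen.pending))  # 2, nothing has been drawn yet
print(screen.pix(5, 5))     # 5, the later draw wins after the forced flush
print(len(screen.pending))  # 0
```

The cost of this design is that every read path has to remember to flush first, which is the extra overhead mentioned above.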

joshgoebel commented 2 years ago

> If we make the assumption that most people rarely read from the framebuffer

That's a huge assumption and one I personally wouldn't be prepared to make. A lot of effects can be achieved by drawing to the buffer, then reading it back with peek/poke (at larger scale and doing specific transforms)... that requires the data to be in the buffer. If the buffer was "write only" there would be room for more optimizations because we wouldn't need to store it in 4-bit format.

Remember, this is all virtual. You're not even writing to the real frame buffer (because that frame buffer is 24/32-bit color for the actual hardware, etc)... you're just toggling nibbles of RAM in the TIC-80 address space - that's all any of the draw commands do. After TIC is finished, the entire buffer is translated to the actual 24/32-bit color frame buffer and then finally sent to the actual GPU. I'd imagine that part is crazy fast already.
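To make the "toggling nibbles" description concrete, here is a minimal Python sketch, not TIC-80's actual code. The 240x136 screen size is TIC-80's; the nibble order (even pixel in the low nibble), the function names, and the grayscale palette are assumptions made for illustration.

```python
WIDTH, HEIGHT = 240, 136

# 4 bits per pixel, so two pixels share each byte: 16,320 bytes in total.
vram = bytearray(WIDTH * HEIGHT // 2)


def set_pixel(x: int, y: int, color: int) -> None:
    """Write a 4-bit palette index; assumes even pixels use the low nibble."""
    index = y * WIDTH + x
    byte = vram[index >> 1]
    if index & 1:
        vram[index >> 1] = (byte & 0x0F) | ((color & 0x0F) << 4)
    else:
        vram[index >> 1] = (byte & 0xF0) | (color & 0x0F)


def get_pixel(x: int, y: int) -> int:
    """Read back the 4-bit palette index stored for one pixel."""
    index = y * WIDTH + x
    byte = vram[index >> 1]
    return (byte >> 4) if index & 1 else (byte & 0x0F)


def to_rgb(palette: list) -> list:
    """Once per frame: expand every nibble to a 0xRRGGBB value via the palette."""
    return [palette[get_pixel(x, y)] for y in range(HEIGHT) for x in range(WIDTH)]


# Hypothetical 16-entry grayscale palette, just to have something to look up.
palette = [(i * 17) * 0x010101 for i in range(16)]

set_pixel(0, 0, 0xA)
set_pixel(1, 0, 0x3)
print(hex(vram[0]))       # 0x3a, two pixels packed into one byte
frame = to_rgb(palette)
print(hex(frame[0]))      # 0xaaaaaa
```

Every write is a read-modify-write on a shared byte, which is the nibble overhead a byte-per-pixel buffer would avoid.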

> batch draw requests until the end of the frame (or read)

"until read" is a slightly more interesting thought.

> Though I'm mostly spitballing at the moment.

Really it's all spitballing until someone profiles where the actual hot paths are at or we have better benchmarks.

joshgoebel commented 2 years ago

I think just 'making it faster' might be the wrong goal - then people will just build more awesome and slower games (that's software life cycle). I think we really should consider having a fixed speed so that carts perform the same everywhere. If someone is developing on their 20GHz M3 Mac the game should play the same fast 60FPS as on a handheld... it shouldn't be 200 times faster. But that would require deciding what these limits are... So the question might not be "how many can we draw" but rather "how many is reasonable?"

That said it's possible something is very broken on that device if none of the games run well...

Might be helpful if we have a consistent benchmark people could run across devices to get a TIC-80 score of some kind. :-)

(more than just bunnies since it's kind of subjective in some ways)

desttinghim commented 2 years ago

With the changes I made to bunnymark I got like 20000 sprites on my ryzen 3600 desktop

joshgoebel commented 2 years ago

The benchmark I'm imagining would have some idea of what a device SHOULD be capable of... so you'd run it and it'd tell you if your device could run cartridges well or not (and within what margin). Of course that's impossible to know if we don't artificially limit the cartridges. :-)

I imagine you could write an (artificial) game right now that would work on your powerful desktop but not on my laptop...

I guess all I'm really saying is we need to know the # of bunnies (or some mixed metric)... so you run it and get 75 bunnies and we go "that's great! 50 is the standard".... or "oh crap, something is wrong, you need 120 bunnies to run most games"... it doesn't matter if you can draw 100 or 100,000 if 50 was the gold standard. :)

Right now it seems we really don't know - just that that device seems slow.

joshgoebel commented 2 years ago

> Maybe the RG351M's CPU is weak enough for that to become a concern?

I just looked at the libretro code and all we do is pass our RGB1555 buffer directly to libretro's screen render callback... so that should take like no time at all... so the slowdown you're seeing is definitely all CPU burn.

desttinghim commented 2 years ago

Ok, this is slightly off topic now but I've been looking into where I got my copy of the TIC-80 retroarch core and I thought I'd document it. @RobLoach this answers your question from before.

ArkOS configures retroarch to get its binaries from https://github.com/christianhaitian/retroarch-cores That in turn appears to be built with https://github.com/christianhaitian/rk3326_core_builds It looks like it uses a qemu container to do cross compilation. The latest commit for the tic80 core says it is based on 1ae3c5

joshgoebel commented 2 years ago

That's a libretro hash, but if it's the release as it says then tracing that back to origin 0.90.1723 tag:

commit 9c38a8063081605e7265069bf9c731c090f2e841 (tag: v0.90.1723, tag: 0.90.1723)
Author: nesbox <grigoruk@gmail.com>
Date:   Fri Jul 23 13:26:46 2021 +0300

    preparation for release

Then it's from July of last year.

desttinghim commented 2 years ago

Okay, I managed to get a cross-compiler working by installing the aarch64-linux-gnu-gcc package (Arch Linux) and using this toolchain file:

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(CMAKE_C_COMPILER    aarch64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER  aarch64-linux-gnu-g++)

set(CMAKE_FIND_ROOT_PATH    /usr/aarch64-linux-gnu)
set(CMAKE_INCLUDE_PATH      /usr/aarch64-linux-gnu/include)
set(CMAKE_LIBRARY_PATH      /usr/aarch64-linux-gnu/lib)
set(CMAKE_PROGRAM_PATH      /usr/aarch64-linux-gnu/bin)

set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)

set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)

Then (from build)

cmake -DBUILD_PLAYER=OFF -DBUILD_SOKOL=OFF -DBUILD_SDL=OFF -DBUILD_DEMO_CARTS=OFF -DBUILD_LIBRETRO=ON ..

And make.

I don't think there's a difference in performance. I've figured out how to display FPS in retroarch, so I'll start listing what I'm seeing here.

EDIT: It looks like Bunnymark starts dropping frames even sooner than I thought. Around 30 or so.

joshgoebel commented 2 years ago

I wonder whether building with -Os vs -O2 vs -O3 would make any difference? Can you confirm you're using the best/correct -march flags for your CPU?

nesbox commented 2 years ago

Also, pls don't forget to add -DCMAKE_BUILD_TYPE=MinSizeRel or -DCMAKE_BUILD_TYPE=Release options to get best performance.

joshgoebel commented 2 years ago

Oh, I found us 15% more performance too - that should help with some of those numbers close to 60. PR forthcoming. ;-)

joshgoebel commented 2 years ago

> which does let me get to around 450-500 sprites drawn at a time,

Ok (now that I'm thinking about it), that's slow I think (in an absolute sense)... meaning for whatever reason your hardware is performing quite slowly IMHO. I just opened up Panda to read the source... and it has parallax backgrounds... It only draws from the top of the mountains down, but that's 70% of the screen so you're [worst case] painting 17*30*1.8... ~900 sprites per frame or so (for the map/bg)...
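For reference, the 17*30 in that estimate matches the number of 8x8 tiles on TIC-80's 240x136 screen. A small Python sketch of the arithmetic, taking the 1.8 overdraw factor from the comment as-is (the variable names are hypothetical):

```python
SCREEN_W, SCREEN_H = 240, 136  # TIC-80 screen size in pixels
TILE = 8                       # tile/sprite edge length in pixels

cols = SCREEN_W // TILE        # 30 tiles across
rows = SCREEN_H // TILE        # 17 tiles down
tiles_per_screen = cols * rows

OVERDRAW = 1.8                 # factor used in the comment above
background_sprites = round(tiles_per_screen * OVERDRAW)

print(cols, rows, tiles_per_screen)  # 30 17 510
print(background_sprites)            # 918
```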

I don't think essentially painting the background twice should make a game drop frames on slower hardware... so if you count two layer BG + foreground sprites I think 1,200 minimum would be a good bottom rung.

joshgoebel commented 2 years ago

Can you run benchmark.tic on your hardware and get the numbers for the various categories?

desttinghim commented 2 years ago

I haven't yet tried building in release mode, so this may be part of the problem. I don't think the Anbernic build is using the Release option (https://github.com/christianhaitian/rk3326_core_builds/blob/main/scripts/tic-80.sh#L45) so this may be the cause of the issue in the first place.

joshgoebel commented 2 years ago

Yeah, makes you wonder if it should perhaps be default. :-)

> this may be the cause of the issue in the first place.

Likely. My release build hits 1200 bunnies (max my Zig can draw) without breaking a sweat when it was stuck at 700 in debug mode...

RobLoach commented 2 years ago

Submitted a change to Lakka to build in Release: https://bit.ly/3Fd0w5k .... Unsure if LibreELEC's build scripts do that by default, but :shrug:

desttinghim commented 2 years ago

This was definitely the issue. Getting a solid 60FPS for 8 Bit Panda and Supernova now! @joshgoebel where is this benchmark.tic file?

joshgoebel commented 2 years ago

oh maybe you need to run demo or demos to get it?

desttinghim commented 2 years ago

Oh, I see it now.

joshgoebel commented 2 years ago

> This was definitely the issue.

Awesome. Closing this issue out.

christianhaitian commented 2 years ago

Just reporting that I built a new tic-80 core using the build PR from @joshgoebel and I tested using Updog. Performance is definitely much better now. New core has been added to the repo. Thanks for the PR.

joshgoebel commented 2 years ago

Awesome... so glad we could help!

RobLoach commented 2 years ago

What's Updog? :wink:

christianhaitian commented 2 years ago

Cool tic-80 game. I switch between it and Caulipower. https://tic80.com/play?cart=1397