overviewer / Minecraft-Overviewer

Render high-resolution maps of a Minecraft world with a Leaflet powered interface
https://overviewer.org/
GNU General Public License v3.0

New cache format #271

Closed eminence closed 13 years ago

eminence commented 13 years ago

One of the reasons Minecraft shifted from the old chunk format to the new region format was to reduce the number of small files that Minecraft had to open and read. Since Overviewer still does the same thing (creating a cache image for each chunk), we too should see a speed boost if we changed this. I think we have two options:

agrif commented 13 years ago

I vote rendering directly to tiles. Per-chunk rendering makes very little sense after the MCRegion switch, but it's hard to imagine that tile-rendering would ever have the same problem. It would move Overviewer from a two-pass renderer to one-pass, and as a bonus it consolidates all of Overviewer's output in one place. This is a good thing for multi-layer rendering, because we no longer have to worry about maintaining a separate cache for lighting, night, spawn, ...

If we keep the separate cache, we should start putting the map type into the cache filenames.

Either way this will be a large change to the rendering code, so look out for ways to change the "north" direction or maybe factor different render modes out into subclasses.

eminence commented 13 years ago

if we drop the intermediate chunk cache, how do we tell if a tile needs to be updated? multiple chunks will be part of a tile, so using a simple mtime check is not straightforward.

can we use some type of database to record the mtime of each chunk when we render it, so we know when to re-render it?

agrif commented 13 years ago

I was thinking you could just iterate over the chunks that make up each tile, and rerender the tile if any of their times are newer than the tile's time.

I just remembered, though, that Overviewer uses a hash check because there are a lot of chunk updates that don't affect rendering. I'm not sure what the performance difference is without hashes and with, but if it's important then I guess we'd need a database with hashes and render times for each chunk.

Fenixin commented 13 years ago

I vote direct to tiles. An image of 32x32 chunks will have a resolution of 12480x7872, and if a chunk changes you need to re-render all the chunks, or at least all the chunks below it.

About the way to check whether we should render a chunk or not: I think we can first check the mtime of the region file, and if it is older than the stored one or older than the database (as seen here) you can just skip the checks for those 1024 chunks.

In the database we can store: mtimes of region files, mtimes of chunks (as stored in the region file) and hashes of chunks. In this way we can iterate as agrif says, and if the mtime is older we don't check hashes, but if it's newer we can still check the hashes before rendering the whole tile with all the chunks. Hashes can cover light-info + blocks-info. What do you think?

And yes, it's a good moment to rethink everything for a north orientation and lower zoom maps. I have some ideas for north orientation in chunk.py and textures.py.

agrif commented 13 years ago

I'm not entirely clear on why we would need to store region mtimes in a database. Surely reading that off the filesystem isn't that slow? And reading the chunk mtimes should also be fast if we can optimize the checks so they only open each region file once -- they're stored, uncompressed, right near the top of the file in a list. Calling get_chunks() on a MCRFileReader will read them all in with only one seek, and cache them for calls to get_chunk_timestamp(x, y).
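For reference, that fast path boils down to one seek and one read of the region header's timestamp table. This is only a minimal sketch of the idea, not the actual MCRFileReader code, and it assumes the standard MCRegion layout (a 4 KiB location table followed by a 4 KiB table of 1024 big-endian chunk timestamps):

    import struct

    def read_chunk_timestamps(region_path):
        # sketch only: skip the 4 KiB location table, then read the whole
        # 4 KiB timestamp table (1024 big-endian 32-bit ints) in one go
        with open(region_path, 'rb') as f:
            f.seek(4096)
            data = f.read(4096)
        # the entry for chunk (x, z) within the region is at (x % 32) + (z % 32) * 32
        return struct.unpack('>1024i', data)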

I might be overlooking something obvious though. : /

Fenixin commented 13 years ago

Mmmh... maybe I'm the one overlooking something obvious. I want to store the region mtimes so I can check whether the mtimes have changed in the next update of the map. You can't compare against an old mtime if you don't have it, no?

For chunks, if we use tiles as a cache we don't have a file for every chunk, so we don't have an old mtime for every chunk; comparing mtimes is faster than computing hashes. Is there another way to know the previous mtime of a chunk?

Maybe, for the chunks, we can use the mtime of the tile, and re-render it if one of the chunks in the tile has a newer mtime?

Sorry if I'm wrong.

eminence commented 13 years ago

no, i think you're right agrif, i didn't think this all the way through. it goes something like this, i think:

when you render a tile, you set the mtime of the tile to the current time. when you next run Overviewer, you want to render any chunks that have an mtime that is newer than the mtime of the tile it would render to.

i had originally thought that it was possible for a chunk to be updated but have a timestamp that was still older than the mtime of the tile, but i don't think this is actually possible.

agrif commented 13 years ago

Yeah, eminence got what I was thinking. Maybe I should've explained my idea better. :P

When you render the tile, its mtime will always be greater than those of all the chunks it contains. You can tell when it needs to be rerendered because one of the chunks will have a later mtime than the tile. This is how unix 'make' handles file dependencies, BTW.
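In code, the make-style check being described amounts to something like this (a hypothetical helper, not the actual quadtree.py logic):

    import os

    def tile_needs_render(tile_path, chunk_mtimes):
        # a tile is stale if it doesn't exist yet, or if any chunk that
        # falls inside it was written more recently than the tile itself
        try:
            tile_mtime = os.path.getmtime(tile_path)
        except OSError:
            return True
        return any(m > tile_mtime for m in chunk_mtimes)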

Fenixin commented 13 years ago

agh... sorry if I'm slow with some stuff, my english and my programmer skills are not really good (yet).

Let me check that I understood this. So, the idea is to iterate over all the tiles, checking the mtimes of the chunks against the mtimes of the tiles, no? If so, that can be a really expensive process for an update of a big map: imagine a map with 100 region files where 7 region files have changed; to update it you still have to iterate over all the tiles and all the region files to look for the changed chunks. That's the point of storing the mtimes of the region files, to check before starting which regions have changed. Am I wrong or missing anything else?

agrif commented 13 years ago

No problem. On my part, I'm not exactly great at explaining things. :)

Right now, Overviewer already iterates over all the tiles in quadtree.py, where it checks the mtimes of the rendered chunk images. I can't imagine the mtime checks take that much time compared to the renders and image saving, but just as a quick test I wrote a python program that takes the mtime of all the files in my home directory (you can find it here.)

agrif@nara:~/devel$ python mtime_test.py ~/
72078 files in 1.695740 seconds: 42505.337291 files/second
72078 files in 1.619419 seconds: 44508.552538 files/second
72078 files in 1.618916 seconds: 44522.389716 files/second
...

So on a map with 10 zoom levels (including the base.png level) that's ~350000 tiles at 42000 files/second, or about 8.3 seconds of total mtime checking.

I'd like to get hard numbers out of Overviewer itself, but I really don't think checking mtimes is that expensive.
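The linked script isn't reproduced here, but a benchmark like that amounts to roughly the following (a sketch, with the repeat loop guessed from the repeated output above):

    import os, sys, time

    def scan(root):
        # walk the tree and stat every file, timing the whole pass
        count = 0
        start = time.time()
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    os.stat(os.path.join(dirpath, name))
                    count += 1
                except OSError:
                    pass
        elapsed = time.time() - start
        print "%d files in %f seconds: %f files/second" % (count, elapsed, count / elapsed)

    if __name__ == '__main__':
        for _ in range(4):   # repeated runs show the effect of the OS stat cache
            scan(sys.argv[1])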

Fenixin commented 13 years ago

Thanks a lot! You convinced me :)

agrif commented 13 years ago

I've started reorganizing some code to implement the method I described. I've already changed WorldRenderer (now just World) and QuadtreeGen to handle the tile rendering and mtime tracking, and now all that's left is changing ChunkRenderer to draw directly to the tile image instead of its own image. Then we can see how the render times compare!

I hope to push this code to a new branch on agrif/Minecraft-Overviewer sometime later tonight.

agrif commented 13 years ago

Alright, I've got a simple version working in my direct-to-tiles branch (edit: this commit, specifically). It only supports plain, nonlit, nonspawn, noncave rendering so far, but adding those in is just a matter of passing the flags to the QuadtreeGen class. Also, chunklists and eminence's signpost scanning code aren't working yet either, and those will take a little more effort.

Here's a table of render times for a quick test on my tiny little world, with no cache, after a small change, and then with no change at all:

type        direct-to-tiles    vs.    master    DTT/master
-----------------------------------------------------------
no cache    1m 42.8s                  2m 15.0s      76%
update      1m 40.6s                  1m 31.5s     110%
no change      4.69s                     4.56s     103%

The update times are not as good as I'd have liked, but the initial render time is pretty nice, and I have some ideas of where to improve all the times. This was a fairly quick set of changes, after all; there was a pretty nasty bug where each chunk would be rendered twice so there may be more of those floating around in there, too. I'd also like to see how these times scale to larger maps than my dinky little 4-region map. Even if I can't get the times any better, we'd have at least explored the idea and could come up with something better.

One advantage to this code is that it's way less complicated. There's no cache or hashes to maintain (besides the output tiles themselves) and I can already see easy ways to add in map layers and a configurable north.

eminence commented 13 years ago

agrif, i haven't had a chance to read through all of your code, but thanks a lot for doing this. having concrete code to look at and test with is really important when trying out new things.

a couple of thoughts off the top of my head:

i'll take a closer look later this week. thanks again for prototyping this

Fenixin commented 13 years ago

@agrif

Just tested it and it works great. If you want a bigger map to test new stuff I can send you a 39-region world.

Xon commented 13 years ago

@agrif, using your mtime_test.py script, I created ~253k files and then ran it to benchmark it. To remove disk latency this ran off a tmpfs partition.

With 253416 zero-byte files on a 1 GB tmpfs, the results are:

python mtime_test.py .
253416 files in 15.407002 seconds: 16448.105897 files/second
253416 files in 14.797913 seconds: 17125.117489 files/second
253416 files in 15.369352 seconds: 16488.398616 files/second
253416 files in 15.530842 seconds: 16316.951710 files/second

Sure it's a VM, but it's running on a Xeon 5150 2.66 GHz CPU with 4 GB of RAM.

Just altering your benchmarking app to read the entire directory and write filepaths & mtimes into a CSV results in an 11 MB store, and the following benchmark for reading the CSV back: 253412 lines in 0.324746 seconds.
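That alteration presumably looks something like the following sketch (hypothetical code, not Xon's actual modification): dump path/mtime pairs with the csv module, then time how long it takes to read them back.

    import csv, os, sys, time

    def dump_mtimes(root, out_path):
        # walk the tree once and persist every (path, mtime) pair to CSV
        with open(out_path, 'wb') as out:
            writer = csv.writer(out)
            for dirpath, dirnames, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    writer.writerow([path, os.stat(path).st_mtime])

    def load_mtimes(csv_path):
        # read the stored table back and report how long that takes
        start = time.time()
        with open(csv_path, 'rb') as f:
            table = dict((row[0], float(row[1])) for row in csv.reader(f))
        print "%d lines in %f seconds" % (len(table), time.time() - start)
        return table

    if __name__ == '__main__':
        dump_mtimes(sys.argv[1], 'mtimes.csv')
        load_mtimes('mtimes.csv')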

agrif commented 13 years ago

@Xon : There are a few different ideas to check to improve render time, and certainly it's worth looking into using a table of mtimes, especially since my more straightforward solution didn't turn out as well as I'd hoped.

Unfortunately, storing the times on the disk as a CSV doesn't make much sense. We can't tell whether the CSV is still up-to-date without scanning all the region files again, so we might as well just scan them right at the beginning and forget the CSV. There's no getting around it: if you want to know what regions have been updated, you have to scan all the region files.

(of course, the --chunklist or --regionlist options or whatever will bypass this whole mtime business entirely, once those are re-implemented)

I'm a little surprised by your times, though. It's significantly slower than what I got from disk, and I'd have thought it was I/O bound.

@Fenixin : If you could, that would be great, actually. I used to have a large map someone gave me back when I did lighting, but I seem to have lost it : (.

Xon commented 13 years ago

Yeah I was surprised too. I didn't think there was enough memory pressure to cause paging of the tmpfs partition. I'm still trying to figure it out, but whatever it is is causing a noticeable percentage of system time when gmap is running.

TBH, when you removed the caching code you gutted a lot of extra mtime checks.

I've added some MCR caching which gives a nice improvement, and am looking at how to correctly implement a chunk cache to reduce the disk loads; with MCR caching there isn't a new MCR reader created on every chunk load, which greatly reduces the number of IO requests. Once I've figured out git, I'll get these changes into my repos.

Inlining iterate_chunkblocks makes an amazing difference. This is an ideal candidate for rewriting in C or converting to something which will do back-face culling so it doesn't need to touch all 128x16x16 blocks all the time.

Not inlined: png of call graph, profiling file (1255.526 CPU seconds)
Inlined: png of call graph, profiling file (324.535 CPU seconds)

There is still a crazy amount of system CPU time biasing my results, and I'm going to blame Hyper-V & CentOS.

eminence commented 13 years ago

on my machine, using real files:

$ python mtime_test.py /usr/share/doc
51863 files in 69.778156 seconds: 743.255525 files/second
51863 files in 2.237698 seconds: 23176.942639 files/second
51863 files in 2.146634 seconds: 24160.149117 files/second
51863 files in 2.153336 seconds: 24084.954155 files/second
51863 files in 2.095312 seconds: 24751.923803 files/second

note the effect of the system level stat caching

I think that agrif just has a fast machine if he can get 40k per sec :)

i also support the idea of profiling and re-writing bottlenecks in C. I think this is the only way to get fast enough for the really large maps (i mean the ones with 1million chunks or more)

also, those are really nice looking call graphs. what did you use to generate them?

agrif commented 13 years ago

Ah, I see. Maybe I should have tested on more directories than just my home dir.

@Xon : Those numbers are impressive! I look forward to trying out your code.

Fenixin commented 13 years ago

@Xon, I couldn't wait for your code so I tried it myself. With _iterate_chunkblocks inlined the chunk render time goes down about 70% for me (from ~4 minutes to ~1 minute). Great idea!

[edit] I also want to know which program generates these cool call graphs.

[edit2] agh... I did it wrong and the chunks were rendering incomplete... so don't take this seriously, sorry!

Xon commented 13 years ago

agrif, I suspect you just have a fast CPU. The Xeon 5150 is basically just a server version of the C2D E6700.

@Fenixin, @eminence It uses stats generated by cProfile, the gprof2dot.py script to generate the graph, and then dot to render the graph into a png, using the following command line: "python gprof2dot.py gmap.prof -f pstats | dot -Tpng -ogmap.png"

The entire line was; rm -r /var/www/minecraft_map/map2 ; python -m cProfile -o gmap.prof gmap.py -p1 /var/www/minecraft_map/smp1-world2 /var/www/minecraft_map/map2 ; python gprof2dot.py gmap.prof -f pstats | dot -Tpng -ogmap.png

This cleans the generated tiles, invokes gmap.py without any modifications for profiling, then converts the profiling results into a graph and then a png.

/var/www/minecraft_map is a tmpfs partition, and smp1-world2 is just a copy of an utterly tiny world (it's 4 chunks).

I then manually rename the png & profiling file to reflect what I'm benchmarking.

eminence commented 13 years ago

@Xon, have you written any of that region caching code? If not, I was going to start that tonight (if so, I'll wait until you push, as to not duplicate effort).

Also, what exactly did you mean by "inlining iterate_chunkblocks"? I thought that you might have meant using "list(iterate_chunkblocks(xoff,yoff))" but when I just tried this, I didn't see any improvement in performance.

Xon commented 13 years ago

@eminence, I've written a bit. But I need to verify it will work correctly in multi-process mode.

As for inlining, it was a case of taking the contents of iterate_chunkblocks (the 3 for loops) and copying them into chunk.chunk_render, like so:

    for x in xrange(15,-1,-1):
        for y in xrange(16):
            imgx = xoff + x*12 + y*12
            imgy = yoff - x*6 + y*6 + 128*12 + 16*12//2
            for z in xrange(128):
                #yield x,y,z,imgx,imgy
                #for x,y,z,imgx,imgy in iterate_chunkblocks(xoff,yoff):
                # make sure we're drawing within the image boundaries
                ....
                imgy -= 12
It's not remotely elegant or nice, but it does demonstrate how much that bit is slowing stuff down. Btw, making it return a list is even slower (but returning a list generated by C code might work).

I think something like a quadtree might be the way forward, where we track dependencies so every block doesn't need to be scanned (unless they are all translucent blocks). After all, we don't need to render everything under the ground if there is no way to see it.

eminence commented 13 years ago

wow, that really does help a whole lot. i have to admit, i really have no idea why. do you have any insight? i guess it's all the yields?

Xon commented 13 years ago

Function call overhead + the crazy number of calls. It's a min of 32768 calls per call of chunk_render.

But I think it's more noticeable in the direct-to-tile render, as chunk_render is being called way too many times. On the tiny map I created it's been called 5004 times, but it only has 625 chunks. It looks like it's been called for each tile on most of the zoom levels.

agrif commented 13 years ago

In my branch, chunk_render is called once for each chunk visible on the tile, per tile on the lowest zoom. It'll end up being called way more often than before. To prevent doing too much work, I added a check to chunk_render so it skips blocks that are off the edge of the tile. Only a small portion of each chunk is rendered per-call; often, none of the chunk is rendered. The end result is that each block is only rendered once, even if the chunk is found in many tiles.

Even with this skipping, in my current branch, iterate_chunkblocks is still called once for every block, for every chunk, in every tile. This could explain the huge difference :P

Higher zoom levels are handled as they always have been: sticking four lower-zoom images together and scaling down by half.
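For reference, that higher-zoom step is essentially the following (a rough sketch; the 384-pixel tile size and the function name are assumptions, and the real code lives in quadtree.py):

    from PIL import Image

    def compose_zoom_tile(quadrant_paths, tile_size=384):
        # paste the four lower-zoom tiles into one double-size image,
        # then scale it back down by half to get the higher-zoom tile
        out = Image.new("RGBA", (tile_size * 2, tile_size * 2))
        offsets = [(0, 0), (tile_size, 0), (0, tile_size), (tile_size, tile_size)]
        for path, (x, y) in zip(quadrant_paths, offsets):
            if path is not None:          # missing quadrants just stay blank
                out.paste(Image.open(path), (x, y))
        return out.resize((tile_size, tile_size), Image.ANTIALIAS)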

@Xon : would it mess up your git too badly if I did this inlining on my branch, since it's basically universally-agreed as good? It's a quick fix, and I can tell git you're the author.

Xon commented 13 years ago

I think the simplest way would be some sort of dependency graph where a tile has many chunks, and as all the required chunks are rendered, the final tile is rendered.

Keeping the required chunks in memory shouldn't be too expensive, especially if you render the tile as soon as possible.

For an added bonus, you could link tiles requiring tiles. Then changing a single chunk would automatically propagate which tiles need updating without having the multi-pass stages like it currently does.

agrif commented 13 years ago

I'm not entirely clear on your first idea. Do you mean render updated chunks all at once like before, but store their images in memory until they can be rendered into tiles? Or do you mean keeping the ChunkRenderer (and all the level data) in memory, to be reused to render onto many tiles?

We already have a function that returns all the chunks in a given lowest-zoom tile, and this is what's used when rendering that tile.

For tiles, Overviewer already uses the mtimes to decide whether a tile needs updating, based on whether the 4 tiles that make it are newer, so there is dependency tracking. We can't really skip these mtime checks (or some other check, like the old hashes) because Overviewer can be cancelled mid-render. We can't just keep track of what tiles we've updated, because we could lose that list. With how it's coded now, Overviewer can be cancelled at any part of the render and still work correctly when started later.

Xon commented 13 years ago

There is a many-to-many relationship between the chunks and tiles; what I'm talking about is an in-memory structure to track when all of a tile's chunks are rendered so the tile can be rendered.

What I'm suggesting is: build a list of which tiles need to be updated and which chunks they require. Render each chunk and then check to see whether a tile has all its dependencies filled. If so, then the tile can be rendered; otherwise keep waiting. As for the chunk, it needs to be kept around until all the tiles which depend on it have been rendered, and then it can be released.

And of course it has to work in a multi-process environment. You really don't want to re-render a chunk because tile A in process 1 needs it and tile B in process 2 needs it too.
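A minimal sketch of that in-memory tracking (hypothetical class and method names, and ignoring the multi-process question entirely):

    class TileScheduler(object):
        def __init__(self):
            self.pending = {}      # tile -> set of chunk coords still unrendered
            self.dependents = {}   # chunk coord -> set of tiles that need it

        def add_tile(self, tile, chunks):
            self.pending[tile] = set(chunks)
            for c in chunks:
                self.dependents.setdefault(c, set()).add(tile)

        def chunk_rendered(self, chunk):
            # mark a chunk as done; return the tiles whose dependencies are
            # now all filled and can therefore be rendered immediately
            ready = []
            for tile in self.dependents.get(chunk, ()):
                deps = self.pending[tile]
                deps.discard(chunk)
                if not deps:
                    ready.append(tile)
            return ready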

Fenixin commented 13 years ago

@Xon, I have some old code that does what you say: it iterates through the minimum possible blocks and returns a list of only the visible blocks, in render order. The problem with this function is that it's slower than the original iterator (around 40% slower). The returned list to render is only around 2000~3000 blocks, so it's a lot less to iterate over in chunk_render. I've commented and pushed the code just in case it is useful for this; surely in C it would be faster.

Here is the link to the line of iterate chunk blocks, and here is the branch.

Also, good idea to keep chunks in memory.

eminence commented 13 years ago

hey guys, just wanted to report some good success i'm having with an experiment: using agrif's direct-to-tiles branch, i've re-written the x,y,z blockloop in chunk_render as a C module. it's very simple (doesn't do lighting, nor does it handle special blocks), and uses a modified version of agrif's alpha_over code. a -p 1 render of 1 region file goes from about 12 minutes to about 2 minutes.

as soon as i clean up the code and track down some visual inconsistencies i'll publish the code for you all to play with.

agrif commented 13 years ago

@Xon : I looked over the code you've committed so far, and it looks good! I noticed what might be a bug in your inlining: I noted it here.

I'm running longer tests on Fenixin's kindly-donated larger map that have yet to finish, but I'd like to echo that the iterator helps substantially. My tiny map takes ~150 seconds with master, ~110 with the direct-to-tiles changes, and ~95 with the inlining and DTT (averaged over 3 runs).

About your render idea: as it stands right now, even though chunk_render is called many times per chunk, each block in the world is only rendered once, so I don't see the advantage of your method when it comes to rendering improvements. I do see that it would cut down on file reads, though, which is a good thing. But I think file reads could be reduced just as much by caching ChunkRenderers without rendering each chunk whole, which is a great deal simpler to implement.

I may be missing something here, though, and I can't say with any certainty how any of these approaches would affect performance. We'd just have to try them out.

@eminence : fantastic! I can't wait!

Xon commented 13 years ago

I've updated the region caching to also cache chunks. This makes a nice difference with the current code which tends to touch a single chunk a considerable amount. If code does get added to only render a chunk once, this code should be removed. It does also fix nbt.MCRFileReader.get_chunk_timestamp() to actually read from the existing cache since there was no codepath to trigger nbt.MCRFileReader.get_chunks() which has a fastpath for loading all the chunk timestamps & location data.

I've also reduced the number of system calls in the update scan (render_worldtile & render_innertile). Checking if a file exists and getting its modified time both end up calling the syscall "stat" anyway.
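Folding those two checks into one stat() is straightforward; something like this sketch (hypothetical helper name):

    import os

    def tile_mtime(path):
        # os.path.exists() followed by os.path.getmtime() is two stat()
        # syscalls; one os.stat() with the error caught does the same job
        try:
            return os.stat(path).st_mtime
        except OSError:
            return None   # the tile hasn't been rendered yet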

Besides the region/chunk caching changes, these will only be quite small improvements compared to improving the actual rendering.

:edit: @agrif, Yeah the inlining is bugged, but it was just a rough test to see if how blocks were being selected was an issue. Even though each block is only rendered once, the process of iterating over all those blocks is slow.

eminence commented 13 years ago

I've pushed my C-rendering code to the direct-to-tiles branch in my fork: eminence/Minecraft-Overviewer@66a9306a29e2c6a3030cd1c2dadbc3ba157dbecf

This new code should be pretty easy to read. I've tested only on Linux, but it should build on any platform with the right development headers. note the TODO comments in iterate.c

I've seen evidence of a weird image corruption bug whose root cause I don't understand. Seemingly random tiles will be corrupt. Deleting them and re-running Overviewer causes them to be re-generated correctly. I'm interested to hear whether any of you see this.

back to the inlined iterate_chunkblocks stuff -- since this simple change is such an improvement, i'd like to get it merged into the master overviewer branch so that everyone else can use it. does anyone happen to have a tested pull-request ready?

agrif commented 13 years ago

@Xon : fair enough on slow iteration. Just ran a quick test and got 0.002 seconds for an empty (x, y, z) loop with just a continue statement, so for your ~5000 chunk_render call map that's 10 seconds total -- enough to make a difference. Maybe there's a way to restrict the iteration ranges based on image size, xoff, yoff...

@eminence : I have a version that appears correct, though I'll have to rebase it onto brownan/master. If no one else has one ready I can submit a pull request.

Also, the Overviewer directory is becoming cluttered with files... which is a strictly cosmetic problem, I know :P maybe we can look into reorganizing Overviewer into a python package?

Xon commented 13 years ago

@agrif, There are some checks in the inner loop which look like this:

    if imgx >= img.size[0] + 24 or imgx <= -24:
        continue
    if imgy >= img.size[1] + 24 or imgy <= -24:
        continue
This really should be used to calculate the correct x,y iteration range rather than being an explicit continue. Just moving them outside the "for z in xrange(128):" loop instead of inside it is a reasonable performance improvement.
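As a rough sketch of the cheaper half of that (assuming the loop shape from the inlined snippet above; img, xoff and yoff come from the surrounding chunk_render code): the imgx test doesn't depend on z, so it can be hoisted out of the innermost loop, while the imgy test has to stay per-z or be turned into a precomputed z range.

    for x in xrange(15, -1, -1):
        for y in xrange(16):
            imgx = xoff + x*12 + y*12
            # imgx never changes with z: reject the whole column once
            # instead of re-testing it 128 times inside the z loop
            if imgx >= img.size[0] + 24 or imgx <= -24:
                continue
            imgy = yoff - x*6 + y*6 + 128*12 + 16*12//2
            for z in xrange(128):
                # imgy shifts by 12 every iteration, so its test stays here
                if -24 < imgy < img.size[1] + 24:
                    pass   # ... draw the block at (imgx, imgy) ...
                imgy -= 12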

agrif commented 13 years ago

@Xon : Hmm... imgy is changed inside the for z in xrange: block though. Right now you're just checking if the top (or bottom, can't remember) of each column is inside the image. And we don't want to draw a whole column at a time, even if part of it is in the image, because most of the chunk is either above or below the tile image...

I agree that there should be a calculation of the part of the chunk inside the image; this code was mostly a quick change because I couldn't figure out why my renders were taking twice as long as before.

eminence commented 13 years ago

as another data point, here are the rendering times for a 36 region world:

inline iterate_chunkblocks: 229 minutes
C-implemented rendering: 53 minutes

Xon commented 13 years ago

I think I've got the update scanning code going about as fast as it will go. Probably the last remaining issue is that the region cache invalidation code is, honestly, quite horrible.

Weakref is too aggressive and causes endless reloading of chunks, which really slows down tile rendering. I'm thinking of using the region location information encoded in the filename to do geolocation caching (i.e. only keep stuff within some distance of the last loaded chunk), but that isn't ideal.

I think the nbt file is utterly the wrong place for the region/chunk cache code. It needs to be higher up, where it can be aware of when a chunk is no longer needed.

As for the update scanning code, about the only source of performance improvement I can think of without restructuring the update process is to cache the last mtime of updated tiles on the layer below, to reduce the number of syscalls.

Introducing something like --regionlist would help by dramatically reducing the number of tiles to consider for large worlds.

Here is a world of ~71 MB, 30 regions, which is running off a slight variation of the code I have in my github repos. The changes are: black background, jpeg optimization tweaks (save with optimize=1, quality=70), and some templating stuff for MapMarker integration & an online playerlist.

It manages to do update runs every 10 minutes, but I need to integrate extracting biomes in some easy way. It would be nice if the render process just didn't stop when it can't find biome data for a region while biomes are enabled. There needs to be some way to say the biomes have been updated for 'x' regions and re-render those regardless of what the mtime checks say.

agrif commented 13 years ago

For region caching: What about a simple FIFO? The tiles being rendered at any one time are (generally) near each other in the world, and FIFOs are nice and easy to implement. Also, I agree that this probably shouldn't be in with the file reader classes. Those are theoretically usable on their own; caching should be a layer above. Maybe you can put it in the world class? It's not doing much anymore.
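A FIFO region cache could be as simple as this sketch (hypothetical class; open_reader stands in for however MCRFileReader actually gets constructed):

    class RegionCache(object):
        def __init__(self, maxsize=16):
            self.maxsize = maxsize
            self.order = []      # region filenames in insertion order
            self.readers = {}    # region filename -> open reader

        def get(self, regionfile, open_reader):
            # evict the oldest reader once the cache is full, FIFO style
            if regionfile not in self.readers:
                if len(self.order) >= self.maxsize:
                    oldest = self.order.pop(0)
                    del self.readers[oldest]
                self.readers[regionfile] = open_reader(regionfile)
                self.order.append(regionfile)
            return self.readers[regionfile]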

As for biomes, those have always been a bit of a pain just because we'll always need to use an external (java!) program to extract them. If there was a way to read them out of the chunk files, that would be great... too bad. You could check the biome file mtimes, but you'd end up re-rendering an entire region. Maybe we could ask the Biome Extractor author to add an mtime table?

Thank you for going through all this trouble, by the way. Optimization can be tedious work sometimes, but it's really starting to pay off for Overviewer.

Xon commented 13 years ago

That sounds like a good improvement on region caching. I'll move it to the World object and use a FIFO queue. I'll also integrate biome data caching into the region caching.

I've added some code which lets it ignore missing biome info, so at least the render doesn't just halt. Sadly the biome format really isn't friendly for detecting changes. Minecraft doesn't change the biomes for existing chunks, only for new chunks (deleting & recreating a chunk counts) and when Notch introduces new features.

I've also identified some performance improvements for multi-process rendering during updates. The mtime check is so lightweight that spinning it off into another process actually slows down an update scan with just a few changes. I'll need to test, but I think the best way is for each worker process to fetch a few dozen items at a time rather than one at a time.
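One way to do that batching with the standard library is just the chunksize argument on multiprocessing's imap; a sketch with a made-up worker function and placeholder data:

    import os
    from multiprocessing import Pool

    def check_tile(args):
        # hypothetical worker: does this tile need re-rendering?
        tilepath, newest_chunk_mtime = args
        try:
            return newest_chunk_mtime > os.stat(tilepath).st_mtime
        except OSError:
            return True   # missing tiles always need rendering

    if __name__ == '__main__':
        work = [('/tmp/tile_%d.png' % i, 0) for i in xrange(100000)]  # placeholder data
        pool = Pool(processes=4)
        # chunksize hands each worker a batch of items per round-trip, so the
        # cheap stat() checks aren't drowned out by per-item IPC overhead
        stale = sum(1 for needs in pool.imap_unordered(check_tile, work, chunksize=50) if needs)
        pool.close()
        pool.join()
        print "%d tiles need re-rendering" % stale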

As for the work with optimizations, it's a nice change of pace from my daytime coding job.

agrif commented 13 years ago

I've merged eminence's iterate.c code into my dtt-c-render branch, where I've merged it and the composite code into a single compiled extension ("_coverviewer") so there isn't any duplicated code. I also changed the external interface for alpha_over to be easier to use in C (now you don't need to deref the return value), and I overall changed the code formatting I used to be more like eminence's, which is way closer to standard CPython style. Apart from that, though, I haven't really touched the iterate code itself.

I ran some long tests overnight on a bigger map (thanks Fenixin!). I tested by rendering 5 times per version, with an empty cache each time. Here's the averages:

branch            time                %
---------------------------------------
master            4215s (70m 15s)    - 
direct-to-tiles   2390s (39m 50s)   57%
dtt-c-render      1024s (17m  4s)   24%

These don't include any of Xon's changes, though, except for inlined iterate_chunkblocks, and I suspect a lot of this time is system time.

eminence commented 13 years ago

looks great, agrif, thanks for merging and cleaning up. i think the next step is to expand the c-renderer to re-implement all of the features that i ripped out.

this is looking like a really great base for the next-gen version of Overviewer

Xon commented 13 years ago

This weekend I'm planning on merging my changes into a fork of agrif's branch. Region/chunk/biome caching still needs some work, and a --regionlist function needs some debugging.

A lot of the system time would be from extra syscalls which aren't needed. The direct-to-tiles branch you would be testing ends up loading every chunk about a dozen times per tile.

eminence commented 13 years ago

I've spent the past few days learning openCL. this weekend I put together a version of overviewer that uses openCL to do compositing on a GPU. unfortunately, it's about 50% slower than the code that's currently in agrif's dtt-c-render branch (2 minutes versus 4 minutes on my 4 region test world). it may be that the images we're compositing are too small to have the parallelization of a GPU outweigh the openCL overhead. i'll keep poking at it to see if i can get any more speed out of it

Fenixin commented 13 years ago

@eminence, I have no idea about OpenCL, but if you say the images are too small, it comes to my mind that maybe OpenCL is more useful for rendering the tiles of the higher zoom levels (pasting inner tiles onto a new image and resizing it), but I repeat that I have no idea how OpenCL works.

Fenixin commented 13 years ago

An idea for agrif's dtt-c-render branch:

Instead of calling ImageDraw.Draw().line to draw the edge lines (as done here in the old chunk.py) we can make a few non-existent-in-game blocks in textures.py using unused blockids (maybe blockid > 255?) with the lines drawn, and use alpha compositing to draw the line, pasting the texture with only the line. This should be faster than drawing. Do you think it's worth a try? If so, I'll code it.

agrif commented 13 years ago

Fenixin: I'm not sure if that will help that much. Using alpha_over will always iterate over all the pixels in whatever image you're drawing to the tile, even though it will try to skip what it can. Even the simplest line-drawing algorithms will touch way fewer pixels.

I can't find the exact line of source that does the drawing in PIL, but I'd guess their line-drawing function would be faster than compositing a pre-rendered line image. I don't know, though; PIL tends to surprise me with how weird it is sometimes.

eminence commented 13 years ago

i'm inclined to agree with agrif here. i haven't double checked, but i believe the profiling data from xon suggests that line drawing is a pretty small part of the rendering process.

two unrelated thoughts:

1) since the dtt-c-render code is so much faster than the current code, i'd like to get it pushed to a branch in brownan's overviewer. I hope that this wider exposure and testing will help us move this code along. before we do that, is there anything else we want to fix or polish?

2) agrif, can you comment on how --zoom might behave with your direct-to-tile code? i think there's always been some confusion on how this option is supposed to work (myself included)