twoolie / NBT

Python parser/writer for the NBT file format, and its container, the RegionFile.
MIT License

Chunk and Region iterators in WorldFolder #44

Open macfreek opened 12 years ago

macfreek commented 12 years ago

A relatively recent addition to NBT is world.py, with the WorldFolder class. The expected use is for tools that iterate through all chunks without caring about the specific region file.

A common complaint I hear is that NBT is slow. One way to speed things up is to process each region file in a different subprocess and combine the results (a Map-Reduce pattern). The best way to implement this is with a callback function.

E.g.:

def count_blocks(chunk):
    """Given a chunk, return the number of block IDs in this chunk"""
    chunk_block_count = [0]*256   # array of 256 integers, one for each block ID
    for block_id in chunk.get_all_blocks():
        chunk_block_count[block_id] += 1
    return chunk_block_count

def summarize_blocks(chunk_block_counts):
    """Given multiple chunk_block_count arrays, add them together."""
    total_block_count = [0]*256   # array of 256 integers, one for each block ID
    for chunk_block_count in chunk_block_counts:
        for block_id in range(256):
            total_block_count[block_id] += chunk_block_count[block_id]
    return total_block_count

world = WorldFolder(myfolder)
block_count = world.chunk_mapreduce(count_blocks, summarize_blocks)

However, I fear that the term "mapreduce" is not well known to all programmers, and I'm looking for an easier name. Would the following be easier to understand?

world = WorldFolder(myfolder)
chunk_block_counts = world.process_chunks(count_blocks)
block_count = summarize_blocks(chunk_block_counts)

The advantage is that the parallelisation can happen behind the scenes (though the multiprocessing.Pool class already makes it very easy).
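For reference, here is a minimal sketch of how a behind-the-scenes parallel `process_chunks` could look with `multiprocessing.Pool`. The method name mirrors the proposal above but is hypothetical, and the toy `count_evens` callback stands in for something like `count_blocks`:

```python
from multiprocessing import Pool

def count_evens(numbers):
    """Stand-in for a per-chunk callback such as count_blocks."""
    return sum(1 for n in numbers if n % 2 == 0)

def process_chunks(callback, chunks, processes=2):
    """Hypothetical sketch: apply the callback to every chunk in a
    worker pool (the 'map' step) and return the per-chunk results."""
    with Pool(processes=processes) as pool:
        return pool.map(callback, chunks)

if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5, 6], [7, 8]]
    print(process_chunks(count_evens, chunks))  # [1, 2, 1]
```

The caller would then apply the reduce step (e.g. `summarize_blocks`) to the returned list, exactly as in the `process_chunks` example above. Note that `Pool.map` requires the callback to be picklable, i.e. a module-level function.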

The disadvantage is that it adds a third method to the existing get_chunks and iter_chunks methods in the WorldFolder class. In addition, there would probably also need to be a process_nbt and a process_regions next to process_chunks.

In retrospect, the difference between get_chunks (which returns a list) and iter_chunks (which returns an iterator) is so minor (iterators consume less memory, but lists can be cached) that it did not warrant two separate methods.
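To illustrate the list-vs-iterator distinction being discussed, here is a toy sketch (the class and its contents are illustrative, not the actual WorldFolder implementation):

```python
class Folder:
    """Toy stand-in for WorldFolder, showing the two access styles."""

    def __init__(self, items):
        self._items = items
        self._cache = None

    def iter_chunks(self):
        # Generator: yields one chunk at a time, using minimal memory,
        # but each returned iterator can only be consumed once.
        for item in self._items:
            yield item

    def get_chunks(self):
        # List: built once, then cached for repeated random access.
        if self._cache is None:
            self._cache = list(self.iter_chunks())
        return self._cache
```

As the sketch shows, a cached list is trivially derived from the iterator, which is why keeping both as public methods adds little.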

I'm inclined to remove the cached get_chunks (though I liked the name better than iter_chunks).

Any opinions?

stumpylog commented 12 years ago

My opinion would be to remove get_chunks and rename iter_chunks. I think the usual usage is moving through the chunks, without often needing to cache chunks for later access. The multiprocessing/Map-Reduce part is beyond what I know, so I can't really say anything about that.