seung-lab / igneous

Scalable Neuroglancer-compatible Downsampling, Meshing, Skeletonizing, Contrast Normalization, Transfers and more.
GNU General Public License v3.0

Reduce read requests when creating sharded skeletons and meshes #156

Closed: ranlu closed this 11 months ago

ranlu commented 11 months ago

This PR makes two changes to reduce read API calls when creating sharded formats:

  1. Cache spatial_index files. Because every spatial_index file is read by each skeleton/mesh merge task, the total number of reads grows quadratically with the volume (total number of spatial_index files × total number of tasks). Caching them makes the growth linear; see the sketch after this list.
  2. Optionally save the intermediary *.frag files to a separate location specified by frag_path. To create the sharded format, each segment must be read from the frag files at least once, so saving and reading them from a shared NFS volume reduces read requests significantly.
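
A minimal sketch of the caching idea (illustrative, not the PR's actual code), assuming igneous's CloudFiles dependency:

import json
from functools import lru_cache
from cloudfiles import CloudFiles

@lru_cache(maxsize=None)
def get_spatial_index(cloudpath: str, filename: str) -> dict:
    """Fetch and parse one spatial_index file at most once per worker.

    Without caching, every merge task re-downloads every index file,
    so total reads scale as (number of index files) × (number of tasks).
    """
    cf = CloudFiles(cloudpath)
    return json.loads(cf.get(filename))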
william-silversmith commented 11 months ago

Hi Ran,

Thanks for this PR! I think the frag location is a useful adaptation.

Are you familiar with the spatial index database? That would improve performance more than caching the index files. The index is downloaded once and then only the relevant section is accessed quickly.

ranlu commented 11 months ago

I am aware of the database but have never tried it. Setting up a separate database feels like a tiresome requirement. BTW, I think the performance of the file-based spatial_index files could be improved a lot by switching from JSON to a numpy array. I understand you want the spatial_index files to be human readable; have you considered saving two sets of files, one in JSON and the other in numpy?
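
A rough sketch of this numpy suggestion (the record layout here is hypothetical; the PR does not implement this):

import io
import numpy as np

# Hypothetical record layout: one row per object entry in the index file.
index = np.array(
  [(17, 0, 0, 0), (42, 512, 512, 64)],
  dtype=[("label", np.uint64), ("x", np.int32), ("y", np.int32), ("z", np.int32)],
)

buf = io.BytesIO()
np.save(buf, index)  # compact binary twin of the human-readable JSON
binary_index = buf.getvalue()
# np.load(io.BytesIO(binary_index)) restores the array with no JSON decode cost.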

william-silversmith commented 11 months ago

Were you aware there's an igneous command to automatically generate the database for you, and that it's available as both sqlite and mysql? In sqlite form, it's not that different from caching, and it's more efficient.

I didn't pick this format lightly; it was while processing minnie65 that I needed to keep scaling the performance because it was untenable otherwise. While caching does raise the performance ceiling, decoding and reading become the next bottleneck. Switching to numpy would raise that ceiling further, but the even better solution is already implemented: the database uses an index to read only what is necessary.
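
To illustrate the difference (the table and column names below are invented, not igneous's actual schema): with the index in sqlite, a merge task issues an indexed query for only the labels it needs instead of decoding every index file.

import sqlite3

con = sqlite3.connect("spatial_index.db")  # built once, before merging starts
cur = con.execute(
  "SELECT filename FROM index_files WHERE label = ?",  # hypothetical schema
  (12345,),
)
filenames = [row[0] for row in cur]  # only the rows this shard needs
con.close()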

Numpy will probably be great for one process up to a certain size, but as you scale up, I suspect you will run into an IO bottleneck.

Would you mind giving sqlite a try? You run it once before starting shard merging and just pass in the file location.
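
The invocation might look roughly like the following; create_sharded_skeleton_merge_tasks and the sqlite:// URI form are best-effort guesses from this thread, not verified against a release:

import igneous.task_creation as tc

# Point the merge tasks at the prebuilt database instead of
# re-reading the raw spatial_index files for every task.
tasks = tc.create_sharded_skeleton_merge_tasks(
  "gs://bucket/dataset/segmentation",
  spatial_index_db="sqlite://spatial_index.db",  # assumed URI format
)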

ranlu commented 11 months ago

I agree that for very large datasets a centralized database is probably inevitable. But for datasets around the size of FAFB, the file-based approach is convenient and cheap enough after caching. Switching to sqlite for each worker might bring a small speedup, but I doubt it will be significant since the actual merging operation takes a lot longer. Anyway, I guess the bottom line is there is no reason to keep the current behavior of downloading the spatial_index files repeatedly for every task.

ranlu commented 11 months ago

BTW, do you plan to support PostgreSQL? I think it is more popular and better maintained than MySQL these days.

william-silversmith commented 11 months ago

But for datasets around the size of FAFB, the file-based approach is convenient and cheap enough after caching.

I see, I forgot that FAFB is comparably sized to pinky100 at near-isotropic resolution. pinky100 is at about the limit where I consider doing something special to be unnecessary. In this case, caching is a reasonable solution for reducing data usage.

Switching to sqlite for each worker might bring a small speedup, but I doubt it will be significant since the actual merging operation takes a lot longer.

I haven't measured this in a long time so I might be mistaken, but IIRC the download + parsing time is very significant for large datasets, and parsing can take even longer than the download.

BTW, do you plan to support PostgreSQL? I think it is more popular and better maintained than MySQL these days.

It would be possible if there were a compelling reason to add the extra complexity (e.g. you are already running a Postgres instance and want to use it for a second purpose). I just happened to know the MySQL syntax, and on GCP you can spin up a new MySQL instance just as easily as a Postgres one. I need to make a tutorial for how to set it up; it's very straightforward once you know the right options to set.

What platform are you running your process on? I was assuming it was a university cluster and not GCP. Sqlite works very well with no extra setup on common filesystems.

ranlu commented 11 months ago

OK, I may give sqlite a try next time I run something on our local cluster. Any other suggestions or comments about this PR?

william-silversmith commented 11 months ago

I've gotta fix the tests first, will get back to you.

william-silversmith commented 11 months ago

Rebased with fix for tests.

william-silversmith commented 11 months ago

Looks a lot better. Thank you for working all the way through this @ranlu!

ranlu commented 11 months ago

Actually, I missed a piece that passes the cache parameter through to the skeleton tasks (the arguments handed to super().__init__ are what get serialized into the queued task, so cache was being dropped). Can you commit this fix? I can make another PR if you prefer that.

diff --git a/igneous/tasks/skeleton.py b/igneous/tasks/skeleton.py
index 7325b32..e064069 100644
--- a/igneous/tasks/skeleton.py
+++ b/igneous/tasks/skeleton.py
@@ -363,7 +363,7 @@ class ShardedSkeletonMergeTask(RegisteredTask):
   ):
     super(ShardedSkeletonMergeTask, self).__init__(
       cloudpath, shard_no,  
-      dust_threshold, tick_threshold, frag_path, spatial_index_db,
+      dust_threshold, tick_threshold, frag_path, cache, spatial_index_db,
       max_cable_length
     )
     self.progress = False

william-silversmith commented 11 months ago

done