saalfeldlab / render

Render transformed image tiles
GNU General Public License v2.0

Render server resource recommendation #118

Open martinschorb opened 3 years ago

martinschorb commented 3 years ago

Hi,

we are now about to deploy a dedicated machine for render. What would be the recommended resources in terms of CPU/memory for reasonable performance? Do many CPUs help in handling the requests?

I think the most crucial factor will be the network connectivity to both the storage and compute systems, correct?

trautmane commented 3 years ago

Hi Martin,

Glad to hear you plan to set up dedicated resources for render.

As you might guess, everything depends upon your expected usage. Do you have any idea at this point how much data you plan to manage and how many concurrent users/clients you expect to have accessing the data in render?

For data sizing, what matters is how many tile specs you plan to keep actively available at one time (not the source image sizes). Consider that you will likely have multiple tile specs for each image as you iterate through different alignments.

Janelia's peak data level to date is: 865,561,107 tile specs in 739 stacks and 498,295,676 match pair documents in 154 collections

I recently archived a bunch of older data we no longer need to be actively available in preparation for our next big reconstruction effort. We currently have: 174,329,219 tile specs in 403 stacks and 23,183,265 match pair documents in 80 collections

At Janelia, there are really only a few (<5) folks working with render directly at any time. Once the reconstruction alignments are done by specialists, the aligned data is materialized to disk (png, n5, ...) and that data is imported into downstream systems that often support large numbers of users (e.g. tracers). The alignment specialists do however run batch jobs on our cluster that can issue many millions of requests to the render web services. So, I guess the key would be to get an estimate of how many concurrent alignments (of the "average" stack with ? tile specs) you plan to support.

Unless your expected data and concurrent usage is very small, you will want to have a dedicated box for the mongodb database and then additional server(s) - physical and/or virtual machines - to host the render web services.

Re: Do many CPUs help in handling the requests? Yes, more CPUs will help manage more concurrent requests.

Our setup at Janelia has a beefy centralized database server with lots of CPUs and lots of memory, one not-quite-so-large centralized web services host for pure data requests (not rendering), and then several small and medium-sized VMs for server-side rendering and other things. The large scale cluster jobs use clients that do the heavy lifting (e.g. rendering) client-side. This is the key to scalability. Relatively small metadata is fetched from the web services, and then the distributed clients read source pixels from network storage and do the hard work locally.
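
To make that split concrete, here is a minimal Python sketch of the pattern (metadata from the web services, pixel work on the client). The host, owner/project/stack names, and the exact endpoint path are illustrative assumptions modeled on render's REST conventions rather than anything specific to our setup:

```python
# Sketch of the client-side pattern: fetch small tile-spec metadata from the
# render web services, then read and process the source pixels locally.
# Host, owner/project/stack, and endpoint path below are hypothetical.
import requests

BASE_URL = "http://render-host:8080/render-ws/v1"   # hypothetical host
OWNER, PROJECT, STACK, Z = "demo", "demo_project", "v1_acquire", 1.0

def fetch_tile_specs(z):
    """Fetch the JSON tile specs for one z layer (metadata only, typically small)."""
    url = (f"{BASE_URL}/owner/{OWNER}/project/{PROJECT}"
           f"/stack/{STACK}/z/{z}/tile-specs")
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for spec in fetch_tile_specs(Z):
        # The heavy lifting (reading source pixels from network storage,
        # applying transforms, rendering) happens here on the client,
        # typically inside a distributed cluster job.
        image_url = spec["mipmapLevels"]["0"]["imageUrl"]  # usually a file:// path
        print(spec["tileId"], image_url)
```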

Although the web services support server-side rendering for small use cases, that won't scale with a centralized host.

So - I've put a lot of info here without giving you a specific answer. Let's start with some data and usage estimates so we can work towards an appropriately sized system.

trautmane commented 3 years ago

This diagram shows the hardware we've used for the past 4 years. In the interest of full disclosure, our mongodb has moved to slightly different hardware and we never ended up installing a Varnish cache - but otherwise I think this is an accurate depiction.

[Screenshot: diagram of the Janelia render hardware setup]

martinschorb commented 3 years ago

Hi Eric,

thanks for the detailed explanations.

Our use case would be stacks ranging from 10k to 1M tiles each, so already significantly smaller than yours. Parallel alignment tasks will also be at most 2-3. I expect one alignment and export run to last a working day or less, which reduces the parallel load.

We will do the client-side tasks on our cluster or on dedicated machines.

Looking at your numbers and the difference in dataset sizes, I think I might start with a single machine running both the render-WS and the DB. I see that your DB machine has rather large memory. Does the DB fit entirely in memory there?

Also, one goal of my work would be to provide a comprehensive entry package for using the render environment to align fairly generic volume EM data of different kinds. Ideally this could be deployed more or less easily to other institutes/infrastructures that would like to use it as well. To lower the barrier to entry, I would suggest keeping the basic standard setup rather simple. Can you suggest a metric to watch to see whether there is a bottleneck when running many requests? Just the process load on that machine? At some point, we might then decide to split the DB and WS.

What are the differences between the "renderer-data" large VMs and the smaller ones? They all run separate instances of the render WS and connect to the single database, correct? Do you connect to different instances for different client scripts, or are they dedicated to different tasks during a single client call?

Also, I don't quite understand yet what exactly is done server-side.

Thanks!

trautmane commented 3 years ago

Hi Martin,

re: database hardware architecture ...

Yes, the DB machine needs enough disk storage to hold everything in render and enough memory to hold the data (stacks and match collections) being actively accessed. Theoretically, you can reduce database disk storage requirements by creating bson dumps for inactive stack data, moving the dumps to cheaper "offline" storage, and removing the inactive data from the database. This is a bit of a hassle though, so I've only done it a couple times once I knew we were really done with a large project.
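
If it helps, the archiving idea can be scripted along these lines. This is only a rough sketch: the host, database, and collection names are placeholders, and you should confirm how your deployment maps stacks to mongodb collections before dropping anything.

```python
# Rough sketch of "dump inactive stack data to cheaper storage, then remove it".
# All names below are illustrative placeholders, not our actual configuration.
import subprocess

MONGO_HOST = "render-mongodb:27017"     # hypothetical database host
DATABASE = "render"                     # assumption: database used by render
ARCHIVE_DIR = "/archive/render-dumps"   # cheaper "offline" storage location

def dump_collection(collection_name):
    """Write a compressed bson dump of one collection to the archive area."""
    subprocess.run(
        ["mongodump",
         "--host", MONGO_HOST,
         "--db", DATABASE,
         "--collection", collection_name,
         "--gzip",
         "--out", ARCHIVE_DIR],
        check=True,
    )

# After verifying the dump, the inactive collection can be dropped from the
# live database (via the mongo shell or pymongo) to reclaim disk space.
```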

Although it is tempting to co-host the database and the web server, I would not recommend this long-term unless you are managing a fairly small amount of data (in total). I suppose you could start with everything together and separate components later when things grow and break down.

How long do you think you'll need to hold onto the 10k - 1M tile stacks? Are the "one working day" alignments likely to be related to each other (e.g. different parts of the same volume) or are they completely independent projects?

If most/all of your projects are small and independent, then it might make sense to have lots of small (co-located) database + web service deployments where each deployment is dedicated to a project. If instead you need to keep the data around for longer and/or you need to move back and forth between lots of related data, the separate centralized database model is better - this is what we do at Janelia and what is also done at the Allen Institute.

trautmane commented 3 years ago

re: comprehensive entry package

I like this idea and think a Docker containerized deployment is the best way to set this up. Maybe we can revisit render's current all-in-one Docker example to make it more user-friendly and suitable for small-scale use.

To determine whether a particular setup is performing well, the easiest thing to do is run some typical jobs (import, match generation, materialize to disk) and then look at the web services access log (jetty_base/logs/-/-access.log). The log shows the response time for each request, which will typically be in the 0-5 ms range (20-50 ms for larger requests) if the system is running well.
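
A quick way to summarize those response times is a small script like the sketch below. It assumes the response time in milliseconds is the last whitespace-separated field on each access-log line, which may not match your jetty log format, so adjust the parsing accordingly:

```python
# Summarize response times from a jetty access log.  Assumes the response time
# (in ms) is the last field on each line; adapt the parsing to your log format.
import statistics
import sys

def summarize(log_path):
    times = []
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if fields and fields[-1].isdigit():
                times.append(int(fields[-1]))
    if not times:
        print("no response times parsed - check the log format assumption")
        return
    print(f"requests: {len(times)}")
    print(f"median:   {statistics.median(times)} ms")
    print(f"p95:      {sorted(times)[int(0.95 * len(times))]} ms")
    print(f"max:      {max(times)} ms")

if __name__ == "__main__":
    summarize(sys.argv[1])
```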

trautmane commented 3 years ago

re: different sized web services VMs and server-side rendering

Yes, each VM hosts a separate render web services instance and they all connect to one centralized database. I separated the web servers to handle different types of tasks. The small VMs are dedicated to simple data retrieval - no server-side rendering. The larger VMs are used for server-side rendering. Finally, there is a large web services instance on real hardware that provides reliable large-scale data access with no (or only occasional very small) server-side rendering.

The main idea is to separate resource-intensive server-side rendering from pure data retrieval. The data retrieval services need to be always available, while server-side rendering is "extra credit". Separating them onto different hosts means that too many server-side rendering requests won't cause trouble for concurrently running cluster jobs that only need data.

Technically, almost everything can be done server-side (because the same libraries are used) but practically it works best when rendering is done client-side.

martinschorb commented 3 years ago

Hi,

I have a related question regarding the N5 export through hotknife. When the client requests image data from a volume chunk of a transformed stack, where will it get the voxel data from? Is it calculated server-side or will the server provide the location of the original pixel values from the underlying tiles and the transformed data is calculated client-side? If that is server-side, I would try to dynamically set up massive server resources when needed for an export (as cluster job(s)).

trautmane commented 3 years ago

Hi Martin,

The SparkConvertRenderStackToN5 client in the hotknife repository only pulls JSON tile specs from the render web service, and the transformed data is calculated client-side. This is why the system scales up well for large data sets.

Each task running on a worker node uses the JSON tile specs to read 2D pixel data (typically from network storage), transform it in memory, and write it back out to disk in 3D voxel blocks. Technically, the pixel data could be read from another web service (e.g. s3 buckets) if the tile spec source URLs reference services instead of 'file://' paths.
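
As a schematic illustration of that last point (the real client is Java/Spark, so this is only a Python analogue, and the imageio dependency is an assumption), a worker can treat file paths and service URLs uniformly:

```python
# Schematic analogue of how a worker task could read a tile spec's source
# pixels from either a file:// path or an http(s)/s3-style service URL.
import io
from urllib.parse import urlparse
from urllib.request import urlopen

import imageio.v3 as iio   # assumption: imageio is available for reading tiles

def read_tile_pixels(image_url):
    """Load 2D pixel data for one tile, wherever the tile spec's URL points."""
    parsed = urlparse(image_url)
    if parsed.scheme in ("", "file"):
        return iio.imread(parsed.path or image_url)
    # Any http(s) endpoint (e.g. an s3 bucket URL) works the same way here.
    with urlopen(image_url) as response:
        return iio.imread(io.BytesIO(response.read()))

# Each task then applies the tile spec's transforms in memory and writes the
# result into 3D voxel blocks (e.g. n5) on disk.
```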

martinschorb commented 3 years ago

That makes sense. I am surprised that this task seems to be by far the most resource-intensive. Does it make sense to use more nodes with fewer CPUs or fewer nodes with many CPUs? I guess that storage I/O is also a factor... Are the processing requirements strongly dependent on the output chunk size? For performance reasons when reading and transferring the data to S3, we usually choose a chunk size between a few hundred kB and 1 MB. What are your preferences?

trautmane commented 3 years ago

You have to look at each specific use case, but for 3D voxel rendering the key constraint is that you usually have to (repeatedly) load many larger xy 2D tiles to render a smaller 3D block. So, it is likely that the process is I/O bound and not CPU bound. Saalfeld included a tileSize parameter to mitigate the mismatch between source data slices and output blocks.

At Janelia, we typically use fairly small block sizes (128x128x64: 800K-ish each compressed?) for the n5 volumes mostly because it makes viewing fast. No one worries too much about the compute cost to generate the volumes. The guy who designs and maintains our network file systems hates that we create zillions of tiny block files.
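
For reference, a quick back-of-the-envelope check of what those block dimensions mean uncompressed (the 8-bit and 16-bit cases are just illustrative):

```python
# Uncompressed size of a 128x128x64 block; the ~800K figure above is after compression.
block_shape = (128, 128, 64)
voxels = block_shape[0] * block_shape[1] * block_shape[2]   # 1,048,576 voxels

for bytes_per_voxel, label in [(1, "8-bit"), (2, "16-bit")]:
    size_mib = voxels * bytes_per_voxel / 2**20
    print(f"{label}: {size_mib:.1f} MiB uncompressed per block")
# 8-bit: 1.0 MiB uncompressed per block
# 16-bit: 2.0 MiB uncompressed per block
```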

martinschorb commented 3 years ago

Can I trust the memory display in the Spark web UI (still for the N5 export)? It shows very little usage (in the kB range) for "Storage Memory" despite allocating 120 GB per executor (each with 30 cores). Also, "Input" and "Shuffle Read/Write" are zero.

trautmane commented 3 years ago

You can trust it, but it doesn't mean what you think it does.

From https://spark.apache.org/docs/latest/tuning.html#memory-management-overview:

> storage memory refers to that used for caching and propagating internal data across the cluster

The n5 generation spark jobs (and most of the render spark jobs) don't use spark for its traditional map-reduce features where data gets cached and shuffled around the cluster. The n5 and render jobs mostly just use spark for basic parallelization and fault tolerance (task retries), so things like storage memory aren't relevant.
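
A minimal pyspark sketch of that usage pattern (the block list and render_block function are hypothetical stand-ins; the real jobs are Java/Scala Spark clients): tasks are farmed out and retried, nothing is cached or shuffled, so "Storage Memory" and "Shuffle Read/Write" stay near zero in the UI.

```python
# Spark used only for parallelization and fault tolerance: no caching, no shuffle.
from pyspark.sql import SparkSession

def render_block(block_id):
    """Stand-in for the real work: read source tiles, transform, write one n5 block."""
    return None  # the real task writes its output block to disk

if __name__ == "__main__":
    spark = SparkSession.builder.appName("n5-export-sketch").getOrCreate()
    sc = spark.sparkContext

    block_ids = list(range(10_000))   # hypothetical list of output block indices
    # foreach: nothing is collected back to the driver and no shuffle stage is created.
    sc.parallelize(block_ids, numSlices=256).foreach(render_block)

    spark.stop()
```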