Feature request: API endpoint to tell me which regions / locations are written to

grisaitis commented 8 years ago

For a given data instance in a repo, I'd like to have a way of knowing which block locations have been written to. This is useful for many use cases for me, like:

migrating data from one database to another
random sampling from a data instance

Currently, my client fetches chunks from dvid without knowing whether they contain data (that is, whether a chunk is "null" in database speak). This makes my operations pretty inefficient. I'd like to avoid fetching null blocks, and this feature would let me do so.

ideas for how a solution could look...

I can imagine two behaviors or interfaces to accomplish what I want:

1. A GET endpoint that, for a given data instance, returns a list of the blocks that are written to / non-null
2. A GET endpoint that, for a given slice location (size and offset), returns true or false for whether that slice includes any non-null data

Idea 1 could look like...

Request:

GET http://hostname:port/api/node/<uuid>/<data instance name>/blocks

Response:

{
    "BlockSize": [32,32,32],
    "OccupiedBlocks": [[1,0,0], [2,1,3]]
}

where that means the only blocks written to are [32:64, 0:32, 0:32] and [64:96, 32:64, 96:128].
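
To make the mapping from block coordinates to voxel ranges concrete, here is a minimal Python sketch. It assumes the hypothetical response shape proposed above (the "BlockSize" and "OccupiedBlocks" keys are part of this proposal, not an existing DVID API), and the helper name is made up for illustration.

```python
from typing import List, Sequence, Tuple


def block_to_voxel_ranges(
    block: Sequence[int], block_size: Sequence[int]
) -> List[Tuple[int, int]]:
    """Return one half-open (start, stop) voxel range per axis for a block coordinate."""
    return [(b * s, (b + 1) * s) for b, s in zip(block, block_size)]


# Hypothetical response body, copied from the proposal above.
response = {"BlockSize": [32, 32, 32], "OccupiedBlocks": [[1, 0, 0], [2, 1, 3]]}

for block in response["OccupiedBlocks"]:
    print(block, block_to_voxel_ranges(block, response["BlockSize"]))
```

With a block size of 32, block [1,0,0] covers voxels 32:64 on the first axis and 0:32 on the other two, matching the ranges quoted above.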

Idea 2 could look like...

Request:

GET http://hostname:port/api/node/<uuid>/<data instance name>/exists/Lx_Ly_Lz/x0_y0_z0

Response:

{"Exists": true}

or

{
    "ExistsPartly": true,
    "ExistsEntirely": false,
    "ExistsFraction": 0.75  # if, say, 75% of the slice includes written-to voxels
}
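
As a rough illustration of how such a response could be derived, the sketch below computes the three fields from a set of occupied blocks. It is only an approximation under a stated assumption: every voxel inside an occupied block is treated as written, so the fraction is block-granular rather than truly per-voxel. The function name and argument layout are hypothetical.

```python
from math import prod
from typing import Dict, Iterable, Sequence, Union


def exists_summary(
    occupied: Iterable[Sequence[int]],
    block_size: Sequence[int],
    size: Sequence[int],
    offset: Sequence[int],
) -> Dict[str, Union[bool, float]]:
    """Summarize how much of the box (offset, size) lies inside occupied blocks.

    Assumption: a block counts as fully written if it is occupied at all.
    """
    total = prod(size)  # number of voxels in the queried box
    covered = 0
    for block in set(map(tuple, occupied)):
        overlap = 1
        for c, b, o, s in zip(block, block_size, offset, size):
            lo = max(c * b, o)            # start of the intersection on this axis
            hi = min((c + 1) * b, o + s)  # end of the intersection on this axis
            overlap *= max(0, hi - lo)
        covered += overlap
    return {
        "ExistsPartly": covered > 0,
        "ExistsEntirely": covered == total,
        "ExistsFraction": covered / total,
    }


summary = exists_summary([[1, 0, 0], [2, 1, 3]], [32, 32, 32], [64, 32, 32], [0, 0, 0])
print(summary)
```

A server with access to the stored voxels could compute an exact fraction instead; the block-level version is just the cheapest thing that fits the proposed response.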

I'm not a REST expert :baby: but hopefully my intentions are clear.

DocSavage commented 8 years ago

We'd want something like Idea 1 because the main benefit of this new API endpoint is to not have to constantly ask "are you there" for each block in a vast voxel space. An issue with idea 1 is how to handle millions or billions of blocks. We could just transmit them all and suffer with potentially massive responses. We could do RLE encoding and return a list of [z, y, x0, x1] spans similar to what we do for sparse volumes. This would decrease the payload probably by an order of magnitude or three. We could also do a paging return where we return N blocks and a token that can be used for a subsequent request that returns the next N blocks, etc.
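
A minimal Python sketch of the run-length idea, to show why it shrinks the payload: consecutive blocks along x in the same (z, y) row collapse into a single [z, y, x0, x1] span. Treating x1 as inclusive is an assumption made here for illustration, as is the function name; this is not taken from DVID's actual sparse volume encoding.

```python
from typing import Iterable, List, Tuple


def blocks_to_spans(blocks: Iterable[Tuple[int, int, int]]) -> List[List[int]]:
    """Collapse (x, y, z) block coordinates into [z, y, x0, x1] spans (x1 inclusive)."""
    spans: List[List[int]] = []
    # Sort by z, then y, then x so that blocks in the same row are adjacent.
    for x, y, z in sorted(set(blocks), key=lambda b: (b[2], b[1], b[0])):
        last = spans[-1] if spans else None
        if last and last[0] == z and last[1] == y and last[3] + 1 == x:
            last[3] = x  # extend the current run by one block
        else:
            spans.append([z, y, x, x])  # start a new run
    return spans


blocks = [(1, 0, 0), (2, 0, 0), (3, 0, 0), (5, 0, 0), (2, 1, 3)]
print(blocks_to_spans(blocks))
```

Here five blocks become three spans; for densely written volumes with long runs along x, the reduction would be far larger.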

I think initially the RLE approach should be done for the obvious block-oriented data types like uint8blk and labelblk. The labelvol data would be more problematic.

grisaitis commented 8 years ago

Which idea would you say is less complex to implement?

Idea to improve idea 1: take arguments for shape and offset. This would limit the scope of the request - and thus also the size of the response.
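
A small Python sketch of how shape and offset arguments could narrow the result: convert the voxel box into a per-axis range of block indices, then keep only occupied blocks inside that range. The function names are hypothetical, and the sketch assumes the size is positive on every axis.

```python
from typing import List, Sequence, Tuple


def block_index_range(
    offset: Sequence[int], size: Sequence[int], block_size: Sequence[int]
) -> List[Tuple[int, int]]:
    """Per-axis half-open range of block indices that overlap the voxel box."""
    return [
        (o // b, (o + s - 1) // b + 1)
        for o, s, b in zip(offset, size, block_size)
    ]


def filter_blocks(
    occupied: Sequence[Sequence[int]],
    offset: Sequence[int],
    size: Sequence[int],
    block_size: Sequence[int],
) -> List[Sequence[int]]:
    """Keep only the occupied blocks that overlap the requested voxel box."""
    ranges = block_index_range(offset, size, block_size)
    return [
        blk
        for blk in occupied
        if all(lo <= c < hi for c, (lo, hi) in zip(blk, ranges))
    ]


occupied = [[1, 0, 0], [2, 1, 3]]
print(filter_blocks(occupied, offset=[0, 0, 0], size=[64, 32, 32], block_size=[32, 32, 32]))
```

Only the blocks overlapping the requested box are returned, so the response size scales with the query region rather than with the whole data instance.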

If I could snap my fingers and have either right now, I'd choose interface 2. Because:

small response size, as you said
it's a fundamental database operation - is a data point null or not?
it's (I assume) a cheap operation.
interface is intuitive / consistent to me - similar to GETs for "raw" or "isotropic"
I don't want my client worrying about blocks. From my perspective, blocks are an implementation detail in how the database does its job. I want the database to handle any block-related complexity for me. :)

On the downside,

yup, more requests!! 📬

Idea 1 pros:

as you said, get everything in one request, or batch of requests if it's paginated

Idea 1 cons:

big response (But this could be addressed with query arguments for shape and offset.)

grisaitis commented 8 years ago

Another thought: Perhaps an ROI could be created for every voxel that's written to, and then one could simply query the mask endpoint of that ROI to find out which voxels in that slice have been written to:

GET <api URL>/node/<UUID>/<ROI name for written-to voxels>/mask/0_1_2/<size>/<offset>

janelia-flyem / dvid

Feature request: API endpoint to tell me which regions / locations are written to #155
