janelia-flyem / dvid

Distributed, Versioned, Image-oriented Dataservice
http://dvid.io

Feature request: API endpoint to tell me which regions / locations are written to #155

Open grisaitis opened 8 years ago

grisaitis commented 8 years ago

For a given data instance in a repo, I'd like to have a way of knowing which block locations have been written to. This is useful for many use cases for me, like:

Currently, my client fetches chunks from DVID without knowing whether they contain data (that is, whether a chunk is "null" in database speak). This makes my operations pretty inefficient. I'd like to avoid fetching null blocks, and this feature would let me do so.

ideas for how a solution could look...

I can imagine two behaviors or interfaces to accomplish what I want:

  1. A GET endpoint that, for a given data instance, returns a list of the blocks that are written to / non-null
  2. A GET endpoint that, for a given slice location (size and offset), returns true or false for whether that slice includes any non-null data

Idea 1 could look like...

Request:

GET http://hostname:port/api/node/<uuid>/<data instance name>/blocks

Response:

{
    "BlockSize": [32,32,32],
    "OccupiedBlocks": [[1,0,0], [2,1,3]]
}

where that means the only blocks written to are [32:64, 0:32, 0:32] and [64:96, 32:64, 96:128].
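
The mapping from block coordinates to voxel ranges described above can be sketched in a few lines of Python. This is only an illustration of the proposed response format; `block_extent` is a hypothetical client-side helper, not part of DVID, and it assumes blocks are indexed from the origin with an exclusive upper bound.

```python
def block_extent(block_coord, block_size):
    """Return the (start, stop) voxel range per axis for one block.

    `stop` is exclusive, matching the [32:64, 0:32, 0:32] notation above.
    """
    return [(c * s, (c + 1) * s) for c, s in zip(block_coord, block_size)]


# The example response proposed for Idea 1.
response = {
    "BlockSize": [32, 32, 32],
    "OccupiedBlocks": [[1, 0, 0], [2, 1, 3]],
}

extents = [
    block_extent(block, response["BlockSize"])
    for block in response["OccupiedBlocks"]
]

for block, extent in zip(response["OccupiedBlocks"], extents):
    print(block, "->", extent)
```

A client could then request only these voxel ranges instead of scanning the whole volume.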

Idea 2 could look like...

GET http://hostname:port/api/node/<uuid>/<data instance name>/exists/Lx_Ly_Lz/x0_y0_z0

{"Exists": true}

or

{
    "ExistsPartly": true,
    "ExistsEntirely": false,
    "ExistsFraction": 0.75
}

(here 0.75 would mean that, say, 75% of the slice consists of written-to voxels)
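
As a rough sketch of how the three fields in the second response could be derived, the Python below computes them from a list of occupied blocks. Everything here is an assumption for illustration: `slice_occupancy` is a hypothetical helper, and it treats any written block as fully written, so the fraction is only accurate to block granularity.

```python
from itertools import product


def slice_occupancy(size, offset, block_size, occupied_blocks):
    """Summarize how much of a slice falls inside occupied blocks.

    Assumes a written block counts as entirely written (block granularity).
    """
    occupied = {tuple(b) for b in occupied_blocks}
    lo = list(offset)
    hi = [o + s for o, s in zip(offset, size)]  # exclusive upper corner

    # Block indices that the slice touches along each axis.
    block_ranges = [
        range(l // bs, (h - 1) // bs + 1)
        for l, h, bs in zip(lo, hi, block_size)
    ]

    total = size[0] * size[1] * size[2]
    covered = 0
    for block in product(*block_ranges):
        if block not in occupied:
            continue
        # Volume of the intersection between this block and the slice.
        overlap = 1
        for axis in range(3):
            start = max(lo[axis], block[axis] * block_size[axis])
            stop = min(hi[axis], (block[axis] + 1) * block_size[axis])
            overlap *= stop - start
        covered += overlap

    return {
        "ExistsPartly": covered > 0,
        "ExistsEntirely": covered == total,
        "ExistsFraction": covered / total,
    }


half = slice_occupancy([64, 32, 32], [0, 0, 0], [32, 32, 32], [[1, 0, 0]])
print(half)
```

In this usage example the slice spans two blocks along x and only one of them is occupied, so half of the slice is reported as written.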

I'm not a REST expert :baby: but hopefully my intentions are clear.

DocSavage commented 8 years ago

We'd want something like Idea 1 because the main benefit of this new API endpoint is to not have to constantly ask "are you there" for each block in a vast voxel space. An issue with Idea 1 is how to handle millions or billions of blocks. We could just transmit them all and suffer with potentially massive responses. We could use run-length encoding (RLE) and return a list of [z, y, x0, x1] spans, similar to what we do for sparse volumes. This would probably decrease the payload by an order of magnitude or three. We could also page the results: return N blocks plus a token that can be used in a subsequent request to get the next N blocks, and so on.
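
To make the two size-reduction options concrete, here is a minimal Python sketch of both. It is not DVID's implementation: `encode_spans` and `page_blocks` are hypothetical names, the spans are assumed to have an inclusive `x1`, and the paging token is assumed to be a plain integer index into a stable ordering of the blocks.

```python
def encode_spans(blocks):
    """Run-length encode (x, y, z) block coordinates into [z, y, x0, x1] spans.

    Consecutive blocks along x with the same z and y collapse into one span.
    x1 is inclusive (an assumption made for this sketch).
    """
    spans = []
    for z, y, x in sorted({(z, y, x) for x, y, z in blocks}):
        last = spans[-1] if spans else None
        if last and last[0] == z and last[1] == y and last[3] + 1 == x:
            last[3] = x  # extend the current run
        else:
            spans.append([z, y, x, x])  # start a new run
    return spans


def page_blocks(ordered_blocks, limit, token=0):
    """Return one page of blocks plus the token for the next page (or None)."""
    page = ordered_blocks[token:token + limit]
    end = token + limit
    next_token = end if end < len(ordered_blocks) else None
    return page, next_token


blocks = [[1, 0, 0], [2, 0, 0], [3, 0, 0], [5, 0, 0], [2, 1, 3]]
spans = encode_spans(blocks)
print(spans)

first_page, token = page_blocks(blocks, limit=2)
print(first_page, token)
```

Dense runs along x compress well under the span encoding, while isolated blocks cost one span each; paging bounds the size of any single response regardless of how the blocks are distributed.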

I think initially the RLE approach should be done for the obvious block-oriented data types like uint8blk and labelblk. The labelvol data would be more problematic.

grisaitis commented 8 years ago

Which idea would you say is less complex to implement?

Idea to improve idea 1: take arguments for shape and offset. This would limit the scope of the request - and thus also the size of the response.
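
A sketch of what that scoping could mean, written in Python: only blocks that intersect the requested box are returned. The function name `blocks_in_region` and the argument order are assumptions for illustration, with `size` and `offset` given in voxels as in the URLs above.

```python
def blocks_in_region(occupied_blocks, block_size, size, offset):
    """Keep only the occupied blocks that intersect the box (size, offset)."""
    # Lowest and highest block index touched by the box on each axis.
    lo = [o // bs for o, bs in zip(offset, block_size)]
    hi = [(o + s - 1) // bs for o, s, bs in zip(offset, size, block_size)]
    return [
        list(block)
        for block in occupied_blocks
        if all(l <= c <= h for c, l, h in zip(block, lo, hi))
    ]


occupied = [[1, 0, 0], [2, 1, 3]]
nearby = blocks_in_region(occupied, [32, 32, 32], [64, 64, 64], [0, 0, 0])
print(nearby)
```

With a 64x64x64 box at the origin, only the first of the two example blocks falls inside the request, so the response shrinks accordingly.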

If I could snap my fingers and have either right now, I'd choose interface 2. Because:

On the downside,

Idea 1 pros:

Idea 1 cons:

grisaitis commented 8 years ago

Another thought: Perhaps an ROI could be created for every voxel that's written to, and then one could simply query the mask endpoint of that ROI to find out which voxels in that slice have been written to:

GET <api URL>/node/<UUID>/<ROI name for written-to voxels>/mask/0_1_2/<size>/<offset>