janelia-flyem / dvid

Distributed, Versioned, Image-oriented Dataservice
http://dvid.io

Sparsevol-coarse missing #370

Closed DeadpanZiao closed 1 year ago

DeadpanZiao commented 1 year ago

Hi,

While putting large volumes of label data into a labelmap, the DVID server gets killed sometimes (probably due to an out-of-memory error). I checked the labels through neuroglancer and they look good. However, the sparsevol-coarse data are missing, so I can't do any proofreading or generate meshes through NeuTu. Is there a way to fix this?

Many thanks.

DocSavage commented 1 year ago

How are you ingesting the label data into DVID? Are you using /raw or /blocks (the former is much better)? We typically add terabytes of label data without any kind of crash. You might be able to (1) check the server logs to see what caused the crash, e.g., there could be a panic with an error message or other important debugging information, and (2) slow down your POST requests if there are too many outstanding requests for your server's speed and memory.

Is the sparsevol-coarse data missing for just some labels? If so, you can ingest them directly using this endpoint: https://github.com/janelia-flyem/dvid/blob/master/datatype/labelmap/labelmap.go#L1432 The /sparsevol-coarse endpoint essentially returns that label index structure. The "neuclease" Python library has a number of functions that use the DVID API. An example of posting the label indexes that you may be missing: https://github.com/janelia-flyem/neuclease/blob/master/neuclease/dvid/labelmap/_labelindex.py#L158
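
One way to see how widespread the problem is, sketched below with placeholder server/UUID/instance/label values, is to query /sparsevol-coarse for each label of interest and note which requests fail:

import requests

# Placeholders: substitute your own server, UUID, labelmap instance, and label IDs.
server = "http://localhost:8000"
uuid = "abc123"
instance = "segmentation"
labels = [1, 2, 3]

missing = []
for label in labels:
    url = f"{server}/api/node/{uuid}/{instance}/sparsevol-coarse/{label}"
    r = requests.get(url)
    # A non-200 response generally means the label index is absent or incomplete.
    if r.status_code != 200:
        missing.append(label)

print("labels without sparsevol-coarse:", missing)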

DeadpanZiao commented 1 year ago

Thanks for the suggestion!

I am not sure I am ingesting the data correctly. The label data is over 100 GB and it seems I can't put it in one shot (the server log shows the data exceeds the max limit), so I decompose the data and ingest the pieces according to their size and offset. I use multiple threads in Python to send /raw requests in parallel. The first request goes fine, and the server log prints info like 'stored label xxx with x blocks'. However, when the second request comes in, the whole service gets killed and no logs are printed.

Anyway, I'll take a look at neuclease.

DeadpanZiao commented 1 year ago

It seems the problem is that the label index storing procedure takes too long; 10 GB of data can take over a couple of days according to the log. During the process, if another POST request comes in, the service just shuts down quietly and no logs are printed. The labelmap indices left unstored are lost from the database. By the way, the server that DVID is deployed on has 32 CPUs and 128 GB of memory.

DocSavage commented 1 year ago

There's probably something else going on, because 100 GB is a small label volume for us and should be easily handled by a server with those specs. A key thing is to pay attention to your block sizes, i.e., the subvolumes you send with POST /raw, though we mostly use POST /blocks, where the uint64 labels (64x64x64 voxel blocks) are compressed heavily on the client side using our segmentation compression scheme. You definitely don't want to POST overlapping blocks using /raw, but I doubt you are doing that. If you do low-level ingest with POST /blocks you also have to ingest the label indices separately. Since we do a lot of clustered work on our supervoxel and agglomerated label volumes, we do that outside of DVID and just ingest the compressed segmentation blocks + label index structures (one per label) through many POSTs.
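
For example, a quick sanity check (the chunk list here is purely illustrative) that each subvolume you plan to POST is aligned to the 64-voxel block grid, so that different POSTs never write to the same blocks:

BLOCK = 64  # default labelmap block size; check your instance's BlockSize setting

def is_block_aligned(offset, size, block=BLOCK):
    """True if a subvolume with this (x, y, z) offset and size sits on block boundaries."""
    return all(o % block == 0 for o in offset) and all(s % block == 0 for s in size)

# Illustrative chunks: (offset, size) in voxels; the second one is misaligned.
chunks = [((0, 0, 0), (1024, 1024, 512)),
          ((0, 0, 100), (1000, 1024, 512))]
for offset, size in chunks:
    print(offset, size, "OK" if is_block_aligned(offset, size) else "NOT block-aligned")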

I really suggest you check the logs to see if there are errors or panics. I could also just Zoom chat with you to understand your approach, and that would also let you share your screen. Just to double check, your input label volume is some 3d array of uint64?

DocSavage commented 1 year ago

Could you also say how you installed DVID, your OS, etc.? A brief example of the HTTP requests you're making would also be helpful, e.g., what the full POST /raw URLs look like, and verify the data you are sending in the POST body.

DeadpanZiao commented 1 year ago

Thanks for the reply! Yes, I am ingesting a 3d array of uint64. The whole volume is about 8192 x 6144 x 512 and I am slicing it into 1024 x 1024 x 512 blocks and putting them into DVID through a Python script. An example Python function I wrote myself:

import gzip

import numpy as np
import requests


def dvid_put(addr, uuid, name, offset, data, mutate=False):
    """POST a numpy label array to a DVID instance via the /raw endpoint."""
    if not addr.startswith('http'):
        addr = 'http://' + addr
    # DVID URLs expect sizes and offsets in x_y_z order; numpy arrays are (z, y, x).
    if len(data.shape) == 3:
        size = data.shape[2], data.shape[1], data.shape[0]
        concat = 'raw/0_1_2/'
    elif len(data.shape) == 2:
        size = data.shape[1], data.shape[0]
        concat = 'raw/0_1/'
    else:
        raise ValueError('data must be a 2d or 3d array')
    headers = {'Content-Type': 'application/octet-stream'}
    url = (addr + '/api/node/' + str(uuid) + '/' + name + '/' + concat
           + '_'.join(map(str, size)) + '/' + '_'.join(map(str, offset)))
    url += '?compression=gzip'
    if mutate:
        url += '&mutate=true'
    body = gzip.compress(data.astype(np.uint64).tobytes())
    return requests.post(url, data=body, headers=headers, timeout=3000).status_code == 200

A typical instance is:

ip:port/api/node/c0ee/testa82_0707/raw/0_1_2/1024_1024_512/7168_11776_0?compression=gzip&mutate=true
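
That URL comes from a call roughly like this (the array here is just a zero-filled placeholder):

import numpy as np

# Placeholder chunk: a (z, y, x) = (512, 1024, 1024) uint64 array sliced from the full volume.
chunk = np.zeros((512, 1024, 1024), dtype=np.uint64)
dvid_put('ip:port', 'c0ee', 'testa82_0707', (7168, 11776, 0), chunk, mutate=True)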

I don't exactly recall how I installed DVID (probably via the conda command in the official manual); it was over a year ago. It's deployed on a CentOS 7, x86_64 server.

A Zoom chat would be fantastic if we could work this out together. I am not sure when you will be available. Since I am based in China, there may be a time difference; morning or evening may be when we are both at work.

DeadpanZiao commented 1 year ago

It seems I was using too much data in a single POST request in the example above. When I reduce the data in a single POST to 64 x 64 x 64, the label index storing speed becomes faster. However, as I monitor the memory, it seems DVID never releases memory after processing a request. Every time a request comes in, some memory is taken. Eventually all the memory runs out and the service gets killed quietly.

DeadpanZiao commented 1 year ago

Now I crop the raw labels into 64-pixel-thick slabs. Each time I only ingest a specific volume at an offset index z (0, 64, 128, ...). After all the images at that index z are stored, I move on to the next index z. In this case, the memory error doesn't show up any more! Now all the labels are stored.
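
Roughly, the ingestion loop now looks like this, using the dvid_put function above (the slab shapes are placeholders; in practice each slab is read from disk before posting):

import numpy as np

SLAB = 64     # slab thickness in z, matching the 64-voxel block size
CHUNK = 1024  # x/y extent of each POST, also a multiple of 64

for z in range(0, 512, SLAB):
    # Placeholder: in practice this slab is read from the raw label files on disk.
    slab = np.zeros((SLAB, 6144, 8192), dtype=np.uint64)
    # Send requests sequentially and finish one z slab completely before moving on,
    # so the server never has to index concurrent POSTs.
    for y in range(0, slab.shape[1], CHUNK):
        for x in range(0, slab.shape[2], CHUNK):
            chunk = slab[:, y:y + CHUNK, x:x + CHUNK]
            ok = dvid_put('ip:port', 'c0ee', 'testa82_0707', (x, y, z), chunk, mutate=True)
            assert ok, f"POST failed at offset {(x, y, z)}"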

DeadpanZiao commented 1 year ago

However, I get the following error sometimes:

ERROR indexing label 1600145892: unable to store indices for label 1600145892, data testa8_0711: Error on batch commit of Put: IO error: /data/dbs/basholeveldb/2326145.log: Too many open files

DocSavage commented 1 year ago

The last problem is probably due to the default limit on the number of open files. For larger datasets, you need to bump that up. Documentation for increasing max open files in your DVID process is here: https://github.com/janelia-flyem/dvid/wiki/Configuring-DVID#server-tuning-for-big-data-optional-but-recommended

Also, is that "ERROR indexing label" from a POST /indices or just from your ingestion of the block data?

DeadpanZiao commented 1 year ago

Thanks for the reply! It does help. The error is from ingesting the block data.

Really appreciate it.