janelia-flyem / dvid

Distributed, Versioned, Image-oriented Dataservice
http://dvid.io

DVID continuously restarts due to missing data in the database #393

Closed karthik-vm closed 3 months ago

karthik-vm commented 5 months ago

I am new to DVID and have set it up for my institution to store and version image data. It was working fine until yesterday, but today it is constantly restarting with the error below:

ERROR Could not load repo-wide max label for instance "segmentation". Only got 0 bytes, not 64-bit label.
ERROR Using max label across versions: 10000000000
ERROR Could not load repo-wide next label for instance "segmentation". No next label override of max label.

The full log is here: github-recovery-codes.txt

I also learned that a user tried to load segmentation data into DVID yesterday, and the issue seems to have started after that. I would really appreciate the following:

  1. Any pointers on what might have caused this and how to fix it.
  2. Is there a way to checkpoint my database and restore it to a previous date?
  3. Other than this GitHub repo, is there a forum to discuss DVID issues and get support?
DocSavage commented 5 months ago

The link above to the full log is a bad link to recovery codes so I deleted it.

I'll start with (3) first: GitHub issues are preferred because the issues and their resolution (or not) will be helpful to others. You could also email me directly, particularly if you prefer not to publicly reveal certain information about your setup. I believe you have my Janelia email address.

(1) I'd need a lot more information to diagnose the cause. The error messages are unusual and suggest that either the DVID configuration is bad or incorrect metadata was written.

The info I would need:

(2) In DVID, you can commit the data at any time. This essentially freezes the state so you can refer back to the state of the data at the given UUID. You can commit using the POST /api/node/{uuid}/commit endpoint, and there are also endpoints for branch and newversion. Just visit your-server/api/help while DVID is running.
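As a sketch, checkpointing from the command line might look like the following. The server address and UUID are placeholders, and the JSON body field is my assumption; verify the exact request format against your server's /api/help page.

```shell
# Placeholder server address and node UUID -- substitute your own values.
DVID_SERVER="http://localhost:8000"
UUID="a1b2c3"

# Commit (lock) the node so its state is frozen at this UUID.
# The "note" body field is an assumption; check /api/help on your server.
curl -X POST "$DVID_SERVER/api/node/$UUID/commit" \
     -d '{"note": "checkpoint before bulk segmentation load"}'

# Open a new child version to continue writing after the commit.
curl -X POST "$DVID_SERVER/api/node/$UUID/newversion"
```

Once committed, the node is immutable, so later writes go to the new child version while the committed UUID remains a stable restore point.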

Depending on which embedded database you chose, you could use dvid-backup or just rsync directly on the database directory. If you are using the current default Badger backend, you can do online backups, so you never have to shut down DVID. This works well for us, and it's easy to make this a cron job so you can do incremental updates to some archival disk storage. Here's an example of a script that copies the mutation logs (both JSON and protobuf) and the separate databases we use for metadata, labelmaps, meshes, and then everything else:

#!/bin/bash
# Incremental online backup of DVID mutation logs and Badger databases.
# Should be done on the older backup before copying, just to minimize
# the rsync times.
# See: https://dgraph.io/docs/badger/get-started/#database-backup

# Enable history expansion so `!!` re-runs the preceding rsync. Each
# `while !! | grep ...` loop repeats the rsync until no MANIFEST or
# .sst files change, giving a consistent copy while DVID keeps running.
set -o history
set -o histexpand

rsync -av --delete /data1/mutlogs/aso-json/ /data2/backups/aso/mutlogs-json
rsync -av --delete /data1/mutlogs/aso/ /data2/backups/aso/mutlogs

rsync -av --delete /data1/dbs/aso/metadata /data2/backups/aso
while !! | grep -qE "(MANIFEST|\.sst)$"; do :; done

rsync -av --delete /data1/dbs/aso/default /data2/backups/aso
while !! | grep -qE "(MANIFEST|\.sst)$"; do :; done

rsync -av --delete /data1/dbs/aso/labelmaps /data2/backups/aso
while !! | grep -qE "(MANIFEST|\.sst)$"; do :; done

rsync -av --delete /data1/dbs/aso/meshes /data2/backups/aso
while !! | grep -qE "(MANIFEST|\.sst)$"; do :; done

You can configure DVID to use separate databases for different classes of data types, or even assign a particular data instance to its own database. This has the advantage of letting you restore just portions of your dataset from previously backed-up copies even if you haven't made a commit.
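As a rough sketch, a DVID TOML config along these lines can route data to separate stores. The store names, paths, and instance assignment shown here are all illustrative assumptions, and the exact section and key names may differ across DVID versions, so check the sample config shipped with your DVID release.

```toml
# Illustrative only -- names, paths, and keys are assumptions; verify
# against the sample config for your DVID version.
[store.maindb]
engine = "badger"
path = "/data1/dbs/aso/default"

[store.labelsdb]
engine = "badger"
path = "/data1/dbs/aso/labelmaps"

[storage]
metadata = "maindb"
default = "maindb"

# Route a specific data instance ("name:UUID") to its own store.
[storage.data]
"segmentation:a1b2c3" = "labelsdb"
```

With each class of data in its own store, the backup script above can sync and restore them independently.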

DocSavage commented 3 months ago

Closed after direct communication with Karthik.