Closed brettkettering closed 7 years ago
We have a rebuild utility implemented and tested that handles the following situations:
The rebuild utility can also scan all of the objects in a component and verify their checksums, rebuilding any that are damaged as it goes. It is unaware of the MarFS metadata, so it cannot identify which MarFS files are damaged. I am not sure why we need to link the damaged objects back to their namespace.
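To make the scan-and-verify pass concrete, here is a minimal sketch in Python. It is not the rebuilder's actual code: the `expected` manifest (relative path to CRC32), the `rebuild` callback, and the use of CRC32 itself are all assumptions for illustration, since this thread does not say how the utility stores or checks its checksums.

```python
import os
import zlib


def crc32_of(path, chunk_size=1 << 20):
    """Return the CRC32 of a file, read in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def scan_component(root, expected, rebuild):
    """Walk `root`, verify each file against `expected`, rebuild mismatches.

    expected: dict of path relative to `root` -> expected CRC32
              (hypothetical manifest, assumed for this sketch)
    rebuild:  callable invoked with the full path of each damaged object
    Returns the sorted list of relative paths that were found damaged.
    """
    damaged = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if rel not in expected:
                continue  # no recorded checksum, nothing to verify against
            if crc32_of(full) != expected[rel]:
                damaged.append(rel)
                rebuild(full)
    return sorted(damaged)
```

Note that the sketch, like the utility described above, only knows about objects on disk. Mapping a damaged object back to a file in a namespace would need the metadata, which this pass never touches.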
f8a10b36 adds a workaround for the NFS root squash problems (NFS exports with no_root_squash give permission denied for programs with real uid 0 and a non-zero effective uid). With this fix, the scatter directories should be owned by a storage admin user, and the rebuilder can be invoked with -u storageadmin to de-escalate and run the rebuild as a user that has read permission in the scatter dirs.
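For readers unfamiliar with de-escalation, the sketch below shows the general pattern in Python: change groups first, then the uid, so that real and effective ids end up matching. It is an illustration only and not the code in f8a10b36. The function name `drop_privileges` and its injectable parameters are hypothetical; they exist so the ordering can be exercised without root.

```python
import os
import pwd


def drop_privileges(username, getpwnam=pwd.getpwnam,
                    setgroups=os.setgroups, setgid=os.setgid,
                    setuid=os.setuid):
    """Permanently switch the process to `username`.

    Order matters: supplementary groups and the gid must be changed while
    the process is still root, because after setuid() it no longer has the
    right to change them.
    """
    entry = getpwnam(username)   # raises KeyError for an unknown user
    setgroups([entry.pw_gid])    # drop root's supplementary groups
    setgid(entry.pw_gid)         # sets real, effective and saved gid as root
    setuid(entry.pw_uid)         # sets real, effective and saved uid as root
    return entry.pw_uid, entry.pw_gid
```

When run as root, calling `drop_privileges("storageadmin")` before opening anything under the scatter directories leaves the process with a single consistent identity, avoiding the real-uid-0, non-zero-effective-uid combination described above.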
With this change the rebuilder is functionally complete. We should discuss performance targets and whether they have been met before closing the issue.
For a single node doing a rebuild with 16 threads, we see 1 object per second, which corresponds to approximately 1.2 GB/s read from all zpools. This is adequate performance for now.
When a capability unit fails, we need to figure out what files in what namespaces are affected and reconstruct the missing parts.
Garret finished adding support to the library, and Will has a new version that will rebuild all objects affected by a given component failure (the admin will have to specify which component failed). It now outputs failure statistics from the log-based rebuilds to tell admins where a lot of degraded objects were found. It is also multi-threaded, so we should (hopefully) see decent performance with objects spread across multiple file systems/servers/JBODs.
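As a rough illustration of a multi-threaded rebuild that reports where degraded objects were found, here is a Python sketch using a thread pool. It is an assumption-laden model and not Will's implementation: `rebuild_all`, the `(location, object_id)` pairs, and the `rebuild_one` callback are hypothetical names, and the default of 16 threads is only an illustrative choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def rebuild_all(objects, rebuild_one, threads=16):
    """Rebuild every object using a pool of worker threads.

    objects:     iterable of (location, object_id) pairs; `location` is a
                 hypothetical label such as a server or file system name
    rebuild_one: callable(location, object_id) -> True if the object was
                 degraded and had to be repaired, False if it was intact
    Returns a Counter mapping location -> number of degraded objects found.
    """
    objects = list(objects)
    stats = Counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in input order, so they can be paired back
        # with their locations; the Counter is only updated in this thread.
        results = pool.map(lambda item: rebuild_one(*item), objects)
        for (location, _object_id), was_degraded in zip(objects, results):
            if was_degraded:
                stats[location] += 1
    return stats
```

Calling `stats.most_common()` on the result lists the locations with the most degraded objects first, which is the kind of summary an admin could use to spot a failing server or enclosure.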
We may need a recursive checksum checker too. Check with the admin folks to see what else might be needed to ensure data integrity and reconstruct in the case of failure.