
Incorrect calculation for maxVolumes when preAllocate is used #5778

Open danfoster opened 2 months ago

danfoster commented 2 months ago

Describe the bug

When using maxVol=0 for automatic max-volume detection along with preAllocate, the volume service can incorrectly calculate maxVolumes.

In the following example, a simple setup reproduces this issue:

  1. One volume service with a single 32GB disk
  2. volumeSizeLimit set to 500MB
  3. maxVol set to 0
  4. preAllocate enabled

Before any volumes exist, it correctly reports maxVol=63:

I0714 15:15:34.071764 store.go:600 disk /disks/disk1 max 63 unclaimedSpace:32490MB, unused:0MB volumeSizeLimit:500MB

Then if I create 10 volumes using:

curl "http://localhost:9333/vol/grow?count=10&replication=000"

I see that my maxVolume has decreased by 10:

I0714 15:45:39.071605 disk_location.go:430 Volume stats for 4: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071615 disk_location.go:430 Volume stats for 7: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071621 disk_location.go:430 Volume stats for 2: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071629 disk_location.go:430 Volume stats for 3: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071635 disk_location.go:430 Volume stats for 6: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071641 disk_location.go:430 Volume stats for 8: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071647 disk_location.go:430 Volume stats for 9: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071653 disk_location.go:430 Volume stats for 10: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071659 disk_location.go:430 Volume stats for 1: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071664 disk_location.go:430 Volume stats for 5: volumeSizeLimit=524288000, datSize=8 idxSize=0 unused=524287992
I0714 15:45:39.071670 store.go:600 disk /disks/disk1 max 53 unclaimedSpace:22490MB, unused:4999MB volumeSizeLimit:500MB

Expected behaviour

Max volumes should stay at their original value.

Additional context

This happens because unclaimedSpaces is calculated as freeSpace minus the unused space summed over all volumes, which is only correct for volumes that are not preallocated: https://github.com/seaweedfs/seaweedfs/blob/b62f7c512267cfe379100fa283bbe4b0682e5dc9/weed/storage/store.go#L592

For a preallocated volume, fallocate has already removed those unused bytes from the disk's free space, so they end up subtracted twice. We need a solution that can handle a mixture of preallocated and non-preallocated volumes.
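Paraphrased, the current logic looks roughly like this (a sketch, not the verbatim source; volumesOnDisk is an illustrative name). Plugging in the logged figures reproduces both results: 32490/500 - 1 = 63 before growing, and 10 existing volumes + 22490/500 - 1 = 53 after:

// Sketch of the existing auto-max calculation in weed/storage/store.go.
// unusedSpace sums volumeSizeLimit - (datSize + idxSize) over every volume,
// whether or not it was preallocated.
unclaimedSpaces := int64(diskStatus.Free) - int64(unusedSpace)
maxVolumeCount := int32(volumesOnDisk)
if unclaimedSpaces > int64(volumeSizeLimit) {
    maxVolumeCount += int32(unclaimedSpaces/int64(volumeSizeLimit)) - 1
}
// A preallocated volume's unused bytes were already removed from
// diskStatus.Free by fallocate, so they are effectively subtracted twice.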

I believe we need a way of reporting the true "on-disk" size of a volume, so we can check whether it is smaller than the volume size limit and decide whether its unused space still needs to be subtracted from freeSpace.
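A hypothetical sketch of that adjustment (onDiskSize is the new per-volume value being proposed here; all other names are illustrative):

// Only count a volume's unused space against freeSpace to the extent the
// bytes are not already allocated on disk.
var unusedSpace uint64
for _, v := range volumesOnDisk {
    claimed := v.datSize + v.idxSize
    if v.onDiskSize > claimed {
        claimed = v.onDiskSize // preallocated blocks are already excluded from Free
    }
    if volumeSizeLimit > claimed {
        unusedSpace += volumeSizeLimit - claimed
    }
}
unclaimedSpaces := int64(diskStatus.Free) - int64(unusedSpace)

With this, a fully preallocated volume contributes zero unused space, a non-preallocated volume contributes volumeSizeLimit minus its used space, and the two can be mixed freely.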

I see two ways we might go about reporting the on-disk size of a volume:

1) Don't set FALLOC_FL_KEEP_SIZE on the fallocate syscall. This makes stat.Size() return the on-disk size, which is not necessarily the amount of space used by weed data, so we would need another way to determine the space actually used by weed data. It has the nice side effect of making it more obvious where disk space is being used when looking at the data files outside of SeaweedFS.
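For reference, this is the distinction option 1 turns on. A minimal Linux sketch using golang.org/x/sys/unix; SeaweedFS's actual preallocation helper may be structured differently:

package prealloc

import (
    "os"

    "golang.org/x/sys/unix"
)

// preallocate reserves size bytes for f.
// keepSize=true (FALLOC_FL_KEEP_SIZE): blocks are allocated but the file
// length is unchanged, so stat.Size() reports only the bytes written.
// keepSize=false: the file length is extended to size, so stat.Size()
// reports the full preallocated on-disk size, as option 1 proposes.
func preallocate(f *os.File, size int64, keepSize bool) error {
    var mode uint32
    if keepSize {
        mode = unix.FALLOC_FL_KEEP_SIZE
    }
    return unix.Fallocate(int(f.Fd()), mode, 0, size)
}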

2) Somehow calculate the on-disk size of the volume and store it (e.g. the equivalent of "du -h 1.dat"). My worry here is how expensive that operation would be over a large number of volumes if we have to repeat it. Maybe it's possible to record this once at volume loading time, and then update it only when datSize grows beyond it?
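On the cost concern: the du figure comes from a single stat(2) call via st_blocks, so no scan of the file is needed. A Linux-specific sketch:

package diskusage

import (
    "os"
    "syscall"
)

// onDiskSize returns the allocated size of path in bytes, the same figure
// du reports. st_blocks is always counted in 512-byte units, independent of
// the filesystem block size.
func onDiskSize(path string) (int64, error) {
    fi, err := os.Stat(path)
    if err != nil {
        return 0, err
    }
    if st, ok := fi.Sys().(*syscall.Stat_t); ok {
        return st.Blocks * 512, nil
    }
    return fi.Size(), nil // fallback where Stat_t is unavailable
}

Since this is one stat per volume, recording it at load time and refreshing it only when datSize overtakes it (as suggested above) should keep the cost negligible even with many volumes.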

I started looking at option 1 and quickly realised that a number of components rely on knowing the weed data size per volume. I therefore decided to pause and start this conversation to determine the correct approach before continuing: https://github.com/seaweedfs/seaweedfs/compare/master...danfoster:seaweedfs:maxVol_preAllocate_fallocate

chrislusf commented 2 months ago

1 sounds good. What are the other places that need the data size? Possible to limit the impact scope to just the max volume count calculation?