koverstreet / bcachefs


Rebalance on each reboot #207

Open · nicman23 opened this issue 3 years ago

nicman23 commented 3 years ago

On each reboot the system spends a few hours rebalancing, with constant I/O until it finishes. It is a newly formatted fs, but I did have some power failures (yay, dead UPS). Performance is OK, but 3-4 hours of HDD I/O thrashing per day is a bit concerning.

Not really sure what more you need to debug this, so please tell me.
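For reference, the state of the rebalance thread can be inspected through sysfs while this is happening. A minimal sketch, using the file names that show up later in this thread (they may differ between kernel versions):

```shell
# Show pending rebalance work and the current rate limit for each mounted fs
# (file names as seen on 5.10/5.11-era bcachefs kernels)
for fs in /sys/fs/bcachefs/*; do
    echo "== $fs =="
    cat "$fs/internal/rebalance_work"
    cat "$fs/internal/rebalance_rate"
done
```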

```
External UUID:                  577f448a-026d-4c87-98e8-9afca32fd206
Internal UUID:                  ca184a21-ad20-4785-9643-770ee5622e30
Device index:                   0
Label:
Version:                        13
Oldest version on disk:         13
Created:                        Tue Mar 30 10:18:26 2021
Sequence number:                951
Block_size:                     4.0K
Btree node size:                256.0K
Error action:                   ro
Clean:                          1
Features:                       lz4,atomic_nlink,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Metadata replicas:              2
Data replicas:                  1
Metadata checksum type:         crc32c (1)
Data checksum type:             none (0)
Compression type:               none (0)
Foreground write target:        Group 0 (nvmes)
Background write target:        Group 1 (hdds)
Promote target:                 Group 0 (nvmes)
Metadata target:                none
String hash type:               siphash (2)
32 bit inodes:                  0
GC reserve percentage:          8%
Root reserve percentage:        0%
Devices:                        4 live, 4 total
Sections:                       journal,members,replicas_v0,disk_groups,clean,journal_seq_blacklist
Superblock size:                69240

Members (size 232):
  Device 0:
    UUID:                       2c085c31-b9be-43a2-9356-807c0564f8a5
    Size:                       320.0G
    Bucket size:                256.0K
    First bucket:               0
    Buckets:                    1310724
    Last mount:                 Sun Apr  4 22:08:46 2021
    State:                      rw
    Group:                      nvmes (0)
    Data allowed:               journal,btree,user
    Has data:                   (none)
    Replacement policy:         lru
    Discard:                    0
  Device 1:
    UUID:                       0427fcd5-d672-4273-8676-415a2a26cad3
    Size:                       320.0G
    Bucket size:                256.0K
    First bucket:               0
    Buckets:                    1310724
    Last mount:                 Sun Apr  4 22:08:46 2021
    State:                      rw
    Group:                      nvmes (0)
    Data allowed:               journal,btree,user
    Has data:                   (none)
    Replacement policy:         lru
    Discard:                    0
  Device 2:
    UUID:                       c5787d6c-5aac-4f68-b3ca-ec0d801c9a06
    Size:                       2.7T
    Bucket size:                512.0K
    First bucket:               0
    Buckets:                    5723176
    Last mount:                 Sun Apr  4 22:08:46 2021
    State:                      rw
    Group:                      hdds (1)
    Data allowed:               journal,btree,user
    Has data:                   (none)
    Replacement policy:         lru
    Discard:                    0
  Device 3:
    UUID:                       af44bf70-5b7c-4c7b-bcf6-e097933a5ffb
    Size:                       2.7T
    Bucket size:                512.0K
    First bucket:               0
    Buckets:                    5723176
    Last mount:                 Sun Apr  4 22:08:46 2021
    State:                      rw
    Group:                      hdds (1)
    Data allowed:               journal,btree,user
    Has data:                   (none)
    Replacement policy:         lru
    Discard:                    0
```
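For context, a layout like the one above (NVMe foreground/promote, HDD background, 2x metadata, 1x data) would be created with something along these lines. This is a hedged sketch using bcachefs-tools option names; the exact flags have changed across versions, and the device paths here are placeholders:

```shell
# Hypothetical recreation of this superblock's settings; device paths are examples
bcachefs format \
    --metadata_replicas=2 --data_replicas=1 \
    --data_checksum=none \
    --foreground_target=nvmes \
    --background_target=hdds \
    --promote_target=nvmes \
    --label=nvmes /dev/nvme0n1 --label=nvmes /dev/nvme1n1 \
    --label=hdds  /dev/sda     --label=hdds  /dev/sdb
```

With this arrangement, writes land on the nvmes group first, and the rebalance thread is what later moves data to the hdds group in the background, which is why it is the thread to watch here.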
marcin-github commented 3 years ago

I have the same observation: bch-rebalance only reads and doesn't write.

koverstreet commented 3 years ago

How much is it reading? More than just metadata?

I'll add a stat for "read but didn't write data"

nicman23 commented 3 years ago

On my array it caps out at the read speed of a single HDD.

Also, yes, it only reads for me too.

nicman23 commented 3 years ago

Also, not sure if it's relevant to the issue, but my system has 9G allocated out of 16G total RAM with basically nothing running, and no hugepages.

Miizuuha commented 3 years ago

I see a lot of cache usage in GNOME System Monitor; perhaps that's what you're seeing as well? I know it only happens with bcachefs, but AFAIK everything else takes precedence. It does some weird stuff with my graphics drivers, so I would suspect a possible memory leak or just overuse of kernel memory space.

nicman23 commented 3 years ago

I was not running a DE, only a getty, at that time. It was not cache; it was memory used by kernel processes.
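For anyone trying to reproduce this, page cache and kernel slab usage can be told apart from /proc/meminfo; a generic sketch, nothing bcachefs-specific:

```shell
# Page cache vs. kernel slab usage
grep -E '^(Cached|Slab|SReclaimable|SUnreclaim):' /proc/meminfo

# Top slab caches by size; bcachefs btree node allocations would show up here
slabtop -o -s c | head -n 15
```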

Miizuuha commented 3 years ago

Something I did notice, @nicman23, is that we're both using a multi-device setup in a RAID 0-esque mode with metadata replicas at 2, and that we're both using NVMes and HDDs in their own groups (I'm using erasure_coding, though). I don't have the problem you do with rebalances on boot, and aside from some data loss/corruption from bad resets and fsck (just pressing y to everything), everything else has been stable. Note that my setup is only a few weeks old; I keep going back and forth between running bcachefs as my main filesystem and not, because of some bug. But I think the kernel memory usage/leak issue may be related.

nicman23 commented 3 years ago

Also, I did have data checksums disabled and no data replicas,

so RAID 0 for data and RAID 1 for metadata.
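For completeness, the effective options should also be readable back at runtime from sysfs; this assumes the options directory exists on the kernel in question:

```shell
# Read back replica/checksum settings for a mounted fs
# (assumes /sys/fs/bcachefs/<uuid>/options/ is present on this kernel version)
for opt in data_replicas metadata_replicas data_checksum; do
    printf '%s: ' "$opt"
    cat /sys/fs/bcachefs/*/options/$opt
done
```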

marcin-github commented 3 years ago

> How much is it reading? More than just metadata?
>
> I'll add a stat for "read but didn't write data"

```
Size:               1209119193600
Used:               1108305899520
Online reserved:         1048576

Data type       Required/total  Devices
btree:          1/2             [dm-6 dm-1]                      12942049280
user:           1/1             [dm-6]                           555509760
user:           1/1             [dm-1]                           1094001182720
cached:         1/1             [dm-6]                           85869757952

hdd (device 1):                               dm-1              rw
                                data         buckets      fragmented
  sb:                         135168               1          389120
  journal:                 536870912            1024               0
  btree:                  6471024640           19717      3866361856
  user:                1094001182720         2095147      4459247616
  cached:                          0               0               0
  parity:                          0               0               0
  erasure coded:                   0               0               0
  available:             97550598144          186063
  capacity:            1206885810176         2301952

ssd (device 0):                               dm-6              rw
                                data         buckets      fragmented
  sb:                         135168               1          126976
  journal:                 268435456            1024               0
  btree:                  6471024640           24685               0
  user:                    555509760            2122          759808
  cached:                85869757952          331466      1022065152
  parity:                          0               0               0
  erasure coded:                   0               0               0
  available:            100078190592          381768
  capacity:             107374182400          409600
# iostat -h /dev/dm-6 /dev/dm-1
Linux 5.10.0-00651-gc37f6ec112e9 (x)    16.04.2021      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8,8%    0,0%   13,7%    7,3%    0,0%   70,2%

      tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd Device
  1463,94        25,6M       209,5k         0,0k      47,2G     385,6M       0,0k dm-1
   205,33         5,0M         5,7k       988,8k       9,2G      10,5M       1,8G dm-6
```

I also noticed that "Online reserved" changes constantly, going up to e.g. 3018752 and back down to e.g. 8192.
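A simple way to watch that value over time, assuming the same `bcachefs fs usage` command as above (the mount point here is a placeholder):

```shell
# Sample "Online reserved" every 5 seconds
while sleep 5; do
    printf '%s ' "$(date +%T)"
    bcachefs fs usage /mnt | grep 'Online reserved'
done
```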

clonejo commented 3 years ago

Also observing this on two fresh filesystems: one fs only on an SSD (64GiB), the other on the same SSD (192GiB) plus an HDD (1.5TB). Two kernel threads are running, bch-rebalance/d and bch-rebalance/f, with ~250MB/s and ~70MB/s of reads, respectively. I have not observed any writes. In one instance, the slower rebalance thread finished earlier than the other.
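The per-thread read/write totals can be sampled from /proc; a sketch, assuming the thread names above and that task I/O accounting is enabled in the kernel:

```shell
# Read/write byte counters for the rebalance kernel threads (run as root)
for pid in $(pgrep bch-rebalance); do
    echo "pid $pid ($(cat /proc/$pid/comm)):"
    grep -E '^(read_bytes|write_bytes)' /proc/$pid/io
done
```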

Contents of /sys/fs/bcachefs/*/internal/rebalance_*:

```shell
# for file in /sys/fs/bcachefs/*/internal/rebalance_*; do echo $file; cat $file; echo; done
/sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_enabled
1

/sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_rate
200k

/sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_rate_bytes
204800

/sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_rate_debug
rate:           200k/sec
target:         0
actual:         0
proportional:   0
derivative:     0
change:         0/sec
next io:        -14655776ms

/sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_rate_d_term
30

/sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_rate_p_term_inverse
6000

/sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_work
fullest_dev (-1):       0/58.8G
total work:             160k/58.8G
rate:                   204800
running
pos 6917529027641096294:5347272:U32_MAX

/sys/fs/bcachefs/fed3ec1f-9bdc-44fd-8569-02ab01841e2a/internal/rebalance_enabled
1

/sys/fs/bcachefs/fed3ec1f-9bdc-44fd-8569-02ab01841e2a/internal/rebalance_rate
22.2k

/sys/fs/bcachefs/fed3ec1f-9bdc-44fd-8569-02ab01841e2a/internal/rebalance_rate_bytes
22755

/sys/fs/bcachefs/fed3ec1f-9bdc-44fd-8569-02ab01841e2a/internal/rebalance_rate_debug
rate:           22.2k/sec
target:         0
actual:         0
proportional:   0
derivative:     0
change:         0/sec
next io:        -14655788ms

/sys/fs/bcachefs/fed3ec1f-9bdc-44fd-8569-02ab01841e2a/internal/rebalance_rate_d_term
30

/sys/fs/bcachefs/fed3ec1f-9bdc-44fd-8569-02ab01841e2a/internal/rebalance_rate_p_term_inverse
6000

/sys/fs/bcachefs/fed3ec1f-9bdc-44fd-8569-02ab01841e2a/internal/rebalance_work
fullest_dev (-1):       0/1.4T
total work:             5.0G/1.4T
rate:                   22755
running
pos 4611686018427417644:0:U32_MAX
```

Right now, the larger fs has some slow writing going on, but the smaller fs is pretty much at rest.
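Side note: since `rebalance_enabled` appears there, it can presumably be toggled to pause the thread while debugging; this is an untested assumption on my part:

```shell
# Assumption: writing 0 pauses the rebalance thread, writing 1 resumes it
echo 0 > /sys/fs/bcachefs/eeee380c-a262-4d6d-89fa-4040a38172a2/internal/rebalance_enabled
```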

There might also be a problem with fsck: when I ran it on the larger fs, I think there was no output after the line about the fs being clean, so I cancelled it after a minute or so. (I did not retry with the current kernel and tools build.)

Kernel is ac3ab6a511717db1644ded49a6f417304abba048, 5.11.21-1-bcachefs-git-323068-gac3ab6a51171 on Arch Linux (linux-bcachefs-git from AUR)

Edit: I've found that the rebalance thread can come back without a reboot, though there was a suspend in between.

clonejo commented 3 years ago

Still occurring on 8ea0358a147c09a608a83bff2beba9729525ccb5 and 05ed308a527fd595910dcfe5a67d9cac04490e37.

clonejo commented 3 years ago

Would it be helpful to monitor some values over time, perhaps even some of the values already coming out of sysfs? I am already collecting some bcachefs stats into a Prometheus+Grafana setup, so that might be helpful in troubleshooting.

Edit: I'm now blanket-collecting most numbers in sysfs.
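In case it helps, this is roughly what the collection looks like; a sketch using node_exporter's textfile collector, where the output path and metric names are my own choices:

```shell
#!/bin/sh
# Export a few numeric bcachefs sysfs values as Prometheus metrics for
# node_exporter's textfile collector. Paths and metric names are illustrative.
OUT=/var/lib/node_exporter/textfile/bcachefs.prom
TMP="$OUT.tmp"
{
    for fs in /sys/fs/bcachefs/*; do
        uuid=${fs##*/}
        echo "bcachefs_rebalance_enabled{uuid=\"$uuid\"} $(cat "$fs/internal/rebalance_enabled")"
        echo "bcachefs_rebalance_rate_bytes{uuid=\"$uuid\"} $(cat "$fs/internal/rebalance_rate_bytes")"
    done
} > "$TMP" && mv "$TMP" "$OUT"
```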