gluster / glusterfs

Gluster Filesystem: Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Rsync fails with 'File Exists' error #1911

Closed Naranderan closed 3 years ago

Naranderan commented 3 years ago

Description of problem:

  1. Our volume holds millions of small files, and most (~95%) of them have hard links. To extend the volume, a force rebalance was started.
  2. During the rebalance, the rsync process that writes to the FUSE client began failing with a 'File exists' error, and the failures still persist today even though the rebalance has completed.
  3. Our analysis found stale 'linkto' files that should not be present.

Note: The steps to reproduce the issue are added in the comments.
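
For context on the failing operation (a hedged illustration, since the exact rsync invocation is not included in the report): rsync writes each file to a dot-prefixed temporary name such as .current_year_bkp.ij2fOq and then renames it over the destination, and it is that rename which now hits the stale linkto file. A representative run against the FUSE mount might look like:

```sh
# Hypothetical rsync into the glusterfs FUSE mount; paths and options are
# placeholders (-a preserves metadata, -H preserves the hard links that make
# up ~95% of this data set). rsync creates ".<name>.XXXXXX" temp files and
# rename()s them onto the target, which is the fop that fails with EEXIST.
rsync -aH /backup/current_year/ /mnt/prod-vol/workspace/docs/
```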

Our analysis:

  * prod-vol-rep3 - .current_year_bkp (data file)
  * prod-vol-rep0 - .current_year_bkp (stale 'linkto' file) and .current_year_bkp.ij2fOq (created by rsync)

In the above setup, the rename fop (.current_year_bkp.ij2fOq -> .current_year_bkp) proceeds as below:

  * Step 1: On rep3 - create the 'linkto' file for .current_year_bkp.ij2fOq - successful.
  * Step 2: On rep0 - 'ln .current_year_bkp.ij2fOq .current_year_bkp' - fails because of the presence of the stale 'linkto' file.
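
For anyone trying to confirm the same condition, DHT linkto files can be identified directly on the brick backend: they are zero-byte files with only the sticky bit set (mode ---------T) and a trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the real data. A minimal inspection sketch, with the brick path and file name taken from this report as placeholders:

```sh
# Run as root on the brick that holds the suspected stale pointer
# (brick path and file name below are examples from this report).
BRICK=/home/sas/gluster/data/prod-vol
FILE="$BRICK/workspace/docs/.current_year_bkp"

# Linkto files show up as 0-byte entries with mode ---------T ...
ls -l "$FILE"

# ... and carry the DHT linkto xattr pointing at the subvolume with the data.
getfattr -n trusted.glusterfs.dht.linkto -e text "$FILE"

# The gfid is useful when cross-checking the same entry on other subvolumes;
# a pointer is stale when the subvolume it names no longer holds the data file.
getfattr -n trusted.gfid -e hex "$FILE"
```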

Needed clarifications:

  1. How can we prevent these rsync failures? Is there any way (a mount option or some volume configuration) to enable a 'force link' (i.e. remove the stale pointer and retry the link call) in the FUSE client? (A hedged manual-cleanup sketch is shown after this list.)
  2. What could be the possible reason for the creation of stale 'linkto' files?
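
For illustration only, and not something this report confirms as a supported fix: when a backend entry has been verified to be a stale linkto pointer (see the inspection sketch above), one manual workaround administrators sometimes apply is removing that entry on the brick so the next rename can create the link. All names below are examples, and touching brick backends by hand is risky, so treat this as an assumption-laden sketch rather than guidance from the maintainers.

```sh
# HYPOTHETICAL manual cleanup of a *verified* stale linkto file, run as root
# directly on the brick (not through the mount). The matching hard link under
# .glusterfs/<gfid-prefix>/ should be removed together with the visible path.
BRICK=/home/sas/gluster/data/prod-vol            # example brick from this report
FILE="$BRICK/workspace/docs/.current_year_bkp"   # example stale pointer

# Resolve the gfid so the .glusterfs hard link can be removed as well.
GFID=$(getfattr -n trusted.gfid -e hex "$FILE" 2>/dev/null \
        | awk -F= '/trusted.gfid/ {print $2}' | sed 's/^0x//')
UUID="${GFID:0:8}-${GFID:8:4}-${GFID:12:4}-${GFID:16:4}-${GFID:20:12}"

rm -f "$FILE" "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$UUID"
```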

The exact command to reproduce the issue: N/A

The full output of the command that failed:

N/A

**Expected results:**

The file should be renamed without any error, as it was before the rebalance was started.

**Mandatory info:**

**- The output of the `gluster volume info` command**:

```
Volume Name: prod-vol
Type: Distributed-Replicate
Volume ID: a0145bc0-7292-4334-bb0b-4f0eb401e79f
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 10.47.8.153:/home/sas/gluster/data/prod-vol
Brick2: 10.47.8.152:/home/sas/gluster/data/prod-vol
Brick3: 10.47.8.151:/home/sas/gluster/data/prod-vol
Brick4: 10.47.8.154:/home/sas/gluster/data/prod-vol
Brick5: 10.47.8.155:/home/sas/gluster/data/prod-vol
Brick6: 10.47.8.156:/home/sas/gluster/data/prod-vol
Brick7: 10.47.8.153:/disk1/data/glusterfs/prod-vol/subvol_003/brick
Brick8: 10.47.8.152:/disk1/data/glusterfs/prod-vol/subvol_003/brick
Brick9: 10.47.8.151:/disk1/data/glusterfs/prod-vol/subvol_003/brick
Brick10: 10.47.8.154:/disk1/data/glusterfs/prod-vol/subvol_004/brick
Brick11: 10.47.8.155:/disk1/data/glusterfs/prod-vol/subvol_004/brick
Brick12: 10.47.8.156:/disk1/data/glusterfs/prod-vol/subvol_004/brick
Options Reconfigured:
diagnostics.client-log-level: INFO
features.read-only: disable
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
diagnostics.brick-log-level: INFO
```

**- The output of the `gluster volume status` command**:

```
Status of volume: prod-vol
Gluster process                                                TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.47.8.153:/home/sas/gluster/data/code-ide              49152     0          Y       195941
Brick 10.47.8.152:/home/sas/gluster/data/code-ide              49153     0          Y       71068
Brick 10.47.8.151:/home/sas/gluster/data/code-ide              49152     0          Y       235655
Brick 10.47.8.154:/home/sas/gluster/data/code-ide              49152     0          Y       67208
Brick 10.47.8.155:/home/sas/gluster/data/code-ide              49153     0          Y       51372
Brick 10.47.8.156:/home/sas/gluster/data/code-ide              49153     0          Y       266391
Brick 10.47.8.153:/disk1/data/glusterfs/code-ide/subvol_003/brick  49153  0         Y       143788
Brick 10.47.8.152:/disk1/data/glusterfs/code-ide/subvol_003/brick  49152  0         Y       347825
Brick 10.47.8.151:/disk1/data/glusterfs/code-ide/subvol_003/brick  49153  0         Y       18580
Brick 10.47.8.154:/disk1/data/glusterfs/code-ide/subvol_004/brick  49153  0         Y       67216
Brick 10.47.8.155:/disk1/data/glusterfs/code-ide/subvol_004/brick  N/A    N/A       Y       320717
Brick 10.47.8.156:/disk1/data/glusterfs/code-ide/subvol_004/brick  49152  0         Y       51923
Self-heal Daemon on localhost                                  N/A       N/A        N       N/A
Self-heal Daemon on 10.47.8.154                                N/A       N/A        Y       208651
Self-heal Daemon on 10.47.8.155                                N/A       N/A        N       N/A
Self-heal Daemon on 10.47.8.152                                N/A       N/A        N       N/A
Self-heal Daemon on 10.47.8.156                                N/A       N/A        Y       266406
Self-heal Daemon on 10.47.8.153                                N/A       N/A        N       N/A

Task Status of Volume prod-vol
------------------------------------------------------------------------------
There are no active volume tasks
```

**- The output of the `gluster volume heal` command**:

```
Brick 10.47.8.153:/home/sas/gluster/data/prod-vol
Status: Connected
Number of entries: 0

Brick 10.47.8.152:/home/sas/gluster/data/prod-vol
Status: Connected
Number of entries: 0

Brick 10.47.8.151:/home/sas/gluster/data/prod-vol
Status: Connected
Number of entries: 0

Brick 10.47.8.154:/home/sas/gluster/data/prod-vol
Status: Connected
Number of entries: 0

Brick 10.47.8.155:/home/sas/gluster/data/prod-vol
Status: Connected
Number of entries: 0

Brick 10.47.8.156:/home/sas/gluster/data/prod-vol
Status: Connected
Number of entries: 0

Brick 10.47.8.153:/disk1/data/glusterfs/prod-vol/subvol_003/brick
Status: Connected
Number of entries: 0

Brick 10.47.8.152:/disk1/data/glusterfs/prod-vol/subvol_003/brick
Status: Connected
Number of entries: 0

Brick 10.47.8.151:/disk1/data/glusterfs/prod-vol/subvol_003/brick
Status: Connected
Number of entries: 0

Brick 10.47.8.154:/disk1/data/glusterfs/prod-vol/subvol_004/brick
Status: Connected
Number of entries: 0

Brick 10.47.8.155:/disk1/data/glusterfs/prod-vol/subvol_004/brick
Status: Connected
Number of entries: 0

Brick 10.47.8.156:/disk1/data/glusterfs/prod-vol/subvol_004/brick
Status: Connected
Number of entries: 0
```

**- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/**:

[TRACE level Client log](https://github.com/gluster/glusterfs/files/5671269/github_issue_submitted_logs_client_TRACE_level.log)

In the server brick logs, these statements are printed:

```
[server-rpc-fops_v2.c:1132:server4_link_cbk] 0-code-ide-server: 8924763460: LINK /workspace/649951035/341/34118b539476cc844f24e0df2752b47c28d1379362d54686a1793bfda4abbb2//docs/.current_year_bkp.ij2fOq () -> 9afd549b-5faa-4db0-901e-684e310153fe/.eslintignore, client: CTX_ID:a8a8a6d6-d583-4fec-acb5-cf540b5d553e-GRAPH_ID:0-PID:12739-HOST:10.47.8.178-PC_NAME:code-ide-client-2-RECON_NO:-0, error-xlator: - [File exists]
```

**- Is there any crash? Provide the backtrace and coredump**:

No crash.

**Additional info:**

**- The operating system / glusterfs version**:

Server: GlusterFS 7.8, OS CentOS 7.6
Client: GlusterFS 7.7, OS CentOS 7.6

**Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration**
Naranderan commented 3 years ago

Any insights on this would be helpful to us.

Naranderan commented 3 years ago

This issue can be reproduced in a test setup. Steps to reproduce:

  1. Create a volume (2x1).
  2. Run 'data_populator.sh' from the FUSE client to populate data - e.g. 'sh data_populator.sh '.
  3. Add a brick to the volume, so the volume becomes 3x1.
  4. Run a force rebalance and 'sh data_appender.sh ' in parallel. There are only 50 files in the volume, so the rebalance completes quickly; to reproduce the issue it is best to minimize the delay between starting the two commands.
  5. Stale sticky pointers will then be present on one of the replicas. Once a stale sticky pointer is identified, issue the following from a FUSE client to get the 'File exists' error: 'mv <a normal file on the same replica as the stale sticky pointer> <stale sticky pointer name>'. For example, if '.config' has a stale sticky pointer on replicate-0 and another normal file '.config11' is also on replicate-0, then this move from a FUSE client fails with 'File exists': mv .config11 .config. (A condensed command sketch of these steps follows the attachments below.)

Attachments: data_populator.sh.txt, data_appender.sh.txt
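
The attached scripts are not reproduced here, so the following is only a condensed sketch of the CLI side of the steps above; the volume name, hosts, brick paths, and mount point are placeholders, and the populate/append loops merely stand in for data_populator.sh and data_appender.sh.

```sh
# Placeholders: adjust hosts, brick paths and mount point to your test setup.
VOL=testvol
MNT=/mnt/$VOL

# 1. Create a plain distribute volume with 2 bricks (2x1) and mount it.
gluster volume create $VOL server1:/bricks/b1 server2:/bricks/b2 force
gluster volume start $VOL
mount -t glusterfs server1:/$VOL $MNT

# 2. Populate data (stand-in for data_populator.sh): ~50 small files with hard links.
for i in $(seq 1 50); do
    echo "data $i" > $MNT/.file$i
    ln $MNT/.file$i $MNT/.file$i.hardlink
done

# 3. Add a third brick so the volume becomes 3x1.
gluster volume add-brick $VOL server3:/bricks/b3 force

# 4. Start a force rebalance and, in parallel, rewrite files the way rsync
#    does (stand-in for data_appender.sh): write a temp name, then rename it.
gluster volume rebalance $VOL start force
for i in $(seq 1 50); do
    echo "new data $i" > $MNT/.file$i.tmp
    mv $MNT/.file$i.tmp $MNT/.file$i
done

# 5. After the rebalance, renaming a normal file onto a name that has a stale
#    sticky pointer on the same replica reproduces the error, e.g.:
#    mv $MNT/.config11 $MNT/.config      # fails with 'File exists'
```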

stale[bot] commented 3 years ago

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 3 years ago

Closing this issue as there has been no update since my last comment. If this issue is still valid, feel free to reopen it.