gluster / glusterfs

Gluster Filesystem: Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Geo replication: UnicodeDecodeError #3830

Closed. AltOscar closed this issue 1 year ago.

AltOscar commented 2 years ago

Description of problem: After copying ~8TB without any issue, some nodes keep flipping between Active and Faulty with the following error message in the gsyncd log: ssh> failed with UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 60: ordinal not in range(128).

The default encoding on all machines is UTF-8.
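For what it's worth, the failure mode itself is easy to reproduce in isolation. A minimal sketch (not GlusterFS code; the filename bytes are made up): Python's ascii codec rejects any byte >= 0x80, such as the 0xf2 from the log, regardless of the system locale.

```python
# Minimal sketch (not GlusterFS code): the 'ascii' codec rejects any
# byte >= 0x80, so a byte string containing e.g. 0xf2 (as in the gsyncd
# log above) fails to decode no matter what LANG/LC_* are set to.
data = b"some-filename-with-\xf2-in-it"  # hypothetical byte string

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)
# 'ascii' codec can't decode byte 0xf2 in position 19: ordinal not in range(128)
```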

Command to reproduce the issue:

gluster volume geo-replication master_vol user@slave_machine::slave_vol start

The full output of the command that failed: the command itself runs fine; the failure only appears after the session has been started, so the command on its own is not the issue.

Expected results: No such failures; the copy should proceed as planned.

Mandatory info: - The output of the gluster volume info command:

Volume Name: volname
Type: Distributed-Replicate
Volume ID: d5a46398-9638-4b50-9db0-4cd7019fa526
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks: 24 bricks (names omitted since they are not relevant and too long)
Options Reconfigured:
features.ctime: off
cluster.min-free-disk: 15%
performance.readdir-ahead: on
server.event-threads: 8
cluster.consistent-metadata: on
performance.cache-refresh-timeout: 1
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
performance.flush-behind: off
performance.cache-size: 5GB
performance.cache-max-file-size: 1GB
performance.io-thread-count: 32
performance.write-behind-window-size: 8MB
client.event-threads: 8
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 1
performance.cache-invalidation: false
performance.stat-prefetch: on
features.cache-invalidation-timeout: 30
features.cache-invalidation: off
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
storage.owner-uid: 33
storage.owner-gid: 33
features.bitrot: on
features.scrub: Active
features.scrub-freq: weekly
cluster.rebal-throttle: lazy
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

- The output of the gluster volume status command:

I don't think this is relevant, as everything seems fine; if needed, I'll post it.

- The output of the gluster volume heal command: Same as before.

- Provide logs present on the following locations of client and server nodes: /var/log/glusterfs/

Those are not the relevant ones since this is geo-replication, so I'm posting the exact issue instead (this log is from a master volume node):

[2022-09-23 09:53:32.565196] I [master(worker /bricks/brick1/data):1439:process] _GMaster: Entry Time Taken [{MKD=0}, {MKN=0}, {LIN=0}, {SYM=0}, {REN=0}, {RMD=0}, {CRE=0}, {duration=0.0000}, {UNL=0}]
[2022-09-23 09:53:32.565651] I [master(worker /bricks/brick1/data):1449:process] _GMaster: Data/Metadata Time Taken [{SETA=0}, {SETX=0}, {meta_duration=0.0000}, {data_duration=1663926812.5656}, {DATA=0}, {XATT=0}]
[2022-09-23 09:53:32.566270] I [master(worker /bricks/brick1/data):1459:process] _GMaster: Batch Completed [{changelog_end=1663925895}, {entry_stime=None}, {changelog_start=1663925895}, {stime=(0, 0)}, {duration=673.9491}, {num_changelogs=1}, {mode=xsync}]
[2022-09-23 09:53:32.668133] I [master(worker /bricks/brick1/data):1703:crawl] _GMaster: processing xsync changelog [{path=/var/lib/misc/gluster/gsyncd/georepsession/bricks-brick1-data/xsync/XSYNC-CHANGELOG.1663926139}]
[2022-09-23 09:53:33.358545] E [syncdutils(worker /bricks/brick1/data):325:log_raise_exception] : connection to peer is broken
[2022-09-23 09:53:33.358802] E [syncdutils(worker /bricks/brick1/data):847:errlog] Popen: command returned error [{cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-GcBeU5/38c083bada86a45a28e6710377e456f6.sock geoaccount@slavenode6 /usr/libexec/glusterfs/gsyncd slave mastervol geoaccount@slavenode1::slavevol --master-node masternode21 --master-node-id 08c7423e-c2b6-4d40-adc8-d2ded4f66608 --master-brick /bricks/brick1/data --local-node slavenode6 --local-node-id bc1b3971-50a7-4b32-a863-aaaa02419de6 --slave-timeout 120 --slave-log-level INFO --slave-gluster-log-level INFO --slave-gluster-command-dir /usr/sbin --master-dist-count 12}, {error=1}]
[2022-09-23 09:53:33.358927] E [syncdutils(worker /bricks/brick1/data):851:logerr] Popen: ssh> failed with UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 60: ordinal not in range(128).
[2022-09-23 09:53:33.672739] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2022-09-23 09:53:45.477905] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
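For anyone debugging the same thing, here is a hypothetical helper (my own sketch, not an official GlusterFS tool; the brick path is taken from the log above) that walks a brick and prints entries whose names are not pure ASCII, since such names are the usual trigger for this decode error:

```python
import os

# Hypothetical debugging aid, not part of GlusterFS: list entries under
# a brick whose names contain non-ASCII characters, the usual trigger
# for ASCII-only decode failures in a replication pipeline.
BRICK = "/bricks/brick1/data"  # path taken from the log above

for root, dirs, files in os.walk(BRICK):
    for name in dirs + files:
        try:
            name.encode("ascii")
        except UnicodeEncodeError:
            print(os.path.join(root, name))
```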

- Is there any crash? Provide the backtrace and coredump: provided in the log above.

Additional info: Master volume: 12x2 distributed-replicated setup; it has been working for a couple of years now with no big issues as of today. 160TB of data. Slave volume: 2x(5+1) distributed-disperse setup, created exclusively to be a slave geo-replication node. Managed to copy 11TB of data from the master node, but it keeps failing.

- The operating system / glusterfs version:

On ALL nodes: GlusterFS version 9.6
Master nodes OS: CentOS 7
Slave nodes OS: Debian 11

Extra questions: I don't really know if this is the place to ask, but while we're at it, any guidance on how to improve sync performance? I tried raising the sync_jobs parameter to 9 (from 3), but as far as we could see (while it was working) it would only copy from 3 nodes at most, at a "low" speed (about 40% of our bandwidth). It could go as high as 1 Gbps, but the maximum we got was 370 Mbps. Also, is there any in-depth documentation for geo-replication? The basics we found were too basic, and we were missing more docs to read and dig into.

Thank you all for the help; I will try to respond with anything you need ASAP.

Please bear with my English; it is not my mother tongue.

Best regards

Shwetha-Acharya commented 2 years ago

Could you please specify the Python version as well?

AltOscar commented 2 years ago

On the master nodes (CentOS 7) the OS default Python version is 2.7.5, but Python up to 3.6 is also installed. On the slave nodes (Debian 11) the default Python version is 3.9.2.

Could this be the cause, despite the fact that it did copy 8TB without much of an issue?

Shwetha-Acharya commented 2 years ago

> Could this be the cause, despite the fact that it did copy 8TB without much of an issue?

We do not support mixing Python 2 and Python 3 between slave and master. Please make sure you have the same version on both and let us know if you still face any issue.
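A quick way to verify this on each node; a minimal sketch, assuming you run it with whichever interpreter gsyncd actually resolves to on your distribution (confirming which one that is is left to you):

```python
# Minimal version check to run on every master and slave node, using
# the interpreter that gsyncd would use, so the outputs can be compared.
import sys

print(sys.executable, ".".join(map(str, sys.version_info[:3])))
```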

AltOscar commented 2 years ago

Thank you for the quick response!

We changed the Python version on the slave to match the master's.

Also, we deleted all data on the slave and started a new geo-replication session from scratch. It went well for ~13 hours, but we are getting the same error message:

[2022-09-29 07:16:46.154886] E [syncdutils(worker /bricks/brick1/data):325:log_raise_exception] : connection to peer is broken
[2022-09-29 07:16:46.155607] E [syncdutils(worker /bricks/brick1/data):847:errlog] Popen: command returned error [{cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-9K4Fob/1caaca939375a14a7509312ce698bd03.sock slave_node_04 /usr/libexec/glusterfs/gsyncd slave mastervol georep::session --master-node master_node_25 --master-node-id b2af376f-8673-4cca-9b06-6db65b559118 --master-brick /bricks/brick1/data --local-node slave_node_04 --local-node-id 9fe2fd13-dd3a-498d-ad56-09c8f3ce2bac --slave-timeout 120 --slave-log-level INFO --slave-gluster-log-level INFO --slave-gluster-command-dir /usr/sbin --master-dist-count 12}, {error=1}]
[2022-09-29 07:16:46.155893] E [syncdutils(worker /bricks/brick1/data):851:logerr] Popen: ssh> failed with UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 63: ordinal not in range(128).
[2022-09-29 07:16:46.558873] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2022-09-29 07:16:58.372021] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
[2022-09-29 07:16:58.372637] I [monitor(monitor):160:monitor] Monitor: starting gsyncd worker [{brick=/bricks/brick1/data}, {slave_node=slave_node_04}]

The following message can be found on slave node 04:

[2022-09-29 07:40:09.772802] E [syncdutils(slave master_node_04/bricks/brick1/data):363:log_raise_exception] : FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 393, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1163, in
    t = syncdutils.Thread(target=lambda: (repce.service_loop(),
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 94, in service_loop
    self.q.put(recv(self.inf))
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 63, in recv
    return pickle.load(inf.buffer)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 63: ordinal not in range(128)
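That traceback ends in pickle.load, which looks like the classic Python 2 / Python 3 pickle mismatch: Python 3's pickle.load decodes Python-2-produced str objects with encoding='ASCII' by default. A minimal sketch of that failure mode, assuming the sending side emits Python-2-style string pickles (the byte payload below is hand-crafted for illustration, not a captured gsyncd message):

```python
import pickle

# Hand-crafted protocol-2 pickle, equivalent to what Python 2 emits for
# the 5-byte str 'caf\xc3\xa9': SHORT_BINSTRING opcode ('U'), length,
# raw bytes, BINPUT memo, STOP. Purely illustrative.
py2_pickle = b"\x80\x02U\x05caf\xc3\xa9q\x00."

# Python 3's pickle.load()/loads() decode Python 2 str objects with
# encoding='ASCII' by default -- the same failure as in the traceback.
try:
    pickle.loads(py2_pickle)
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xc3 in position 3 ...

# With an explicit encoding the same payload round-trips fine:
print(pickle.loads(py2_pickle, encoding="utf-8"))  # café
```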

Shwetha-Acharya commented 2 years ago

What is the current version?

AltOscar commented 2 years ago

The current Python version is 2.7.5 on all nodes, slave and master. The current GlusterFS version is 9.6 on all nodes as well.

stale[bot] commented 1 year ago

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 1 year ago

Closing this issue as there has been no update since my last update. If this issue is still valid, feel free to reopen it.