Closed: AltOscar closed this issue 1 year ago.
Could you please specify the Python version as well?
On the master nodes (CentOS 7) the default OS Python version is 2.7.5, but Python up to 3.6 is installed. On the slave nodes (Debian 11) the default Python version is 3.9.2.
Could this be it, despite the fact that it did copy 8TB without much of an issue?
We do not support running Python 2 on one end and Python 3 on the other between master and slave. Please make sure you have the same version on both and let us know if you still face any issue.
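If it helps, here is a quick way to confirm what each side actually runs (a minimal sketch; run it with the same python binary that gsyncd uses on each node, which may not be the default python on the PATH):

from __future__ import print_function  # so the same snippet runs on Python 2 and Python 3
import locale
import sys

print("python version    : %s" % sys.version.split()[0])
print("default encoding  : %s" % sys.getdefaultencoding())       # 'ascii' on Python 2, 'utf-8' on Python 3
print("preferred encoding: %s" % locale.getpreferredencoding())  # driven by the node's locale (LANG/LC_*)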
Thank you for the quick response!
We changed the Python version on the slave to match the master's.
We also deleted all data on the slave and started a new geo-rep session from scratch. It ran fine for ~13 hours, but we're getting the same error message:
[2022-09-29 07:16:46.154886] E [syncdutils(worker /bricks/brick1/data):325:log_raise_exception]
The following message can be found on slave node 04:
[2022-09-29 07:40:09.772802] E [syncdutils(slave master_node_04/bricks/brick1/data):363:log_raise_exception]
What is the current version?
The current Python version is 2.7.5 on all nodes, slave and master. The current GlusterFS version is 9.6 on all nodes as well.
Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months. We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there has been no update since my last comment. If this issue is still valid, feel free to reopen it.
Description of problem: After copying ~8TB without any issue, some nodes are flipping between Active and Faulty with the following error message in the gsyncd log: ssh> failed with UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 60: ordinal not in range(128).
The default encoding on all machines is UTF-8.
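For reference, this is the failure mode of the ascii codec when it meets a byte outside ASCII (0xf2 is 'ò' in Latin-1), for example coming from a file name travelling over the ssh pipe; a minimal sketch with a made-up payload:

# Made-up payload; in our case the offending byte 0xf2 presumably comes
# from a non-UTF-8 file name crossing the ssh channel.
raw = b"some_file_name_\xf2"
try:
    raw.decode("ascii")                  # what an ASCII default locale falls back to
except UnicodeDecodeError as exc:
    print("same error as in the gsyncd log: %s" % exc)
print(raw.decode("utf-8", "replace"))    # decoding as UTF-8 with replacement does not raise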
The command to reproduce the issue:
gluster volume geo-replication master_vol user@slave_machine::slave_vol start
The full output of the command that failed: The command itself is fine; the failure only shows up after the session has been started, so the command is not the issue on its own.
Expected results: No such failures; the copy should proceed as planned.
Mandatory info:
- The output of the gluster volume info command:
Volume Name: volname
Type: Distributed-Replicate
Volume ID: d5a46398-9638-4b50-9db0-4cd7019fa526
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks: 24 bricks (omitted the names because they are not relevant and too long)
Options Reconfigured:
features.ctime: off
cluster.min-free-disk: 15%
performance.readdir-ahead: on
server.event-threads: 8
cluster.consistent-metadata: on
performance.cache-refresh-timeout: 1
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
performance.flush-behind: off
performance.cache-size: 5GB
performance.cache-max-file-size: 1GB
performance.io-thread-count: 32
performance.write-behind-window-size: 8MB
client.event-threads: 8
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 1
performance.cache-invalidation: false
performance.stat-prefetch: on
features.cache-invalidation-timeout: 30
features.cache-invalidation: off
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
storage.owner-uid: 33
storage.owner-gid: 33
features.bitrot: on
features.scrub: Active
features.scrub-freq: weekly
cluster.rebal-throttle: lazy
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
- The output of the gluster volume status command: Don't really think this is relevant as everything seems fine; if needed I'll post it.
- The output of the gluster volume heal command: Same as before.
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/: Not the relevant ones, as this is geo-rep; posting the exact issue (this log is from a master volume node):
[2022-09-23 09:53:32.565196] I [master(worker /bricks/brick1/data):1439:process] _GMaster: Entry Time Taken [{MKD=0}, {MKN=0}, {LIN=0}, {SYM=0}, {REN=0}, {RMD=0}, {CRE=0}, {duration=0.0000}, {UNL=0}]
[2022-09-23 09:53:32.565651] I [master(worker /bricks/brick1/data):1449:process] _GMaster: Data/Metadata Time Taken [{SETA=0}, {SETX=0}, {meta_duration=0.0000}, {data_duration=1663926812.5656}, {DATA=0}, {XATT=0}]
[2022-09-23 09:53:32.566270] I [master(worker /bricks/brick1/data):1459:process] _GMaster: Batch Completed [{changelog_end=1663925895}, {entry_stime=None}, {changelog_start=1663925895}, {stime=(0, 0)}, {duration=673.9491}, {num_changelogs=1}, {mode=xsync}]
[2022-09-23 09:53:32.668133] I [master(worker /bricks/brick1/data):1703:crawl] _GMaster: processing xsync changelog [{path=/var/lib/misc/gluster/gsyncd/georepsession/bricks-brick1-data/xsync/XSYNC-CHANGELOG.1663926139}]
[2022-09-23 09:53:33.358545] E [syncdutils(worker /bricks/brick1/data):325:log_raise_exception]: connection to peer is broken
[2022-09-23 09:53:33.358802] E [syncdutils(worker /bricks/brick1/data):847:errlog] Popen: command returned error [{cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-GcBeU5/38c083bada86a45a28e6710377e456f6.sock geoaccount@slavenode6 /usr/libexec/glusterfs/gsyncd slave mastervol geoaccount@slavenode1::slavevol --master-node masternode21 --master-node-id 08c7423e-c2b6-4d40-adc8-d2ded4f66608 --master-brick /bricks/brick1/data --local-node slavenode6 --local-node-id bc1b3971-50a7-4b32-a863-aaaa02419de6 --slave-timeout 120 --slave-log-level INFO --slave-gluster-log-level INFO --slave-gluster-command-dir /usr/sbin --master-dist-count 12}, {error=1}]
[2022-09-23 09:53:33.358927] E [syncdutils(worker /bricks/brick1/data):851:logerr] Popen: ssh> failed with UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 60: ordinal not in range(128).
[2022-09-23 09:53:33.672739] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2022-09-23 09:53:45.477905] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
- Is there any crash? Provide the backtrace and coredump: Provided the log above.
Additional info: Master volume: 12x2 distributed-replicated setup, has been working for a couple of years now with no big issues as of today; 160TB of data. Slave volume: 2x(5+1) distributed-disperse setup, created exclusively to be a slave geo-rep node. Managed to copy 11TB of data from the master volume, but it's failing.
- The operating system / glusterfs version: On ALL nodes: GlusterFS version 9.6. Master nodes OS: CentOS 7. Slave nodes OS: Debian 11.
Extra questions: Don't really know if this is the place to ask, but while we're at it, is there any guidance on how to improve sync performance? We tried raising the sync_jobs parameter from 3 to 9, but as far as we could see (while it was working) it would only copy from 3 nodes at most, at a "low" speed (about 40% of our bandwidth). It could go as high as 1 Gbps, but the maximum we got was 370 Mbps. Also, is there any in-depth documentation for geo-rep? The basics we found were too basic, and we were missing more docs to read and dig into.
Thank you all for the help, will try to respond with anything you need asap.
Please bear with my English, not my mother tongue
Best regards