This must be telepathy. Yesterday, while sketching ideas for lsyncd-2, it
occurred to me that this had been forgotten along the way, and that I should
add to the description that, exactly because of this, lsyncd cannot guarantee
data transfer. For now I'd suggest a cronjob that kill -HUPs lsyncd every
24 hours / x days, to check whether something has been missed due to a
network error.
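A minimal sketch of that cron workaround, as a crontab entry. It assumes the daemon runs under the process name `lsyncd`, that `pkill` is installed, and that a nightly run at 03:00 is acceptable; none of these details come from this thread or from lsyncd's documentation.

```
# m h dom mon dow  command
# Assumed schedule and process name; adjust both to the local setup.
0 3 * * * pkill -HUP -x lsyncd
```

This only narrows the window in which a missed transfer goes unnoticed; it does not make delivery guaranteed.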
When putting lsyncd into practice I discovered, unfortunately by surprise, that
some rsync errors are inherent in the design, because lsyncd
always lags behind the real file system. For example:
* directory X is created
* lsyncd rsyncs creation of X.
* file X/r is created
* directory X is recursively deleted.
* lsyncd tries to rsync X/r -- must fail!
I'm a little unsure how to handle this: what should happen if there are
multiple targets? Shouldn't the unresponsiveness of one target be ignored
while the others continue to be fed?
Otherwise, could you help me? These are the exit values rsync can return,
according to its manpage. Which behaviour should be selected in which case?
(Possible behaviours I see are: Continue, Retry, Restart, Die)
0 Success
1 Syntax or usage error
2 Protocol incompatibility
3 Errors selecting input/output files, dirs
4 Requested action not supported: an attempt was made to manipulate
64-bit files on a platform that cannot support them; or an
option was specified that is supported by the client and not by
the server.
5 Error starting client-server protocol
6 Daemon unable to append to log-file
10 Error in socket I/O
11 Error in file I/O
12 Error in rsync protocol data stream
13 Errors with program diagnostics
14 Error in IPC code
20 Received SIGUSR1 or SIGINT
21 Some error returned by waitpid()
22 Error allocating core memory buffers
23 Partial transfer due to error
24 Partial transfer due to vanished source files
25 The --max-delete limit stopped deletions
30 Timeout in data send/receive
35 Timeout waiting for daemon connection
* What to do on unknown exit codes?
Original comment by axk...@gmail.com
on 13 Oct 2010 at 12:21
[deleted comment]
Restarting lsyncd hardly counts as a workaround for me (24h is too long, and
killing it every 15 minutes would create more problems with partial
transfers). Rsync errors caused by FS changes during transfer are hard to
avoid (maybe some kind of cheap zfs/netapp-style snapshots are the way to go),
but loss of network connectivity is a different matter.
Multiple targets are hard. The only good solution is to keep a per-target
queue of outstanding transfers and retry them at configurable intervals.
Currently I have no plans to use lsyncd for multiple targets (being mostly
interested in 1-1 long-distance replication), so some simpler solution will
suffice -- e.g. if some targets are unreachable (and the queue is not empty),
sleep for a configurable amount of time (instead of waiting for inotify
forever) and retry the transfer.
Typical network-related rsync errors are
5 Error starting client-server protocol
10 Error in socket I/O
11 Error in file I/O
12 Error in rsync protocol data stream
plus, probably, other errors due to temporary conditions like
3 Errors selecting input/output files, dirs
6 Daemon unable to append to log-file
14 Error in IPC code
20 Received SIGUSR1 or SIGINT
21 Some error returned by waitpid()
22 Error allocating core memory buffers
30 Timeout in data send/receive
35 Timeout waiting for daemon connection
I think those are good candidates for later retry.
The 2nd category is configuration-related errors like
1 Syntax or usage error
2 Protocol incompatibility
4 Requested action not supported
Those must be corrected by configuration changes, so the target must be marked
as 'dead' with a corresponding log record.
The 3rd category is errors related to FS activity during transfer
23 Partial transfer due to error
24 Partial transfer due to vanished source files
'Partial transfer' usually means that a file was changed while rsyncing (so a
retry should fix it).
Vanished files are probably ok to ignore (as far as I remember, rsync builds a
list of files to transfer before the actual run and complains if something
changes during the sync, but those errors are harmless).
The last category is artificial limit violations
25 The --max-delete limit stopped deletions
Those will probably not occur in the lsyncd context, because there is no
default for the maximum number of deleted files and there is no sane reason
to set this parameter explicitly in the config.
Original comment by some.h...@gmail.com
on 13 Oct 2010 at 2:49
Typical network-related rsync errors are
5 Error starting client-server protocol
10 Error in socket I/O
11 Error in file I/O
12 Error in rsync protocol data stream
---> So RETRY after say (DELAY) seconds?
plus, probably, other errors due to temporary conditions like
3 Errors selecting input/output files, dirs
6 Daemon unable to append to log-file
14 Error in IPC code
20 Received SIGUSR1 or SIGINT
21 Some error returned by waitpid()
22 Error allocating core memory buffers
30 Timeout in data send/receive
35 Timeout waiting for daemon connection
I think those are good candidates for later retry.
---> So RETRY after say (DELAY) seconds?
The 2nd category is configuration-related errors like
1 Syntax or usage error
2 Protocol incompatibility
4 Requested action not supported
Those must be corrected by configuration changes, so the target must be marked
as 'dead' with a corresponding log record.
---> Die?
The 3rd category is errors related to FS activity during transfer
23 Partial transfer due to error
24 Partial transfer due to vanished source files
'Partial transfer' usually means that a file was changed while rsyncing (so a
retry should fix it).
Vanished files are probably ok to ignore (as far as I remember, rsync builds a
list of files to transfer before the actual run and complains if something
changes during the sync, but those errors are harmless).
---> This is a good candidate for CONTINUE. I just tested: if the source does
not exist, rsync returns error 23.
The last category is artificial limit violations
25 The --max-delete limit stopped deletions
Those will probably not occur in the lsyncd context, because there is no
default for the maximum number of deleted files and there is no sane reason
to set this parameter explicitly in the config.
---> I suppose DIE (if a user configured it for some reason, I suppose he
does not want the files to be deleted then)
All other codes, undocumented or future ones --- DIE? Since something should
be reconfigured.
About separate lists for multiple targets --- yes, I plan this for lsyncd-2,
but in the 1 series I don't see a good way to fit it into the current data
structures.
Original comment by axk...@gmail.com
on 13 Oct 2010 at 3:22
> > Typical network-related rsync errors are
> > 5 Error starting client-server protocol
> > 10 Error in socket I/O
> > 11 Error in file I/O
> > 12 Error in rsync protocol data stream
> ---> So RETRY after say (DELAY) seconds?
Yes -- keep file in queue and retry after some delay. Also,
- stop further resync attempts for DELAY seconds (only queue them)
- log message to syslog on 'warning' level (like 'error NN, waiting DELAY
seconds')
- disable target (until restart/HUP) after some (configurable) number of failed
attempts, or when the number of outstanding transfers reaches some
(configurable) threshold
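The retry policy in the list above could be sketched roughly as follows. This is only an illustration, not lsyncd code: the function name `retry_limited` and its arguments are hypothetical, and the messages go to stderr here (in a real setup they could be handed to syslog instead).

```shell
#!/bin/bash
# retry_limited MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS runs
# have failed. Returns 0 on success, otherwise the exit status of the last
# failed attempt, so the caller can disable the target.
retry_limited() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1 status=0
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (last error $status)" >&2
            return "$status"
        fi
        echo "error $status, waiting $delay seconds" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}
```

A call such as `retry_limited 5 30 /usr/bin/rsync "$@"` would then try at most five times with 30 seconds between attempts; both numbers are arbitrary placeholders for the configurable values discussed above.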
> > plus, probably, other errors due to temporary conditions like
> > I think those are good candidates for later retry.
> ---> So RETRY after say (DELAY) seconds?
Yes. Those are mostly internal rsync errors (IPC code, malloc errors etc.) and
should not happen, but a retry after a good DELAY could improve things a bit.
> > 2nd category are configuration-related errors like
> ---> Die?
Or disable this target until restart/HUP, with a corresponding error message
in the log.
> > 3rd category are errors related to FS activity during transfer
> ---> This is a good candidate for CONTINUE. I just tested: if the source
> does not exist, rsync returns error 23.
Yes, could be safely skipped.
> All others undocumented or future one --- DIE? Since something should be
> reconfigured.
Probably yes (or disable target).
Original comment by some.h...@gmail.com
on 13 Oct 2010 at 4:35
According to the discussion group, different admins have different ideas
about what to do here. So until a big recoding is done -- I currently plan to
use Lua as the new configuration tool -- how about simply using this bash
rsync "driver"? Simply change <binary> to point to this script.
----
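#!/bin/bash
# rsync "driver": run rsync and react to its exit code.
while true; do
    /usr/bin/rsync "$@"
    err=$?
    case $err in
    3|5|6|10|11|12|14|20|21|22|30|35)
        # transient (mostly network) error: retry after a pause
        echo "rdriver: retry on $err"
        sleep 5
        ;;
    1|2|4|25)
        # configuration error: kill parent (lsyncd) and stop, do not loop again
        echo "rdriver: kill on $err"
        kill "$PPID"
        exit "$err"
        ;;
    0|23|24|*)
        # success, harmless partial transfer, or unknown code: done
        echo "rdriver: done on $err"
        break
        ;;
    esac
done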
----
Kind regards, Axel
Original comment by axk...@gmail.com
on 14 Oct 2010 at 9:56
Ok, thanks. I'll try to use it for the time being.
Original comment by some.h...@gmail.com
on 14 Oct 2010 at 4:53
The problem I see with the bash workaround is that it can potentially spawn a
huge number of processes while the mirror is down, which could cause memory
exhaustion on the source, or many parallel rsync processes hammering the
target once it becomes available again.
Original comment by Felix.Bu...@gmail.com
on 8 Nov 2010 at 2:56
Btw. when using rsync-over-ssh a network error is usually indicated by exit
code 255.
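If the driver script from earlier in this thread is used with rsync-over-ssh, the retry list could be extended along these lines. This is a sketch only; the helper name `is_retryable` is hypothetical, and treating 255 as transient is an assumption based on the comment above.

```shell
#!/bin/bash
# is_retryable EXIT_CODE
# Hypothetical helper: returns 0 if the rsync exit code looks transient.
is_retryable() {
    case "$1" in
        # retry list taken from the driver script in this thread
        3|5|6|10|11|12|14|20|21|22|30|35) return 0 ;;
        # assumed: 255 comes from the ssh transport (connection failure)
        255) return 0 ;;
        *) return 1 ;;
    esac
}
```

In the driver, the retry branch could then be guarded by `if is_retryable "$err"; then sleep 5; fi` instead of listing the codes inline.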
Original comment by Felix.Bu...@gmail.com
on 8 Nov 2010 at 3:13
lsyncd 1.x will halt and wait while a process is running, so there are never
more processes than the number of targets. If one target is down and the bash
script from above is blocking, it will halt all targets. If it recovers after
a longer period of time, the inotify queue will likely have overflowed, in
which case lsyncd does a restart to recursively sync what has been missed.
I'm addressing these issues in lsyncd 2.0, where multiple targets are treated
separately and will not block each other. For a single-target system, the
workaround should work just fine, as one would expect.
Original comment by axk...@gmail.com
on 8 Nov 2010 at 7:26
Lsyncd 2.0 will retry on network errors.
Original comment by axk...@gmail.com
on 27 Nov 2010 at 1:20
Original issue reported on code.google.com by
some.h...@gmail.com
on 13 Oct 2010 at 9:34