rsync IO error (due to unreachable peer) erroneously treated as success/done

GoogleCodeExporter commented 9 years ago

lsyncd 1.39 ignores error codes from rsync while syncing dirs, so if the remote 
rsyncd is unreachable, the directory never gets synced.

The problem is very easy to reproduce:
- configure lsyncd to watch some dir, check updates are propagated OK
- stop rsyncd on remote side
- make changes to local dir, watch something like this in debug log
  select() timeout, doing delays.
  in queue: 1 expired / 0 delayed dirs
  delay expired: acting for /apps/samba/shares/deps/novgorod/out/.
  calling /usr/bin/rsync -ltd --delete --bwlimit=100 --timeout=60 --password-file=/home/transf-novgorod/.rsyncpass --exclude-from /home/transf-novgorod/lsyncd.exclusions /apps/samba/shares/deps/novgorod/out/ spb@hawk-n1.n1.maxidom.ru::from-spb/
    /usr/bin/rsync /apps/samba/shares/deps/novgorod/out/ --> spb@hawk-n1.n1.maxidom.ru::from-spb/ [7756]
  rsync: failed to connect to hawk-n1.n1.maxidom.ru: Connection refused (111)
  rsync error: error in socket IO (code 10) at clientserver.c(98)
  Forked binary process returned non-zero return code: 10
  Finished waiting for all children
  in queue: 0 expired / 0 delayed dirs
  gone blocking
- so changes are lost

This severely limits the reliability of lsyncd as a mirroring solution.

Original issue reported on code.google.com by some.h...@gmail.com on 13 Oct 2010 at 9:34

GoogleCodeExporter commented 9 years ago

This must be telepathy. Yesterday, while sketching ideas for lsyncd-2, it came 
to my mind that this had been forgotten along the way, and that I should add to 
the description that, exactly because of this, lsyncd cannot guarantee data 
transfer. For now I'd suggest a cronjob that kill -HUPs lsyncd every 24 hours / 
x days, to check whether something has been missed due to a network error. 
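
A minimal sketch of that cron workaround, assuming `pkill` (procps) is installed and that the daemon's process name is exactly `lsyncd`. The script path and the schedule in the comment are placeholders, not something lsyncd ships:

```shell
#!/bin/bash
# Hypothetical helper (say /usr/local/sbin/lsyncd-hup.sh), run from cron, e.g.
#   0 3 * * * /usr/local/sbin/lsyncd-hup.sh
# Sends SIGHUP to every process whose name is exactly "lsyncd", as suggested
# above, so that anything missed during a network error gets picked up again.
pkill -HUP -x lsyncd || echo "lsyncd-hup: no lsyncd process found" >&2
```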

When putting lsyncd into practice I unfortunately discovered, by surprise, that 
some rsync errors are inherent in the design, because lsyncd always lags behind 
the real file system. For example:
* directory X is created
* lsyncd rsyncs creation of X.
* file X/r is created
* directory X is recursively deleted.
* lsyncd tries to rsync X/r -- must fail!

I'm a little unsure how to handle this: what should happen if there are multiple 
targets? Shouldn't the unresponsiveness of one target be ignored while the 
others continue to be fed?

Otherwise, could you maybe help me? These are the exit values rsync can return, 
according to the manpage. Which behaviour should be selected in which case? 
(Possible behaviours I see are: Continue, Retry, Restart, Die)

       0      Success
       1      Syntax or usage error
       2      Protocol incompatibility
       3      Errors selecting input/output files, dirs
       4      Requested  action  not supported: an attempt was made to manipulate
              64-bit files on a platform that cannot support them; or  an
              option  was specified that is supported by the client and not by
              the server.
       5      Error starting client-server protocol
       6      Daemon unable to append to log-file
       10     Error in socket I/O
       11     Error in file I/O
       12     Error in rsync protocol data stream
       13     Errors with program diagnostics
       14     Error in IPC code
       20     Received SIGUSR1 or SIGINT
       21     Some error returned by waitpid()
       22     Error allocating core memory buffers
       23     Partial transfer due to error
       24     Partial transfer due to vanished source files
       25     The --max-delete limit stopped deletions
       30     Timeout in data send/receive
       35     Timeout waiting for daemon connection
        *     What to do on unknown exit codes?

Original comment by axk...@gmail.com on 13 Oct 2010 at 12:21

GoogleCodeExporter commented 9 years ago


Restarting lsyncd hardly counts as a workaround for me (24h is too long, and 
killing it every 15 minutes would create more problems with partial transfers). 
Rsync errors caused by FS changes during transfer are hard to avoid (maybe some 
kind of cheap zfs/netapp-style snapshots is the way to go), but loss of network 
connectivity is a different matter. 

Multiple targets are hard. The only good solution is to keep a per-target queue 
of outstanding transfers and retry them at configurable intervals. Currently I 
have no plans to use lsyncd for multiple targets (being mostly interested in 1-1 
long-distance replication), so some simpler solution will suffice -- e.g. if 
some targets are unreachable (and the queue is not empty), sleep for a 
configurable amount of time (instead of waiting for inotify forever) and retry 
the transfer.

Typical network-related rsync errors are 
  5      Error starting client-server protocol
  10     Error in socket I/O
  11     Error in file I/O
  12     Error in rsync protocol data stream
plus, probably, other errors due to temporary conditions like
  3      Errors selecting input/output files, dirs
  6      Daemon unable to append to log-file
  14     Error in IPC code
  20     Received SIGUSR1 or SIGINT
  21     Some error returned by waitpid()
  22     Error allocating core memory buffers
  30     Timeout in data send/receive
  35     Timeout waiting for daemon connection
I think those are good candidates for later retry.

2nd category are configuration-related errors like
  1      Syntax or usage error
  2      Protocol incompatibility
  4      Requested  action  not supported
Those must be corrected by configuration changes, so the target must be marked 
as 'dead' with a corresponding log record.

3rd category are errors related to FS activity during transfer
  23     Partial transfer due to error
  24     Partial transfer due to vanished source files
'Partial transfer' usually means that a file was changed while rsyncing (so a 
retry should fix it). Vanished files are probably ok to ignore (as far as I 
remember, rsync builds a list of files to transfer before the actual run and 
complains if something changes during the sync, but those errors are harmless).

The last category is artificial limit violations
  25     The --max-delete limit stopped deletions
Those probably will not occur in the lsyncd context, because there is no default 
for the max number of deleted files and there is no sane reason to explicitly 
set this parameter in the config.

Original comment by some.h...@gmail.com on 13 Oct 2010 at 2:49

GoogleCodeExporter commented 9 years ago

Typical network-related rsync errors are 
  5      Error starting client-server protocol
  10     Error in socket I/O
  11     Error in file I/O
  12     Error in rsync protocol data stream

---> So RETRY after say (DELAY) seconds?

plus, probably, other errors due to temporary conditions like
  3      Errors selecting input/output files, dirs
  6      Daemon unable to append to log-file
  14     Error in IPC code
  20     Received SIGUSR1 or SIGINT
  21     Some error returned by waitpid()
  22     Error allocating core memory buffers
  30     Timeout in data send/receive
  35     Timeout waiting for daemon connection
I think those are good candidates for later retry.
---> So RETRY after say (DELAY) seconds?

2nd category are configuration-related errors like
  1      Syntax or usage error
  2      Protocol incompatibility
  4      Requested  action  not supported
Those must be corrected by configuration changes, so the target must be marked 
as 'dead' with a corresponding log record.
---> Die?

3rd category are errors related to FS activity during transfer
  23     Partial transfer due to error
  24     Partial transfer due to vanished source files
'Partial transfer' usually means that a file was changed while rsyncing (so a 
retry should fix it). Vanished files are probably ok to ignore (as far as I 
remember, rsync builds a list of files to transfer before the actual run and 
complains if something changes during the sync, but those errors are harmless).
---> This is a good candidate for CONTINUE. I just tested: if the source does 
not exist, rsync exits with code 23.

The last category is artificial limit violations
  25     The --max-delete limit stopped deletions
Those probably will not occur in the lsyncd context, because there is no default 
for the max number of deleted files and there is no sane reason to explicitly 
set this parameter in the config.
---> I suppose DIE (if a user configured it for some reason, I suppose they 
do not want the files deleted then).

All other undocumented or future codes --- DIE? Since something should be 
reconfigured.

About separate lists for multiple targets --- yes, I plan this for lsyncd-2, 
but in the 1 series I don't see a good way to fit it into the current data 
structures.

Original comment by axk...@gmail.com on 13 Oct 2010 at 3:22

GoogleCodeExporter commented 9 years ago

> > Typical network-related rsync errors are 
> >   5     Error starting client-server protocol
> >  10     Error in socket I/O
> >  11     Error in file I/O
> >  12     Error in rsync protocol data stream
> ---> So RETRY after say (DELAY) seconds?

Yes -- keep the file in the queue and retry after some delay. Also,
- stop further resync attempts for DELAY seconds (only queue them)
- log a message to syslog at 'warning' level (like 'error NN, waiting DELAY 
  seconds')
- disable the target (until restart/HUP) after some (configurable) number of 
  failed attempts, or when the number of outstanding transfers reaches some 
  (configurable) threshold
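
The points above could be sketched as a bounded-retry rsync wrapper. This is only a sketch under assumptions: `RSYNC_BIN`, `RETRY_DELAY`, `MAX_ATTEMPTS`, `log_msg` and `retry_rsync` are made-up names, not lsyncd options. Queueing is left to lsyncd itself, and "disabling the target" is reduced to giving up and returning the last exit code.

```shell
#!/bin/bash
# Hypothetical knobs; none of these names exist in lsyncd itself.
RSYNC_BIN=${RSYNC_BIN:-/usr/bin/rsync}
RETRY_DELAY=${RETRY_DELAY:-60}     # seconds to wait between attempts
MAX_ATTEMPTS=${MAX_ATTEMPTS:-10}   # give up after this many failed attempts

# Log to syslog at the given level; fall back to stderr if logger(1) fails.
log_msg() {
    local level=$1; shift
    logger -p "daemon.$level" -t rdriver "$*" 2>/dev/null ||
        echo "rdriver[$level]: $*" >&2
}

# Run rsync with the given arguments, retrying on temporary errors.
# Returns 0 on success, otherwise the last rsync exit code.
retry_rsync() {
    local attempt=1 err
    while true; do
        err=0
        "$RSYNC_BIN" "$@" || err=$?
        case $err in
        0)
            return 0
            ;;
        3|5|6|10|11|12|14|20|21|22|30|35)
            # temporary error: retry until the attempt limit is reached
            if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
                log_msg err "error $err, giving up after $attempt attempts"
                return "$err"
            fi
            log_msg warning "error $err, waiting $RETRY_DELAY seconds"
            sleep "$RETRY_DELAY"
            attempt=$((attempt + 1))
            ;;
        *)
            # anything else is left to the caller to interpret
            return "$err"
            ;;
        esac
    done
}
```

It would be called with the usual rsync arguments, e.g. `retry_rsync -ltd --delete src/ host::module/`. The caller decides what a non-zero return means.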

> > plus, probably, other errors due to temporary conditions like
> > I think those are good candidates for later retry.
> ---> So RETRY after say (DELAY) seconds?

Yes. Those are mostly internal rsync errors (IPC code, malloc errors etc.) and 
should not happen, but a retry after a good DELAY could improve things a bit.

> > 2nd category are configuration-related errors like
> ---> Die?

Or disable this target until restart/HUP, with a corresponding error message in 
the log.

> > 3rd category are errors related to FS activity during transfer
> ---> This is a good candidate for CONTINUE. I just tested: if the source 
> does not exist, rsync exits with code 23.

Yes, could be safely skipped.

> All others undocumented or future one --- DIE? Since something should be 
> reconfigured.

Probably yes (or disable target).

Original comment by some.h...@gmail.com on 13 Oct 2010 at 4:35

GoogleCodeExporter commented 9 years ago

According to the discussion group, different admins have different 
ideas about what to do in each case. So until a big recoding is done -- 
currently I plan to use Lua as the new configuration tool -- how about simply 
using this bash rsync "driver"? Simply change <binary> to point to this script.

----
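#!/bin/bash
# rsync "driver": re-run rsync until it succeeds or fails permanently.
while true; do
    /usr/bin/rsync "$@"
    err=$?
    case $err in
    3|5|6|10|11|12|14|20|21|22|30|35)
        # network or other temporary error, retry
        echo "rdriver: retry on $err"
        sleep 5
        ;;
    1|2|4|25)
        # configuration error: kill parent (lsyncd) and stop retrying
        echo "rdriver: kill on $err"
        kill "$PPID"
        exit "$err"
        ;;
    0|23|24|*)
        # done (also for any unknown exit code)
        echo "rdriver: done on $err"
        break
        ;;
    esac
done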
----

Kind regards, Axel

Original comment by axk...@gmail.com on 14 Oct 2010 at 9:56

GoogleCodeExporter commented 9 years ago

Ok, thanks. I'll try to use it for the time being.

Original comment by some.h...@gmail.com on 14 Oct 2010 at 4:53

GoogleCodeExporter commented 9 years ago

The problem I see with the bash workaround is that it can potentially spawn a 
huge number of processes while the mirror is down, which could cause memory 
exhaustion on the source, or hammering of the target by many parallel rsync 
processes once it becomes available again.

Original comment by Felix.Bu...@gmail.com on 8 Nov 2010 at 2:56

GoogleCodeExporter commented 9 years ago

Btw. when using rsync-over-ssh a network error is usually indicated by exit 
code 255.
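
Putting the exit codes discussed in this thread together with 255, the decision could be sketched as a small lookup function. `rsync_action` is a made-up name, and the mapping is just one reading of the discussion above (unknown codes are treated as "die"), not what any lsyncd release actually does:

```shell
#!/bin/bash
# Map an rsync exit code to an action name: continue, retry or die.
# Hypothetical helper; the grouping follows the categories discussed above.
rsync_action() {
    case $1 in
    0|23|24)
        # success, or errors caused by FS activity during the transfer
        echo continue ;;
    3|5|6|10|11|12|14|20|21|22|30|35)
        # network-related or other temporary errors
        echo retry ;;
    255)
        # rsync-over-ssh: reported in this thread as the usual network error
        echo retry ;;
    *)
        # configuration errors (1, 2, 4, 25) and unknown codes
        echo die ;;
    esac
}
```

For example, `rsync_action 10` prints `retry`, so a wrapper could branch on the printed word instead of repeating the list of codes.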

Original comment by Felix.Bu...@gmail.com on 8 Nov 2010 at 3:13

GoogleCodeExporter commented 9 years ago

lsyncd 1.x will halt and wait while a process is running, so there are never 
more processes than the number of targets. If one target is down and the bash 
script from above is blocking, it will halt all targets. If the target recovers 
after a longer period of time, the inotify queue will likely have overflowed, in 
which case lsyncd does a restart to recursively sync what has been missed. I'm 
addressing these issues in lsyncd 2.0, where multiple targets are treated 
separately and will not block each other. For a single-target system, the 
workaround should work just fine, as one would expect.

Original comment by axk...@gmail.com on 8 Nov 2010 at 7:26

GoogleCodeExporter commented 9 years ago

Lsyncd 2.0 will repeat on network errors.

Original comment by axk...@gmail.com on 27 Nov 2010 at 1:20

Changed state: Fixed
