mickours / lsyncd

Automatically exported from code.google.com/p/lsyncd
GNU General Public License v2.0

Data loss from failed rsync runs #57

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Run lsyncd and have it sync the FS /a/b to /x/y.
2. Fill the /x/y FS to the brim, until ENOSPC is returned.
3. Copy new files into /a/b from somewhere.

What is the expected output? What do you see instead?

lsyncd should detect that rsync has failed to sync /a/b to /x/y because of
ENOSPC, and it should retry the events. And of course, a syslog message should
be issued in this case.

What happens instead is that, even after the user finds /x/y full and space
becomes available again, /a/b and /x/y remain out of sync.

What version of the product are you using? On what operating system?
2.0.4.

Please provide any additional information below.

The easy workaround is to restart lsyncd, but that should not be needed. Many
times this failure will go undetected, and the user will happily think that
lsyncd is working its magic and that the files are mirrored. By the time the
user finds out about the full FS, it may be too late (the changes in /a/b may
be gone) and severe data loss may occur.

Original issue reported on code.google.com by dev...@gmail.com on 5 May 2011 at 11:18

GoogleCodeExporter commented 9 years ago
The rsync result in that case is error code 12. The same happens when the
target directory doesn't even exist.

I'm not sure if it should keep retrying (forever), or die loudly.

Original comment by axk...@gmail.com on 6 May 2011 at 5:23

GoogleCodeExporter commented 9 years ago
I think it should make a lot of noise. Syslogging to the console and 
/var/log/messages is a must.

Additionally, if we can raise an Xdialog when X is present, or somehow send an
event to the desktop notification area when a desktop is running, that would
be the best solution. Typically, in our setup we are using it to sync local
folders to NFS mounts, and this is a desktop environment. So X is always
running, along with either GNOME or KDE.

And I think it should not quit, because the loss may get much bigger: when
space does become available and users make more changes, a running lsyncd
would sync those changes (and might even sync some of the old lost changes).
These are always-on desktop VMs. People don't keep poking to see whether
lsyncd is running, and they don't start troubleshooting if it's not.

Original comment by dev...@gmail.com on 6 May 2011 at 5:43

GoogleCodeExporter commented 9 years ago
BTW, we will need to flood-control the syslog messages if we decide to keep
lsyncd running.
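
One way to picture the flood control mentioned here is a small wrapper that emits a given message at most once per interval and counts the repeats it suppresses. This is a minimal Python sketch, not part of Lsyncd: `ThrottledLogger`, its parameters, and the "repeated N more times" wording are all hypothetical names chosen for illustration.

```python
import time


class ThrottledLogger:
    """Emit each distinct message at most once per `interval` seconds."""

    def __init__(self, emit, interval=300.0, clock=time.monotonic):
        self.emit = emit          # callable taking one string, e.g. syslog.syslog
        self.interval = interval  # minimum seconds between identical messages
        self.clock = clock        # injectable so the behavior can be tested
        self._last = {}           # message -> (last emit time, suppressed count)

    def log(self, message):
        """Emit or suppress `message`; return True if it was emitted."""
        now = self.clock()
        last, suppressed = self._last.get(message, (None, 0))
        if last is not None and now - last < self.interval:
            # Too soon after the previous identical message: only count it.
            self._last[message] = (last, suppressed + 1)
            return False
        if suppressed:
            self.emit(f"{message} (repeated {suppressed} more times)")
        else:
            self.emit(message)
        self._last[message] = (now, 0)
        return True
```

On a Unix system, `emit` could be `syslog.syslog` from the Python standard library; a failing sync that is retried every few seconds would then produce one log line per interval instead of one per attempt.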

Original comment by dev...@gmail.com on 6 May 2011 at 5:48

GoogleCodeExporter commented 9 years ago
We can do it that way: error code 12 goes back to "again". However, on startup,
any "again" condition will result in a fail (that may be due to a full disc, the
target directory not being there, or the network connection not being
available). I think it should bail there on startup, since often these
"temporal" failures can easily be the result of misconfiguration, e.g. a typo
in the target directory, hostname, etc.

Regarding X window messages, I'll not add that into vanilla Lsyncd. But you can
easily override the collect function of the default config and import a
Lua GUI toolkit you like. Or one can write a separate GUI application that
watches Lsyncd messages.

Original comment by axk...@gmail.com on 6 May 2011 at 6:41

GoogleCodeExporter commented 9 years ago
This problem should imho be solved in monitoring software. There are a few
programs that monitor /var/log/messages or other given logs (lsyncd.log),
parse them for any nonstandard messages, and send those (for example) once a
day by mail. Imho this type of solution keeps things in the "standard unix"
spirit (many small specialized programs, each for one task).
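
As a rough illustration of that approach, a separate script could scan the log and collect lines that look like failures for a daily report. The sketch below is an assumption-laden example: the pattern, the function name `suspicious_lines`, and the sample log lines are made up here, and real Lsyncd log output may look different.

```python
import re

# Assumed keywords; adjust to whatever the actual log lines contain.
SUSPICIOUS = re.compile(r"error|fail|again", re.IGNORECASE)


def suspicious_lines(lines):
    """Return the lines (without trailing newline) that match SUSPICIOUS."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


if __name__ == "__main__":
    # Invented sample input, standing in for the contents of a log file.
    sample = [
        "Normal: finished a sync\n",
        "Error: rsync exited with code 12\n",
    ]
    for line in suspicious_lines(sample):
        print(line)
```

The collected lines could then be handed to whatever mail or alerting tool the site already uses.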

Original comment by luva...@gmail.com on 15 Aug 2011 at 8:48

GoogleCodeExporter commented 9 years ago
Changes for the upcoming 2.0.5: any otherwise temporary (retryable) error will
make Lsyncd fail on startup unless settings.insist is set. That is because often
enough these "temporal" errors are actually misconfigurations, like the target
directory not being there or such.

Original comment by axk...@gmail.com on 17 Aug 2011 at 9:46

GoogleCodeExporter commented 9 years ago
Changes for the upcoming 2.0.5: any otherwise temporary (retryable) error will
make Lsyncd fail on startup unless settings.insist is set. That is because often
enough these "temporal" errors are actually misconfigurations, like the target
directory not being there or such.

Error code 12 is now again marked as "temporal"; that is, in normal operation
Lsyncd will keep trying (and have that one sync hang) until it eventually
makes it through.
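
The startup and retry rules described in this comment can be sketched in Python. This is an illustration only, not Lsyncd's actual Lua code: `EXIT_ACTIONS`, `classify`, and `sync_until_done` are hypothetical names, and only the mapping of code 12 to "again" and the `insist` override are taken from this thread. Treating every unlisted exit code as fatal is an assumption made to keep the example short.

```python
import time

# Only 12 -> "again" comes from the thread; 0 -> "ok" is the usual success
# code, and every other code is assumed fatal for this sketch.
EXIT_ACTIONS = {0: "ok", 12: "again"}


def classify(exit_code, starting_up, insist=False):
    """Return "ok", "again", or "die" for a sync exit code."""
    action = EXIT_ACTIONS.get(exit_code, "die")
    if action == "again" and starting_up and not insist:
        # A retryable error during startup is treated as misconfiguration.
        return "die"
    return action


def sync_until_done(run_once, starting_up, insist=False, delay=5.0,
                    sleep=time.sleep):
    """Call run_once() until it succeeds; return the number of attempts.

    run_once is any callable returning an exit code. Raises RuntimeError
    as soon as a result is classified as fatal.
    """
    attempts = 0
    while True:
        attempts += 1
        code = run_once()
        action = classify(code, starting_up, insist)
        if action == "ok":
            return attempts
        if action == "die":
            raise RuntimeError(f"sync failed with exit code {code}")
        sleep(delay)  # "again": wait, then retry the same sync
```

In this sketch `run_once` could wrap a call such as `subprocess.call([...])` for the actual transfer; the `sleep` parameter is injectable only so the retry loop can be exercised without waiting.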

Otherwise I agree with luva... Any further notifications to the user should be
left to other people's monitoring software.

Original comment by axk...@gmail.com on 17 Aug 2011 at 9:50