Open MarSoft opened 2 years ago
Original (but now inactive) Lsyncd author here. I skimmed over the code and it also seems to me that there is a race condition here. You need to set the `last_rsynced` time *before* you run rsync, because for changes that happen during the rsync command you just don't know whether rsync got them over the line or not; you need to call it another time for everything that happened in the meantime. Unless I'm mistaken, this should be easily fixable by simply swapping lines 45 and 46 of main.py -- but not tested.
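A minimal sketch of the ordering being described, assuming a `state` dict holding `last_rsynced` and an injectable `run_rsync` callable (both hypothetical names, not from the project):

```python
import time

def sync_once(run_rsync, state):
    # Take the timestamp *before* rsync starts: a change that lands while
    # rsync is running may or may not have made it into this pass, so it
    # must still count as "newer than last sync" on the next pass.
    started_at = time.time()
    run_rsync()
    state["last_rsynced"] = started_at  # NOT time.time() after the run
```

Setting the timestamp after the run is the race: everything modified during the transfer would then look older than the last sync and never be picked up.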
I still suggest writing a heavy-duty stress test that randomly creates, changes, deletes, and moves a ton of files and directories in rapid succession and then runs a diff to check whether source and target are the same -- like churn-rsync.lua, but that's Lua and I guess you'd want to redo it in Python. It took me a while to remove all race conditions (AFAIK). Funnily enough, when I first published it, multiple people told me you just can't reliably build an asynchronous live-sync program. It is possible, but there are pitfalls.
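The two halves of such a churn test might look like this in Python -- a hedged sketch with hypothetical helper names; a real test would run the sync tool between the churn and the comparison, and would also churn subdirectories, not just flat files:

```python
import filecmp
import os
import random

def churn(root, ops=200, seed=0):
    """Randomly create, modify, delete, and move files under root."""
    rng = random.Random(seed)
    files = []
    for i in range(ops):
        action = rng.choice(["create", "modify", "delete", "move"])
        if action == "create" or not files:
            path = os.path.join(root, f"f{i}.txt")
            with open(path, "w") as fh:
                fh.write(str(rng.random()))
            files.append(path)
        elif action == "modify":
            with open(rng.choice(files), "a") as fh:
                fh.write("x")
        elif action == "delete":
            os.remove(files.pop(rng.randrange(len(files))))
        else:  # move
            src = files.pop(rng.randrange(len(files)))
            dst = src + ".moved"
            os.rename(src, dst)
            files.append(dst)

def trees_equal(a, b):
    """Deep-compare two directory trees (names and per-file stat signatures)."""
    cmp = filecmp.dircmp(a, b)
    if cmp.left_only or cmp.right_only or cmp.diff_files:
        return False
    return all(trees_equal(os.path.join(a, d), os.path.join(b, d))
               for d in cmp.common_dirs)
```

The fixed `seed` keeps failures reproducible, which matters a lot when you are hunting race conditions.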
A few further short notes: IMO you should check rsync's network-related exit codes and retry on them, otherwise you fall out of sync on a network hiccup. And one thing I learned from a lot of feedback: don't enable `--delete` in the default config. I know Lsyncd does that, but in hindsight it was less than ideal and I would do it differently today.
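For example, one might retry only on rsync's network-related exit codes (per the rsync man page: 10 socket I/O error, 12 protocol data stream error, 30 and 35 timeouts) while treating everything else as permanent. A sketch with an injectable runner so it can be exercised without a network:

```python
import subprocess
import time

# rsync exit codes that usually indicate transient network trouble
# (from the rsync man page): 10 socket I/O, 12 protocol stream,
# 30 data timeout, 35 daemon-connection timeout.
NETWORK_ERRORS = {10, 12, 30, 35}

def rsync_with_retry(args, retries=3, backoff=5.0, run=subprocess.run):
    """Run rsync, retrying on network-related exit codes.

    `run` is injectable for testing; by default it is subprocess.run.
    Returns True on success, False on permanent failure or exhausted retries.
    """
    for attempt in range(retries + 1):
        result = run(["rsync", *args])
        if result.returncode == 0:
            return True
        if result.returncode not in NETWORK_ERRORS or attempt == retries:
            return False  # permanent error, or out of retries
        time.sleep(backoff * (attempt + 1))  # simple linear backoff
    return False
```

Non-network errors (e.g. 23, partial transfer due to vanished files) arguably deserve different handling than a blind retry, which is why they short-circuit here.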
Two minor differences: the moves-via-ssh feature isn't there, but that was a less important optimization that had been requested; and Lsyncd always calls rsync with a custom filter setup, so rsync only has to inspect the files that actually changed -- that can make a big difference on large file trees. On the flip side, by using fswatch as an established peer program I guess you get fanotify functionality out of the box, something Lsyncd never got. Back in the day, fanotify in the kernel didn't report moves and was thus unusable, so Lsyncd had to rely on inotify, which uses kernel memory for every directory watched. That has since been fixed.
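The filter trick can be sketched as follows: for each changed file path, include its ancestor directories so rsync can descend into them, include the path itself, and exclude everything else. This is an assumption-laden sketch of the general scheme (it handles plain file paths only, and is not Lsyncd's exact rule output):

```python
def filters_for(paths):
    """Build rsync filter arguments that restrict the scan to `paths`.

    Hypothetical helper: each changed path gets its parent directories
    included (so rsync can descend), then the path itself, and finally
    a catch-all exclude so nothing else is even stat()ed.
    """
    rules = []
    seen = set()
    for path in paths:
        parts = path.strip("/").split("/")
        for i in range(1, len(parts)):
            parent = "/" + "/".join(parts[:i]) + "/"
            if parent not in seen:      # avoid duplicate include rules
                seen.add(parent)
                rules += ["--include", parent]
        rules += ["--include", "/" + "/".join(parts)]
    rules += ["--exclude", "*"]
    return rules
```

These arguments would then be spliced into the rsync command line ahead of the source and destination.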
Honestly, if I were to redo everything from scratch, I'd use a database like leveldb to store the timestamps of the source tree, so the need to always run a full initial rsync on startup would fall away.
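The idea, sketched with the stdlib `dbm` module standing in for leveldb: persist each file's mtime, and on startup sync only the paths whose recorded mtime differs. This is a hypothetical function assuming mtimes are a good-enough change signal; a real version would also have to notice deletions (keys in the db with no file on disk):

```python
import dbm
import os

def changed_since_last_run(root, db_path):
    """Yield paths whose mtime differs from the recorded one, updating
    the record as we go. Sketch of the idea using stdlib dbm in place
    of leveldb."""
    with dbm.open(db_path, "c") as db:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                mtime = str(os.stat(path).st_mtime_ns).encode()
                key = path.encode()
                if db.get(key) != mtime:
                    db[key] = mtime
                    yield path
```

On a clean second start with nothing changed, this yields nothing, so no initial full-tree rsync would be needed.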
I wish you much success with your project! Back in the day, when I needed live synchronization functionality, I was surprised there was no open-source solution available; we can certainly use more diversity here. A few projects flickered up, but unfortunately most seem to have died off again. There was also one in Python -- unfortunately I forget the name -- that used multithreading with Python's built-in semaphore mechanics, and the code didn't even seem very complicated.
Ah, one thing I forgot, which indeed makes the loop more complicated (as MarSoft suggested): you should first fully exhaust the message queue from your notification mechanism, then set the timestamp (covering everything drained), and only then call rsync. I don't know how your mechanism works here, but you'd need some peek functionality to see whether another event is incoming before you try to receive it (and thus block if there isn't).
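If the watcher feeds events into a `queue.Queue` (an assumed design -- fswatch itself writes lines to a pipe), the "peek" can be approximated by a short receive timeout that treats silence as "queue exhausted":

```python
import queue

def drain(events, settle=0.1):
    """Exhaust the event queue before syncing.

    Keep receiving until no new event arrives within `settle` seconds --
    a stand-in for a peek, since queue.Queue cannot peek without
    consuming. Returns the batch of drained events.
    """
    batch = []
    while True:
        try:
            batch.append(events.get(timeout=settle))
        except queue.Empty:
            return batch
```

The loop would then be: `drain(...)`, record the timestamp, run rsync -- so every drained event is covered by that one timestamp.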
Thank you @MarSoft @axkibe for your great suggestions. This package was built in a night, since I needed a quick solution to synchronize the source code on my local machine to a remote server where I could compile and run it. Many things have not been considered yet.
Unfortunately, I am not using the output of `fswatch` to drive `rsync`; `fswatch` is just a signal to trigger `rsync`, which still has to scan the entire source tree. Besides, setting a minimum time interval between two consecutive `rsync` runs does not seem like an ideal or efficient solution either.
My actual experience with this package is that it's too slow. After changing the source code, I often go directly to the server's console and run it; however, sometimes the changes aren't there yet, which makes for an awful debugging experience. This package, or even `lsyncd`, is probably better suited to backup purposes.
I think this package needs deeper thought about its use cases and especially its efficiency. Something like `mirror` would probably be more suitable here.
Consider the following: two files are changed almost simultaneously -- `A`, then `B`. File `A` is large enough. `fswatch` detects that file `A` is changed and tells us. We find out that file `A` was changed and start `rsync`. Now, `rsync` runs for long enough, and during that time files `B` and `C` are changed too. After `rsync` has finished, we set the `last_synced` time to the current time (i.e. to the time when `rsync` has finished) and read the next line from `fswatch`, telling us that file `B` is changed, too. We check the current time and find that just a little moment has passed since we did the last sync (as we check the synchronization end time, not the synchronization start time). So we don't run `rsync`.

Now, nothing is changed for a long time, and the directories are out of sync: file `A` is synchronized but file `B` isn't. This will only be fixed after some other file is changed, which may not happen very soon.

What we should do IMO: if no new changes happen for `rsync_interval` seconds but we know about some non-synchronized data, then still run `rsync`. This will complicate the code logic, and a simple `for line in ...` loop won't work in this scenario. But it will avoid having directories unsynchronized for too long.
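A sketch of such a loop, under stated assumptions: `events` is a `queue.Queue` fed by the watcher, and `clock`/`stop` are injectable for testing (all hypothetical names). While changes are pending it waits at most until `last_synced + rsync_interval` and then syncs even if no further event arrives; when idle it blocks indefinitely:

```python
import queue

def sync_loop(events, run_rsync, rsync_interval, clock, stop):
    """Never leave pending changes unsynced past rsync_interval."""
    pending = False
    last_synced = clock()
    while not stop():
        if pending:
            # Changes are waiting: wait only until the interval deadline.
            timeout = max(0.0, last_synced + rsync_interval - clock())
        else:
            timeout = None  # idle: block until something changes
        try:
            events.get(timeout=timeout)
            pending = True
            continue  # keep absorbing events until the deadline passes
        except queue.Empty:
            pass
        # Deadline reached with changes pending: sync now.
        last_synced = clock()  # stamp *before* the transfer starts
        run_rsync()
        pending = False
```

Note the timestamp is taken before `run_rsync()`, folding in the race-condition fix discussed above, and the `queue.Empty` branch is only reachable while `pending` is set, so an idle loop never runs rsync.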