larroy / clearskies_core

Open source, distributed, secure data synchronization using the clearskies protocol
GNU Lesser General Public License v3.0

Implement filesystem scanner #16

Closed larroy closed 10 years ago

larroy commented 10 years ago

Implement a syncrhonous filesystem scanner using boost filesystem for portability. It should traverse a share / directory and have access to a state holding information from previous scans.

As a first approach, implement it just as a dumb scanner without inotify or similar.

It should traverse the directory tree and compare file modification times against the stored state. If a file has changed, recompute its hash and mark the file as "updated".

I think the state could well be stored in sqlite3 to avoid excessive memory consumption.
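
A minimal sketch of such a dumb scanner, written in Python for brevity rather than with boost filesystem. The table name `files`, the function names `scan` and `sha256_of`, and the choice of SHA-256 are assumptions made for illustration; nothing here is taken from the project itself.

```python
import hashlib
import os
import sqlite3


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(share_root, db):
    """Walk share_root and rehash files whose mtime differs from the stored state.

    Returns the list of paths marked as updated during this scan.
    """
    # Hypothetical schema: one row per file seen in a previous scan.
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, hash TEXT)"
    )
    updated = []
    for dirpath, _dirnames, filenames in os.walk(share_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            row = db.execute(
                "SELECT mtime_ns FROM files WHERE path = ?", (path,)
            ).fetchone()
            if row is not None and row[0] == st.st_mtime_ns:
                continue  # unchanged since the previous scan
            db.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (path, st.st_mtime_ns, st.st_size, sha256_of(path)),
            )
            updated.append(path)
    db.commit()
    return updated
```

Calling `scan(root, sqlite3.connect("state.db"))` twice in a row should report every file the first time and nothing the second time, as long as no file was touched in between. The sketch does not yet detect deleted files; that would need a second comparison of stored paths against the paths seen in the walk.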

detunized commented 10 years ago

Check this out: https://github.com/facebook/watchman Presentation: http://www.youtube.com/watch?v=Dlguc63cRXg#t=190

jewel commented 10 years ago

One neat trick for the dumb scanner is to measure how long it takes to do the scan and then adjust the delay between scans accordingly. For example, the delay could be ten times the scan time, with a minimum of one minute.
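
As a rough illustration of that rule (the function names `next_scan_delay` and `scan_forever` are made up for this sketch, and `scan_once` stands for whatever performs one full scan):

```python
import time


def next_scan_delay(scan_seconds, factor=10.0, minimum=60.0):
    """Delay before the next scan: factor times the scan time, at least minimum."""
    return max(minimum, factor * scan_seconds)


def scan_forever(scan_once):
    """Run scan_once repeatedly, sleeping in proportion to how long it took."""
    while True:
        started = time.monotonic()
        scan_once()
        time.sleep(next_scan_delay(time.monotonic() - started))
```

With these defaults a 2-second scan waits the one-minute minimum, while a 30-second scan waits five minutes.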

Since inotify requires resources per directory monitored, very large shares might need to fall back to the dumb scanner. Also, it'd be wise to occasionally do a dumb scan to make sure that nothing was missed. You'd also always want to do a dumb scan on startup.

watchman seems to be missing support for Windows. I know that the "listen" gem covers Windows as well as the other platforms, via a Ruby gem for each platform. The relevant gem (wdm, in the case of Windows) is written in C and might have code that we can use.

jewel commented 10 years ago

One other idea. If the directory traversal is a separate thread from the checksumming, it allows for easy "percent-complete" status information for the checksum process (which takes almost all of the time). To reduce unintended I/O load on spinning disks, the checksum thread can sleep while a full-system scan is taking place.

larroy commented 10 years ago

I seem to misspell synchronous all the time, sorry :-)

I like this idea, but why not use two passes instead? The first pass scans, the second checksums. That way we have the file sizes up front and can show a progress indicator for the checksum pass. I would like to keep the number of threads to a minimum.
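
A single-threaded sketch of the two-pass idea, in Python for brevity. The names `collect_sizes` and `checksum_with_progress`, the `report` callback, and the use of SHA-256 are assumptions for illustration only.

```python
import hashlib
import os


def collect_sizes(share_root):
    """Pass 1: walk the tree and record (path, size) for every file found."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(share_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                entries.append((path, os.path.getsize(path)))
            except OSError:
                pass  # file vanished during the walk
    return entries


def checksum_with_progress(entries, report, chunk_size=1 << 20):
    """Pass 2: hash each file, reporting the fraction of total bytes hashed."""
    total = sum(size for _path, size in entries)
    done = 0
    hashes = {}
    for path, _size in entries:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
                done += len(chunk)
                # Clamp in case a file grew after pass 1 measured it.
                report(min(1.0, done / total) if total else 1.0)
        hashes[path] = digest.hexdigest()
    return hashes
```

Because the byte total is known before any hashing starts, `report` can drive a percent-complete display without a second thread.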

jewel commented 10 years ago

Two passes are fine for the dumb scanner, but once you add inotify things get more complicated, since file changes can come in at any time and have to be queued up for checksumming. (inotify will drop events if they aren't handled fast enough. I believe it's safe to stat() the file in the callback since the metadata will already be live, but other operations, such as writing to sqlite, should be deferred.)

In any case, until we have file-change notification, better to keep it as simple as possible.
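
For later reference, the queued hand-off described above might look roughly like this. It is a sketch only: `on_change` is a hypothetical callback that some notification backend would invoke, no real inotify binding is used, and `drain` stands in for the deferred checksum and sqlite work.

```python
import os
import queue

# Changes wait here until the slower checksum/database work picks them up.
pending = queue.Queue()


def on_change(path):
    """Hypothetical notification callback: stat only, enqueue, return quickly."""
    try:
        st = os.stat(path)
    except OSError:
        return  # file already gone; nothing to queue
    pending.put((path, st.st_mtime_ns, st.st_size))


def drain(handle):
    """Process everything queued so far; returns the number of items handled."""
    count = 0
    while True:
        try:
            item = pending.get_nowait()
        except queue.Empty:
            return count
        handle(*item)  # deferred work, e.g. checksum and write to sqlite
        count += 1
```

Keeping the callback down to a single `stat()` and a queue insert is what lets events be consumed quickly; everything expensive happens in `drain`, which can run between scans or on a worker.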