Handle temporary readdir errors

amir73il commented 2 years ago

This work was done to tackle real world issues encountered with fpsync of large data sets from commercial SMB file servers.

readdir() errors used to be rare in filesystems, which may explain how the FTS bug still exists in distros today. One of the reasons to start seeing readdir() errors recently on Linux SMB client is the addition of support for compound SMB requests added in recent Linux kernels.

That support defers the opendir() request to the server until the first readdir() call and opendir() on the server is prone to ENFILE/EMFILE temporary errors, which get reflected as EBADF to the readdir() caller.

martymac commented 2 years ago

Hello Amir,

Thanks a lot for your patch!

I am quite busy at the moment, I'll have a look at it ASAP :)

amir73il commented 2 years ago

I am quite busy at the moment, I'll have a look at it ASAP :)

No problem, just please note the issue with TOOL_MODEOPTS_R. The job with the single erroneous directory should use recursive args. I've only tested with rsync BTW.

martymac commented 2 years ago

Hello Amir,

As you have seen, I've started adding a few comments. I still have to look at the patch in deeper details, mostly regarding the fact that we would use a negative size for error reporting.

I have also seen cosmetic things that could be changed. Maybe you could give me write access to your forked branch so we could update it together ?

amir73il commented 2 years ago

Hello Amir,

As you have seen, I've started adding a few comments. I still have to look at the patch in deeper details, mostly regarding the fact that we would use a negative size for error reporting.

I have also seen cosmetic things that could be changed. Maybe you could give me write access to your forked branch so we could update it together ?

@martymac I don't know how to do that. But anyway, please feel free to close the PR, checkout my readdir_error branch, push a branch the same name to your Github and modify it as you wish or make changes directly to your master branch if you prefer.

If you want me to make more changes I can sync back your branch to mine. Let me know how you want to proceed.

Thanks, Amir.

martymac commented 2 years ago

Hi again Amir,

As you suggested, I have duplicated your branch to a new one in my repo, so you will be able to re-sync yours from it.

As you can see, I have already changed a few things. The fts/readdir() patch seems ok now, I'll probably submit it upstream (FreeBSD) too.

I still have to study other parts of the patch and will probably come back with several questions soon :)

Best regards,

Ganael.

martymac commented 2 years ago

Hi again Amir,

I think patches to fts + verbosity increasing are good, so no pb about them. I'll only discuss here changes made to fpart (mostly src/fpart.c and src/file_entry.c).

I understand that the overall goal of the patch is to isolate a failing directory alone in a "dedicated" partition and be able to identify it from fpsync to pass the recursive flag to rsync (right ?). But how will rsync handle the problem ? Doesn't that just postpone the troubles ? (and, on the opposite, fpart could be successfully traversing a directory but rsync could be failing on it afterwards if the underlying FS is not reliable - rsync crawls FS very fast too, it could trigger a temporary error the same way, couldn't it ?). If you can give me more details, I'll appreciate that.

Anyway, the idea is interesting as it will help identify/isolate broken parts of a filesystem. So we need a way to provide error status to fpsync to enable recursive flag or not. I can see two cases, as -Z is restricted to live mode :

1 - partitions printed to a file (option -o) 2 - partitions printed to stdout

The first is the one used by fpsync, so why not just add an additional env variable reflecting a global status regarding the partition ? FPART_PARTSTATUS for example, that would be 0 if success and 1 (or fts_errno ?) in case of error.

For the latter case (stdout), instead of using max partition size as signed and divide its max possible value by 2 (which I would like to avoid), why not just display (we are writing to stdout) a letter instead of the size, such as 'E' for error (or hardprint '-1' if you prefer) and let the consumer program handle the case. We would avoid to touch the current partition size, and hack on it.

Your option -Z would just become : pack un-readable directories in separate partitions.

What do you think ? Would that filt your needs ?

Regarding the "consuming" part (fpsync), I'll review your changes later when that underlying part is handled.

amir73il commented 2 years ago

@martymac your suggestion in fine by me. I do not mind how fpart exports the error as long as fpsync will use recursive rsync options on the single directory. For that matter FPART_PARTSTATUS is perfectly fine and makes sense for fpsync.

Regarding your question about the use case, there are two answers:

Empiric reason: We have tested this solution in production many times on a large data set on an Enterprise File Server. Before the fix, subtrees would often silently not get synced by fpsync. After the fix, most of the time, everything is synced properly and when it doesn't rsync reports the error.
Theoretical reason: Typically, when syncing a very large data set from a file server, fpart can take a few minutes and traverses many dirs very fast so it is not unlikely for an SMB server to fail readdir due to temporary overuse of resources. rsyncing the parts OTOH then takes a lot longer, hours, and the rate of dir traversal is throttled by the rate of rsync of parts and is not anywhere near the rate of dir traversal by fpart. In practice, readdir errors observed during rsync of parts are rare.

martymac commented 2 years ago

Hi Amir,

Thanks for the clarification. I'll work on my branch and update the code regarding that solution ASAP. Stay tuned :)

Cheers,

Ganael.

martymac commented 2 years ago

Hello Amir,

I have pushed a first update ; as already discussed, I've tried to generalize your idea.

Now, in live mode, I always propagate errno to a new hook variable called FPART_PARTERRNO (-Z not necessary) which will contain 0 if every single partition's entry has been fts_read() without error, else last entry's errno.

A negative size is not used anymore to indicate the error (you should use the dedicated hook variable instead to detect errors) : an erroneous partition is marked as zero-sized. As a consequence, -Z is not incompatible anymore with -zz. In fact, it requires it and can be seen as a complement (as just telling how to pack those erroneous directories marked as empty).

Also, I removed the requirements for OPT_LEAFDIRS (option -D, marked as -E in your original patch), as I don't know why it would be required (?).

This is still not well tested, so please test !

Two things remain :

1) See if FTS_ERR needs to add the enrtry like FTS_DNR does (your original patch did, but I am not sure we should). The problem is : fts can (now, after being fixed) return FTS_ERR for a readdir() error, but also for a file. I don't know if there is a safe way to handle that here (I don't want to add an erroneous file entry when dealing with directories).

2) patch fpsync to consume FPART_PARTERRNO

Any feedback welcome ! And many thanks again for your contribution :)

amir73il commented 2 years ago

@martymac your design changes look good. Removing dependency of OPT_LEAFDIRS looks correct. Regarding the questions about FTS_ERR, I don't know the answer. I got to this code with a bit of trial and error. I will try to prepare a reproducer for you so you can experiment with readdir errors yourself.

martymac commented 2 years ago

Thanks for your feedback.

Yes, if you have something to reproduce the problem, that can be useful, thanks!

amir73il commented 2 years ago

You can use https://github.com/amir73il/fsnotify-utils/blob/master/src/test/fanotify_example.c to monitor a mount with permission events and fail every Nth readdir requests, for example: sudo ./fanotify_example /mnt/nfs/ 3 or modify the example as you see fit

amir73il commented 2 years ago

@martymac, I suspect something is not right with your change to handle FTS_ERR. When traversing small dirs, readdir is called exactly twice, once to get the content and another time to get a "no more files" result.

If you run the fanotify_example with error injection every even number of readdir, errors will happen after dir entries have already been read and the outcome is incorrect. Please see file readdir_error_tests.txt on my readdir_error branch.

Also please note that at least on Ubuntu 20.04.3, fpart must be configured with --enable-embfts to even observe the readdir errors.

martymac commented 2 years ago

Hello Amir,

Thanks for your feedback. Unfortunately, as fas as I know, fanotify is not available on FreeBSD (fpart is developped on FreeBSD) so I won't be able to use your tool easily. I was thinking to using LD_PRELOAD and simulate readdir() that way, but it's not ready yet.

Anyway, I see two main differences with your original patch :

as explained earlier, FTS_ERR is not (yet) handled in my patch, only FTS_DNR is (which is probably wrong, but I would like to avoid adding file entries when returning from a file-related FTS_ERR when dealing with directories)
I added a missing part of the original patch to embedded fts version (anyway, it should be OK)

When you write "please note that at least on Ubuntu 20.04.3, fpart must be configured with --enable-embfts to even observe the readdir errors" do you mean that with Ubuntu's system fts, everything goes right ? Does it include fts' readdir() patch or not at all ?

Could you try to revet db26cde to see if the difference comes from fts() or the fact that FTS_ERR is not handled yet ?

amir73il commented 2 years ago

If you have a Linux machine handy, you can test fpart on an nfs mount to nfs export on the Linux machine. Then you could use fanotify on the server side, but the results could be different then what I got. Using LD_PRELOAD sound like a good approach for unit tests.
I did already check with and without the FTS patch, both on mine and your branch - no change
I mean Ubuntu's system fts doesn't have the libc patch at all, so it does not return FTS_DNR I observer the same with CentOS7, so you may want to consider building the distro packages with --enable-embfts, or notify downstream maintainers of fpart packages for those distros

martymac commented 2 years ago

Hello Amir,

I have been looking at the results, the difference probably comes from FTS_ERR not adding the entry, but I am struggling to reproduce the problem (my current LD_PRELOAD lib, faking readdir() breaks crawling at first level).

Could you rerun the exact same test "Errors injection at even #readdir" (the last one) with -vvv (instead of -vv) on both branches please ? That way, I'll get fts_info.

(meanwhile, if you want, you can also try add the directory entry in FTS_ERR the same way as it is done in FTS_DNR to see if it fixes the problem)

Thanks for your help !

Ganael.

martymac commented 2 years ago

Hi again Amir,

I've pushed a fix that makes FTS_ERR handled as FTS_DNR for errors regarding directories (partial readdir()s). That should fix your problem.

Could you try it and tell me how it goes please ?

amir73il commented 2 years ago

@martymac I would test your change only I realized that the outcome of my branch for FTP_ERR is wrong. First of all, as far as I recall, the only readdir errors I have seen in real life are on first readdir due to deferred opendir on server. So for that case, extending the meaning of FTS_DNR to error in first readdir as you already did suffice and the "odd #readdir" test case works as expected with your branch.

So how should we handle FTS_ERR? The outcome of my branch is wrong because it would end up dispatching parallel recursive rsync jobs to dirs a/b/c a/b/c/d/e/f and a/b/c/d/e/f/g/h/i Nothing good can come out of this. So in fact, the only two sane reactions to FTS_ERR are to emit an error and continue (as your branch does) or abort the program with an error. For production, I would prefer to have the second option as opt-in. Do you think you could add that option?

martymac commented 2 years ago

Hi Amir,

Thanks for your feedback.

I really think we should handle FTS_ERR.

My tests show that a partially-read directory returns through FTS_ERR (that was the point of your fts patch) with p pointing to the directory entry, that's what I was wondering and that's good news because we can safely make the distinction between misc errors and partial directory reads from here.

Also, a partially-read directory never returns through FTS_DNR, so we may skip directory entries if we do not handle that case in FTS_ERR (that is what was happening in your tests). I agree that it may not be important for Linux SMB client but that may happen with other implementations (?).

The only thing to note is that, when handling FTS_ERR that way and using option -zz without -E, we can end up adding erroneous directories as zero-sized while files within those directories have already been packed. In fact, option -zz can be rephrased as "treat un-readable or erroneous directories as empty, causing them to be packed anyway".

As an example, we may end up with :

Partition 1 containing /tmp/foo/file1 [readdir() error while reading /tmp/foo/] Partition 2 containing /tmp/foo/ + errno set

But I think it is what we want as the aim of fpart is to partition a FS tree (i.e. not loose/exclude files). In my example, as foo/ is erroneous, it is better to have it packed than not, to avoid missing remaining files within. That way, the consumer program can decide to recurse on it given the value of FPART_PARTERRNO and have another chance to copy it. In that case, the worst scenario is trying to recurse on a failing directory, while if we do not pack it, it would be loosing files.

As a conclusion, I think we can leave the patch as is. What do you think ? Does the current implementation produce better (acceptable) results for your tests ?

That will leave fpsync to adapt. I still don't know if it would be useful/needed for fpsync to be able to differentiate FTS_ERR dirs from FTS_DNR ones (both zero-sized with FPART_PARTERRNO set). I'll have a look at that...

amir73il commented 2 years ago

This plan sounds good. I'm away from my workstation so will test whatever you have next week. I just wonder about packing the overlapping erroneous dirs. It's better to pack them then loosing them but perhaps fpsync should run those jobs not in parallel to other jobs to avoid stepping on each other toes?? For this matter if readdir error was caught before reading any files there is no such concern...

martymac commented 2 years ago

Hi Amir,

Not sure it would be a problem as re-syncing an erroneous dir would -at worst- re-sync already-synced (or currently being-synced, in case of parallel jobs) files. The last sync job wins. We are already in the same situation if the files are being edited underneath while a sync job is running. As we are working on improving option -E (only), which is supposed to be used for final passes (to enable --delete), that should not be a problem.

I'll work on the fpsync part and enable that as "aggressive mode" (i.e. enabling rsync's recursive flag for erroneous dirs). I propose to enable it by passing option -E twice, which sound logical to me.

I'll come back with a patch soon.

Meanwhile, it would be nice if you can confirm the fpart part is OK. No hurry anyway... :)

Cheers,

Ganael.

amir73il commented 2 years ago

Hi Ganael,

Results of your branch are now identical to results of my branch on my data set with odd and even error injection patterns.

I pushed the file with results including -vvv to my branch. -EE flag for fpsync sounds good to me.

Thanks, Amir.

martymac commented 2 years ago

Hi Amir,

Thanks for the new results. It looks all good!

I've just push the fpsync patch. Could you try it and tell me how it goes ?

Thanks!

amir73il commented 2 years ago

Works for me :) I used prepare mode to prepare the parts and queue and then stopped fanotify and ran fpsync in resume mode then recursive rsync found the skipped dirs. Thanks!

martymac commented 2 years ago

Good news, thanks!

I'll rebase / merge everything ASAP.

Thanks again for your contribution!

martymac commented 2 years ago

Hi Amir,

I've manually merged + pushed all changes from that branch to the master branch.

Thanks again for your contribution!

I'll now close that merge request.

Best regards,

Ganael.

martymac / fpart

Handle temporary readdir errors #37