segfault when getting ENOSPC

ghost commented 12 years ago

When writing a file too big to fit on the underlying filesystem, one of ganesha's threads segfaults:

[root@xxx] # df -h /mnt/ganesha
Filesystem            Size  Used Avail Use% Mounted on
xxx:lustre            4.0G  1.1G  2.7G  30% /mnt/ganesha
[root@xxx] # dd if=/dev/zero of=./foo bs=1k count=4096k
dd: writing `./foo': No space left on device
904646+0 records in
904645+0 records out
926356480 bytes (926 MB) copied, 19.605 s, 47.3 MB/s

One of nfs-ganesha's threads actually segfaulted:

lustre.ganesha.[26191]: segfault at 1620 ip 00000035d2c091e0 sp 00007f02bedeb7e8 error 4 in libpthread-2.12.so[35d2c00000+17000]

The main log file sometimes contains the following, but not every time:

02/03/2012 11:40:21 epoch=1330684821 : xxx : lustre.ganesha.nfsd-26055[Worker Thread #7] :nfs3_Errno :NFS PROTO: CRITICAL ERROR: Error CACHE_INODE_FSAL_ERROR converted to NFS3ERR_IO but was set non-retryable

Build info:

nfs-ganesha compiled on Mar  2 2012 at 10:47:13
Release = 1.4.0
Release comment = GANESHA 64 bits compliant. SNMP exported stats. FSAL_PROXY re-exports NFSv3. RPCSEC_GSS support (partial). FUSELIKE added
Git HEAD = 38257b0fb7d9ae399a5cb5ec4138989f6338e5fe
Git Describe = stable_09_feb_2012-76-g38257b0
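
For readers following the CRITICAL ERROR log line above: it suggests that the out-of-space condition reached the NFS layer as a generic error and was reported as NFS3ERR_IO instead of NFS3ERR_NOSPC. The following is a minimal sketch of an errno-to-NFSv3-status mapping that keeps ENOSPC distinct. It is not nfs-ganesha's code: the enum and the function name `sketch_errno_to_nfs3` are hypothetical, and only the numeric status values are taken from RFC 1813.

```c
#include <errno.h>

/* A few NFSv3 status codes, numeric values as defined in RFC 1813.
 * The SK_ prefix marks these as names made up for this sketch. */
enum sk_nfsstat3 {
    SK_NFS3_OK       = 0,
    SK_NFS3ERR_IO    = 5,
    SK_NFS3ERR_FBIG  = 27,
    SK_NFS3ERR_NOSPC = 28,
    SK_NFS3ERR_DQUOT = 69
};

/* Hypothetical helper: translate a POSIX errno coming from the
 * underlying filesystem into an NFSv3 status. Errors that have a
 * dedicated NFSv3 code keep it; everything else falls back to
 * NFS3ERR_IO. */
static enum sk_nfsstat3 sketch_errno_to_nfs3(int err)
{
    switch (err) {
    case 0:
        return SK_NFS3_OK;
    case ENOSPC:              /* filesystem is full */
        return SK_NFS3ERR_NOSPC;
    case EDQUOT:              /* user quota exceeded */
        return SK_NFS3ERR_DQUOT;
    case EFBIG:               /* file too large */
        return SK_NFS3ERR_FBIG;
    default:                  /* no better match: generic I/O error */
        return SK_NFS3ERR_IO;
    }
}
```

With a mapping like this, the client at least receives NFS3ERR_NOSPC and can decide how to report it; whether it then shows ENOSPC or EIO to the application is a separate client-side question.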

phdeniel commented 12 years ago

The situation is really kind of messy here. I do not know whether this is related, but when the server returns NFS3ERR_NOSPC, the client does not immediately print "No space left on device" on the end user's console. I made a basic (and temporary) change that makes every NFS3PROC_WRITE return NFS3ERR_NOSPC, whatever the call is. A simple "echo 1234 > file" ends with status 0 but performs no I/O (the file remains empty). What looks like a direct consequence of that: 'dd' would loop forever. If the mount is made with option 'noac', I get an I/O error on the command line. I suspect the server has to return something special when ENOSPC occurs. I'll probably ask the linux-devel list about that. I will also reverse-engineer the behavior of a Linux knfsd server in this situation.

phdeniel commented 12 years ago

The behavior seems to be the same with knfsd. The fact that the client may use write-behind is probably related to that in some way. I am cc'ing the question to linux-nfs-devel. I do not think this is the same issue as the one Kilian saw: he got a 'No space left on device' message, and I could produce nothing but 'IO error' messages (on both ganesha and knfsd).

phdeniel commented 12 years ago

Here is the answer that I got from the linux-nfs mailing list.

Myklebust, Trond [Trond.Myklebust@netapp.com] wrote:

On Mon, 2012-03-05 at 18:16 +0100, DENIEL Philippe wrote:

Hi List,

I ran a stupid test (using the kernel's knfsd): I completely filled up a filesystem with a few big files. When it was 100% full (no free blocks at all), I ran dd on it. dd said it could write 793 blocks of 1 MB each and then failed with an I/O error. At the end, I could see an empty file in the NFS exported tree. The question is:

why did I get EIO and not ENOSPC ?

Did the server actually return NFS3ERR_NOSPC, or did it return something else? If the server returns NFS3ERR_NOSPC, then I'd expect the client to translate that as ENOSPC.

AFAIR, the server returns ENOSPC but the Linux client has a bug under certain cases that returns it as EIO. Filed a bug and Neil sent a patch but I have not tested it yet.

I tried with various versions and they gave some differing results. I think in one case fsync() got ENOSPC but close() got EIO (or the other way around?)
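
Because the error may surface at different calls depending on the client version, one way to compare clients is to record errno separately for write(), fsync() and close(). The following is a minimal sketch of such a probe, assuming a POSIX client; `struct write_report` and `write_and_report` are names invented here and are not part of any NFS tool.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical report: errno seen at each stage (0 means success). */
struct write_report {
    int open_errno;
    int write_errno;
    int fsync_errno;
    int close_errno;
    size_t written;   /* bytes accepted by write() */
};

/* Write len bytes to path and record where an error shows up.
 * With write-behind, write() can succeed while the failure is only
 * reported later, by fsync() or close(). */
static struct write_report write_and_report(const char *path,
                                            const void *buf, size_t len)
{
    struct write_report r = {0, 0, 0, 0, 0};
    const char *p = buf;
    int fd;

    assert(path != NULL && (buf != NULL || len == 0));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        r.open_errno = errno;
        return r;
    }
    while (r.written < len) {
        ssize_t n = write(fd, p + r.written, len - r.written);
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted: retry */
            r.write_errno = errno;
            break;
        }
        if (n == 0)
            break;                /* no progress: stop rather than spin */
        r.written += (size_t)n;
    }
    /* Always attempt fsync() and close(), so that deferred errors
     * are recorded too. */
    if (fsync(fd) < 0)
        r.fsync_errno = errno;
    if (close(fd) < 0)
        r.close_errno = errno;
    return r;
}
```

Calling `write_and_report()` on a file under the full NFS mount and printing the four errno fields shows at which call ENOSPC or EIO is actually delivered.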

Apparently, things are not that easy... But anyway, this is not the issue that Kilian saw.

phdeniel / nfs-ganesha

segfault when getting ENOSPC #64