sahlberg / libnfs

NFS client library
Other
521 stars 201 forks source link

Possible NFS Race condition (XBMC on R-Pi with FreeNAS 8.x.x) #25

Closed MilhouseVH closed 10 years ago

MilhouseVH commented 12 years ago

Initial ticket raised on XBMC issue tracker (along with Wireshark packet captures): http://trac.xbmc.org/ticket/13324

XBMC on R-Pi (OpenELEC, Raspbmc) suffers NFS connectivity issues such as being unable to establish an NFS connection at all, and randomly unable to stat existing files/folders even though a connection is successfully established. This is when connecting with a FreeNAS 8.0.2 server (x64 build). These problems occur about 50% of the time after booting, the other 50% of the time the R-Pi is able to establish a connection and stat files/folders without any problem.

As far as I can recall there were no NFS problems with the R-Pi until around the time libNFS 1.3 was introduced (June/July-ish?)

Using OS-level NFS mounts (ie. mounting via /etc/fstab) in OpenELEC/Raspbmc always work reliably - it's only the NFS mounts established within XBMC that are problematic.

Let me know if I can provide any more detail on this frustrating issue.

MilhouseVH commented 12 years ago

Just to clarify, "R-Pi" stands for "Raspberry Pi"!

MilhouseVH commented 12 years ago

Just to add, XBMC (OpenELEC r12224) also has problems with the very latest FreeNAS 8.3.0-RC1.

Here's an extract of the XBMC log while communicating with FreeNAS 8.3.0-RC1 - sorry it has no debug level messages.

This communication occurred straight after booting into XBMC, and while browsing the main menu showing "Recently Added Movies":

19:10:04 T:1211384928   ERROR: NFS: Failed to stat(mnt/share/media/Video-Private/Movies/Screamers (1995)[DVDRip].tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOENT(-2)"
19:10:04 T:1219773536   ERROR: NFS: Failed to stat(mnt/share/media/Video/Movies HD/Three Kings (1999)[BDRip].tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOENT(-2)"
19:10:04 T:1176695904   ERROR: Texture manager unable to load file: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/Impostor (2001)[BDRip].tbn
19:10:04 T:1211384928   ERROR: Texture manager unable to load file: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies/Screamers (1995)[DVDRip].tbn
19:10:19 T:1176695904   ERROR: NFS: Failed to stat(mnt/share/media/Video-Private/Movies/Screamers (1995)[DVDRip].tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOENT(-2)"
19:10:20 T:1219773536   ERROR: Previous line repeats 1 times.
19:10:20 T:1219773536   ERROR: NFS: Failed to stat(mnt/share/media/Video-Private/Movies HD/Impostor (2001)[BDRip].tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOENT(-2)"
19:10:21 T:1176695904   ERROR: NFS: Failed to stat(mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOENT(-2)"
19:10:26 T:1074688528  NOTICE: COMXPlayer: Opening: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv
19:10:26 T:1074688528 WARNING: CDVDMessageQueue(player)::Put MSGQ_NOT_INITIALIZED
19:10:26 T:1232749664  NOTICE: Thread COMXPlayer start, auto delete: false
19:10:26 T:1232749664  NOTICE: OMXClock using video as reference
19:10:26 T:1232749664  NOTICE: Creating InputStream
19:10:26 T:1232749664   ERROR: COMXPlayer::OpenInputStream - error opening [nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv]
19:10:26 T:1232749664  NOTICE: COMXPlayer::OnExit()
19:10:26 T:1232749664  NOTICE: COMXPlayer::OnExit() deleting input stream
19:10:26 T:1074688528   ERROR: XBMC.PlayMedia could not play media: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv
19:10:26 T:1074688528 WARNING: CDVDMessageQueue(player)::Put MSGQ_NOT_INITIALIZED
19:10:29 T:1074688528  NOTICE: COMXPlayer: Opening: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv
19:10:29 T:1074688528 WARNING: CDVDMessageQueue(player)::Put MSGQ_NOT_INITIALIZED
19:10:29 T:1232749664  NOTICE: Thread COMXPlayer start, auto delete: false
19:10:29 T:1232749664  NOTICE: OMXClock using video as reference
19:10:29 T:1232749664  NOTICE: Creating InputStream
19:10:29 T:1232749664   ERROR: COMXPlayer::OpenInputStream - error opening [nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv]
19:10:29 T:1232749664  NOTICE: COMXPlayer::OnExit()
19:10:29 T:1232749664  NOTICE: COMXPlayer::OnExit() deleting input stream
19:10:29 T:1074688528   ERROR: XBMC.PlayMedia could not play media: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv
19:10:29 T:1074688528 WARNING: CDVDMessageQueue(player)::Put MSGQ_NOT_INITIALIZED
19:10:31 T:1074688528  NOTICE: COMXPlayer: Opening: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv
19:10:31 T:1074688528 WARNING: CDVDMessageQueue(player)::Put MSGQ_NOT_INITIALIZED
19:10:31 T:1232749664  NOTICE: Thread COMXPlayer start, auto delete: false
19:10:31 T:1232749664  NOTICE: OMXClock using video as reference
19:10:31 T:1232749664  NOTICE: Creating InputStream
19:10:31 T:1232749664   ERROR: COMXPlayer::OpenInputStream - error opening [nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv]
19:10:31 T:1232749664  NOTICE: COMXPlayer::OnExit()
19:10:31 T:1232749664  NOTICE: COMXPlayer::OnExit() deleting input stream
19:10:31 T:1074688528   ERROR: XBMC.PlayMedia could not play media: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv
19:10:31 T:1074688528 WARNING: CDVDMessageQueue(player)::Put MSGQ_NOT_INITIALIZED
19:10:42 T:1074688528  NOTICE: COMXPlayer: Opening: nfs://192.168.0.200/mnt/share/media/Video-Private/Movies HD/A Scanner Darkly (2006)[BDRip].mkv
19:10:42 T:1074688528 WARNING: CDVDMessageQueue(player)::Put MSGQ_NOT_INITIALIZED
19:10:42 T:1232749664  NOTICE: Thread COMXPlayer start, auto delete: false
19:10:42 T:1232749664  NOTICE: OMXClock using video as reference
19:10:42 T:1232749664  NOTICE: Creating InputStream
19:10:42 T:1232749664  NOTICE: OMXClock using video as reference
19:10:42 T:1232749664  NOTICE: Creating Demuxer
19:10:45 T:1232749664  NOTICE: Opening video stream: 0 source: 256
19:10:45 T:1232749664  NOTICE: OMXClock using video as reference
19:10:45 T:1232749664  NOTICE: Creating video thread
19:10:45 T:1268773984  NOTICE: Thread COMXPlayerVideo start, auto delete: false
19:10:45 T:1232749664  NOTICE: Opening audio stream: 1 source: 256
19:10:45 T:1232749664  NOTICE: Creating audio thread
19:10:45 T:1277162592  NOTICE: Thread COMXPlayerAudio start, auto delete: false
19:10:45 T:1232749664  NOTICE: Opening Subtitle stream: 2 source: 256
19:10:45 T:1232749664  NOTICE: OMXClock using video as reference
19:10:45 T:1268773984  NOTICE: Display resolution DESKTOP : 1920x1080 @ 60.00 - Full Screen (16)

Note how XBMC is unable to load a bunch of thumbnails to be shown in "Recently Added Movies", and then won't play "A Scanner Darkly" despite several attempts, however after a short delay (11 seconds) it suddenly decides that it can find and will now play this movie (and it also subsequently managed to find and display the thumbnails it had been unable to locate earlier).

This is often how it is with XBMC and NFS connections to FreeNAS - boot XBMC, and for the first few minutes the NFS connections will fail with bogus errors.

sahlberg commented 11 years ago

I would need a network trace for this.

MilhouseVH commented 11 years ago

I attached some traces in the linked xbmc bug report (see first comment) - one or two were too large so are stored on my personal webspace. Let me know if any of those captures are of any value to you, if not I'll endeavour to capture more.

sahlberg commented 11 years ago

Try compiling the latest version of xbmc a couple of issue shave been fixed recently that could be related. At the same time, try compiling and using the ZDR branch of libnfs. This branch is intended to replace master and become the next release of libnfs.

On Tue, Nov 20, 2012 at 7:50 PM, MilhouseVH notifications@github.comwrote:

I attached some traces in the linked xbmc bug report (see first comment) - one or two were too large so are stored on my persona

MilhouseVH commented 11 years ago

Thanks. I'm currently testing OpenELEC 3.0 Beta 1, not sure which build of libnfs this has, will see how it goes. I'm not really setup to compile for ARM, hopefully it won't come to that!

sahlberg commented 11 years ago

Another thing you could try.

XBMC uses a completely new context for each file it accesses, so there are a few things that could cause sporadic unconnectivities.

1, can you try making sure you access the nfs server by ip address and not by dns name? Since each file open, each directory listing etc, each will initialize a new connection and need to use resolve the host into an ip, sporadict dns outages would cause things that wlould look like this.

2, Secondly, since here we would end up with a lot of TCP connections being opened and then closed all going to the same destination, you might end up with a huge amount of ephemeral ports in TIME_WAIT. I.e. since each file open/list dir etc represents a new tcp connection to the target (or several) I think you might chew through the ephemeral range of ports quite quickly. At which stage you might not be able to originate any more outgoing tcp connections until some old conenctions have expired from the TIME-WAIT state. (unless you run as root)

Check what the range used for ephemeral ports are and see what netstat -tapn looks like whn this happens. Check what the ephemeral port range is and increase it if possible and see if it makes a difference.

If when this happens there is nothing interesting on the wire you might need to start adding printfs and logging to see what/where it fails.

On Tue, Nov 27, 2012 at 12:06 PM, MilhouseVH notifications@github.comwrote:

Thanks. I'm currently testing OpenELEC 3.0 Beta 1, not sure which build of libnfs this has, will see how it goes. I'm not really setup to compile for ARM, hopefully it won't come to that!

— Reply to this email directly or view it on GitHubhttps://github.com/sahlberg/libnfs/issues/25#issuecomment-10774481.

MilhouseVH commented 11 years ago

I'm still getting NFS issues with FreeNAS 8.3.0-RELEASE and Raspberry Pi/OpenELEC/XBMC 12.

This is despite booting the Raspberry Pi over NFS, but within XBMC there will be errors while accessing NFS files.

When booting the latest OpenELEC build (as of now, built from git) and scrolling through the main menu (Movies -> TV Shows -> Music, each of which has "Recently added" active) there will often (but not always) be NFS errors appear in the log, and some media (but not all) will be unplayable at the first and subsequent attempts, but will play later on.

The example that follows are the NFS entries from the debug log when scrolling from Movies to TV Shows to Music, with only errors appearing for TV Show content on this occasion:

01:33:09 T:2820666464   DEBUG: SECTION:LoadDLL(libnfs.so.1)
01:33:09 T:2820666464   DEBUG: Loading: libnfs.so.1
01:33:11 T:2820666464   DEBUG: NFS: Context for 192.168.0.3/mnt/share not open - get a new context.
01:33:13 T:2820666464   DEBUG: NFS: Connected to server 192.168.0.3 and export /mnt/share
01:33:13 T:2820666464   DEBUG: NFS: chunks: r/w 32768/32768
01:33:13 T:2820666464   ERROR: NFS: Failed to stat(mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOTDIR(-20)"
01:33:13 T:2820666464   DEBUG: GetImageHash - unable to stat url nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.tbn
01:33:13 T:2944951392   ERROR: NFS: Failed to stat(mnt/share/media/Video/TV Serials/The Mentalist/Season 5/The Mentalist S05E15.tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOTDIR(-20)"
01:33:13 T:2944951392   DEBUG: GetImageHash - unable to stat url nfs://192.168.0.3/mnt/share/media/Video/TV Serials/The Mentalist/Season 5/The Mentalist S05E15.tbn
01:33:13 T:2944951392   ERROR: NFS: Failed to stat(mnt/share/media/Video/TV Serials/CSI Crime Scene Investigation/Season 13/CSI Crime Scene Investigation S13E16.tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOTDIR(-20)"
01:33:13 T:2944951392   DEBUG: GetImageHash - unable to stat url nfs://192.168.0.3/mnt/share/media/Video/TV Serials/CSI Crime Scene Investigation/Season 13/CSI Crime Scene Investigation S13E16.tbn
...
01:34:46 T:3043766272  NOTICE: NFS is idle. Closing the remaining connections.
01:35:17 T:3043766272   DEBUG: SECTION:UnloadDelayed(DLL: libnfs.so.1)
01:35:17 T:3043766272   DEBUG: Unloading: libnfs.so.1

First attempt to play NCIS S10E17 from the Recently Added list:

01:39:35 T:3043766272   DEBUG: ExecuteXBMCAction : To PlayMedia("nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4")
01:39:35 T:3043766272   DEBUG: CPlayerCoreFactory::GetPlayers(nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4)
01:39:35 T:3043766272  NOTICE: COMXPlayer: Opening: nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4
01:39:35 T:2840589408   DEBUG: SECTION:LoadDLL(libnfs.so.1)
01:39:35 T:2840589408   DEBUG: Loading: libnfs.so.1
01:39:36 T:2840589408   DEBUG: NFS: Context for 192.168.0.3/mnt/share not open - get a new context.
01:39:37 T:2840589408   DEBUG: NFS: Connected to server 192.168.0.3 and export /mnt/share
01:39:37 T:2840589408   DEBUG: NFS: chunks: r/w 32768/32768
01:39:37 T:2840589408    INFO: CNFSFile::Open: Unable to open file : 'mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4'  error : 'open call failed with "NFS: Lookup of //media failed with NFS3ERR_NOTDIR(-20)"'
01:39:37 T:2840589408   ERROR: COMXPlayer::OpenInputStream - error opening [nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4]
01:39:37 T:3043766272   ERROR: XBMC.PlayMedia could not play media: nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4

Play something else, another TV show which does work successfully, then try playing NCIS S10E17 again (which also plays fine, despite not doing so previously):

01:41:55 T:3043766272   DEBUG: ExecuteXBMCAction : To PlayMedia("nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4")
01:41:55 T:3043766272   DEBUG: CPlayerCoreFactory::GetPlayers(nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4)
01:41:55 T:3043766272  NOTICE: COMXPlayer: Opening: nfs://192.168.0.3/mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4
01:41:55 T:2840589408   DEBUG: CNFSFile::Open - opened mnt/share/media/Video/TV Serials/NCIS/Season 10/NCIS S10E17.mp4

Please advise what information you need to resolve this ongoing issue - I can try and obtain a network trace if it will be helpful.

MilhouseVH commented 11 years ago

This is the result of running netstat -tapn on the Raspberry Pi at the time when it couldn't access NFS files (in this case, it was a movie):

root@rpi512 ~ # netstat -tapn && sync
netstat: showing only processes with your user ID
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:9090            0.0.0.0:*               LISTEN      1131/xbmc.bin
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1131/xbmc.bin
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1094/sshd
tcp        0      0 0.0.0.0:36667           0.0.0.0:*               LISTEN      1131/xbmc.bin
tcp        0      0 0.0.0.0:445             0.0.0.0:*               LISTEN      1321/smbd
tcp        0      0 192.168.0.8:708         192.168.0.3:881         TIME_WAIT   -
tcp        0      0 192.168.0.8:585         192.168.0.3:111         TIME_WAIT   -
tcp        0      0 192.168.0.8:56051       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:56046       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:56049       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:56057       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:59054       192.168.0.200:111       TIME_WAIT   -
tcp        0      0 192.168.0.8:56054       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:56050       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:56053       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:56045       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:56052       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:51010       192.168.0.200:111       TIME_WAIT   -
tcp        0      0 192.168.0.8:56048       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:659         192.168.0.3:881         TIME_WAIT   -
tcp        0      0 192.168.0.8:56055       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:703         192.168.0.3:111         TIME_WAIT   -
tcp        0      0 192.168.0.8:717         192.168.0.3:111         TIME_WAIT   -
tcp        0      0 192.168.0.8:718         192.168.0.3:881         TIME_WAIT   -
tcp        0      0 192.168.0.8:56047       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 192.168.0.8:716         192.168.0.3:2049        ESTABLISHED 1131/xbmc.bin
tcp        0      0 192.168.0.8:56056       192.168.0.201:3306      TIME_WAIT   -
tcp        0      0 :::22                   :::*                    LISTEN      1094/sshd
tcp        0      0 :::36666                :::*                    LISTEN      1131/xbmc.bin
tcp        0      0 :::445                  :::*                    LISTEN      1321/smbd

192.168.0.3 is the NFS server hosting Movies/TV Shows/Music content 192.168.0.8 is the Raspberry Pi 192.168.0.201:3306 is a MySQL server 192.168.0.200 is the NFS server being used to boot the Pi (mounted as /System and /Storage on the Pi)

And the NFS errors in the debug log at the time the above information was captured:

01:58:47 T:2865755232   DEBUG: SECTION:LoadDLL(libnfs.so.1)
01:58:47 T:2865755232   DEBUG: Loading: libnfs.so.1
01:58:49 T:2865755232   DEBUG: NFS: Context for 192.168.0.3/mnt/share not open - get a new context.
01:58:50 T:2865755232   DEBUG: NFS: Connected to server 192.168.0.3 and export /mnt/share
01:58:50 T:2865755232   DEBUG: NFS: chunks: r/w 32768/32768
01:58:51 T:2865755232   DEBUG: CNFSFile::Open - opened mnt/share/media/Video/Movies/Classics/It's a Wonderful Life - Colorized (1946)[DVDRip].tbn
01:58:51 T:2865755232   ERROR: NFS: Failed to stat(mnt/share/media/Video/Movies/Classics/It's a Wonderful Life - Colorized (1946)[DVDRip].tbn) stat call failed with "NFS: Lookup of //media failed with NFS3ERR_NOTDIR(-20)"
01:58:51 T:2865755232   DEBUG: CNFSFile::Close closing file
01:58:51 T:2865755232   ERROR: COMXImage::ReadFile nfs://192.168.0.3/mnt/share/media/Video/Movies/Classics/It's a Wonderful Life - Colorized (1946)[DVDRip].tbn not found
01:58:51 T:2865755232   DEBUG: CNFSFile::Open - opened mnt/share/media/Video/Movies/Classics/It's a Wonderful Life - Colorized (1946)[DVDRip].tbn
01:58:51 T:2865755232   DEBUG: CNFSFile::Close closing file mnt/share/media/Video/Movies/Classics/It's a Wonderful Life - Colorized (1946)[DVDRip].tbn
01:58:51 T:2865755232   DEBUG: LoadFromFileInternal - Load of nfs://192.168.0.3/mnt/share/media/Video/Movies/Classics/It's a Wonderful Life - Colorized (1946)[DVDRip].tbn failed. Falling back to ImageLib
01:58:51 T:2865755232   ERROR: Texture manager unable to load file: nfs://192.168.0.3/mnt/share/media/Video/Movies/Classics/It's a Wonderful Life - Colorized (1946)[DVDRip].tbn
MilhouseVH commented 11 years ago

One thing I notice repeatedly, and this is with the very latest OpenELEC on RPi and FreeNAS 8.3.1-RELEASE, is the if a video library scan is initiated after XBMC outputs the "NFS is idle. Closing the remaining connections." message, then the library will be populated with junk items. It's as if NFS is returning bogus results after allowed to idle.

sahlberg commented 10 years ago

This sounds like weak/broken xid replay cache handling in the server. I have recently improved the xid generation using in libnfs to try to reduce the amount of false collissions in the xid replay cache in the server. (dbench suffers this from certain servers, mainly linux based, that have very weak xid replay cache collission detection.)

Please try with libnfs 1.8

MilhouseVH commented 10 years ago

Have built OpenELEC with libnfs 1.8, will report back if there are any further problems. Even with libnfs 1.6 in OpenELEC I would see problems with NFS failing to stat files and directories, most often after the connection had been closed as idle (it's as if the NFS connection is not being re-established before NFS is queried), so hopefully 1.8 will be a more reliable solution.

sahlberg commented 10 years ago

No further issues reported.