kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

UXP: glusterfs vague volume mount error and true cause not being exposed to user via describe event #23982

Closed: screeley44 closed this issue 8 years ago

screeley44 commented 8 years ago

Problem: As part of an ongoing effort to improve UX, we are addressing issues in small units. The first step is to expose the correct underlying errors to users; follow-on work will then make those errors more consumable. This is also related to #22992 and #23048, which are currently being worked on in a PR.

Example: A user creates bad endpoints or an incorrect server IP for a glusterfs PV. The error exposed to the user in 'describe pod' is very vague, while the underlying error is being eaten and only exposed in the logs, making it hard for users to find and parse:

Current exposed error in describe event:

Output: Mount failed. Please check the log file for more details.

Real error being eaten in volumes.go

transport endpoint is not connected

Solution: Find where the error is being eaten (volumes.go getPodVolumes) and try to expose the error in the describe events.

Notes: As stated, after we have properly exposed the real errors so they are less vague and confusing, we should then help make the errors more consumable. Related OpenShift Origin issue: https://github.com/openshift/origin/issues/7905
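For illustration only, a minimal sketch of the general direction (this is not the actual volumes.go code; the helper name and arguments are hypothetical): instead of discarding the mount output, wrap it into the returned error so it can reach the describe events.

    package main

    import (
        "fmt"
        "os/exec"
    )

    // mountGlusterfs is a hypothetical helper showing the fix: keep the real
    // mount output (e.g. "transport endpoint is not connected") in the
    // returned error instead of eating it.
    func mountGlusterfs(source, target string, options []string) error {
        args := append([]string{"-t", "glusterfs"}, options...)
        args = append(args, source, target)
        out, err := exec.Command("mount", args...).CombinedOutput()
        if err != nil {
            // The wrapped error is what should surface in the pod's events.
            return fmt.Errorf("glusterfs: mount of %s failed: %v, output: %s", source, err, out)
        }
        return nil
    }

    func main() {
        if err := mountGlusterfs("10.1.4.100:test_vol", "/mnt/gluster", nil); err != nil {
            fmt.Println(err)
        }
    }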

vishh commented 8 years ago

cc @kubernetes/sig-storage

ghost commented 8 years ago

@screeley44 makes sense... I understand you are planning to submit PRs to fix these issues, right? If so, just go ahead and open the PR; no need to open an issue before every PR, especially for straightforward stuff.

rootfs commented 8 years ago

The exposed error (Output: Mount failed. Please check the log file for more details.) comes from mount.glusterfs.

In this case, when kubelet gets bad endpoints, a sample kubectl describe will print out the failed mount command like the following:

  1m    6s  5   {kubelet 127.0.0.1}     Warning FailedSync  Error syncing pod, skipping: glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 10.1.4.100:test_vol /var/lib/kubelet/pods/a3572759-fce8-11e5-aebe-b8ca3a62792c/volumes/kubernetes.io~glusterfs/glusterfsvol glusterfs [ro log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/glusterfsvol/glusterfs.log]
Output: Mount failed. Please check the log file for more details.

The above tells you which log file to look at, and that log gives a hint like:

[2016-04-07 17:46:23.904297] E [socket.c:2332:socket_connect_finish] 0-glusterfs: connection to 10.1.4.100:24007 failed (No route to host)

Now the question is: should kubelet parse the glusterfs log file and expose its contents?

screeley44 commented 8 years ago

@swagiaal - correct, I plan on submitting PRs. Good to know; I wasn't sure of the correct process flow between issues and PRs for future plans and work. I will move this to a PR.

screeley44 commented 8 years ago

@rootfs - interesting idea. I was thinking a bit simpler as I've been going through this stuff, but it's a similar concept to what you are bringing up. Regarding how best to make errors friendly, consumable, and even helpful to users: I was first thinking that we need to capture the best error that is produced today, and then in the kubelet, before the error is turned into an event for describe to show, massage or analyze that error to determine the best hint/advice we can give the user to fix it or find more info about it. For example, if I know I have a glusterfs volume plugin type and I can expose the truer error that linux_mount.go produces, which is "transport endpoint is not connected", I can then infer that the gluster cluster endpoints are not reachable for whatever reason (invalid IPs, no route to host, etc.) and expose a friendly tip to the user on how to resolve it. As I dig into this I will investigate and definitely think about your idea some more, as it is interesting.
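A rough sketch of that massaging step, purely for illustration (the function name, plugin strings, and hint wording are made up, not existing kubelet code):

    package main

    import (
        "fmt"
        "strings"
    )

    // hintFor maps a volume plugin type plus the raw error text to a
    // friendlier tip; the cases below are illustrative guesses only.
    func hintFor(plugin string, raw error) string {
        msg := raw.Error()
        switch {
        case plugin == "glusterfs" && strings.Contains(msg, "transport endpoint is not connected"):
            return "hint: the gluster endpoints look unreachable; check the endpoints object, server IPs, and routes"
        case plugin == "nfs" && strings.Contains(msg, "Connection timed out"):
            return "hint: check that the NFS server is reachable and firewall ports are open"
        default:
            return ""
        }
    }

    func main() {
        raw := fmt.Errorf("transport endpoint is not connected")
        fmt.Println(raw, "-", hintFor("glusterfs", raw))
    }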

rootfs commented 8 years ago

The risk with parsing glusterfs log is that we are tying kubelet to certain glusterfs release and maintain consistent reporting. The glusterfs log messages could change and kubelet may not always be able to parse correctly.

screeley44 commented 8 years ago

@rootfs - not sure whether this would fly, but I was thinking of creating an updatable resource (similar to scc yamls) that, with proper permission (like admin), would hold a mapping; as an error is received in the kubelet, we match on that error and the volume plugin type and produce a meaningful, even customized, error/hint that is useful for the user and can be very specific to the environment/enterprise where it is implemented. Another idea is to simply massage the error that comes into the kubelet with another function that does some best guessing based on the error msg and volume type before it spits it out to the event infrastructure. (A rough sketch follows the examples below.)

i.e.

from describe event

  57s   57s 1   {kubelet k8dev.rhs}     Warning FailedSync  Error syncing pod, skipping: Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /home/screeley/data/pods/88e757f6-dfd2-11e5-a7b5-52540092b5fb/volumes/kubernetes.io~nfs/nfsvol nfs []
Output: mount.nfs: Connection timed out

resolution hint: Check and make sure the NFS server is reachable and firewall ports are open (2049 for v4; 2049, 20048, and 111 for v3).
telnet <nfs server> <port> is useful to try.

or

 10s        10s     1   {kubelet k8dev.rhs}         Warning     FailedMount Unable to mount volumes for pod "nfs-bb-pod1_default(3c3710f6-dfd3-11e5-a7b5-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /home/screeley/data/pods/3c3710f6-dfd3-11e5-a7b5-52540092b5fb/volumes/kubernetes.io~nfs/nfsvol nfs []
Output: mount.nfs: access denied by server while mounting nfs1.rhs:/opt/data12

resolution hint: Check the NFS server exports (/etc/exports); it is likely that the host/node was not added. Rerun exportfs -ra on the NFS server after updating.
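A rough sketch of the updatable-mapping variant mentioned above (the rule contents are invented for illustration; in practice they could be loaded from an admin-editable resource rather than hardcoded):

    package main

    import (
        "fmt"
        "strings"
    )

    // hintRule ties a volume plugin and an error substring to the advice to
    // show; an admin-editable resource could supply these entries.
    type hintRule struct {
        Plugin   string
        Contains string
        Hint     string
    }

    var rules = []hintRule{
        {"nfs", "Connection timed out", "Check that the NFS server is reachable and firewall ports are open (2049 for v4; 2049, 20048, and 111 for v3)."},
        {"nfs", "access denied by server", "Check the NFS server exports (/etc/exports); the host/node was likely not added. Rerun exportfs -ra after updating."},
    }

    // resolveHint returns the first matching hint for an error from a plugin.
    func resolveHint(plugin, rawErr string) string {
        for _, r := range rules {
            if r.Plugin == plugin && strings.Contains(rawErr, r.Contains) {
                return r.Hint
            }
        }
        return ""
    }

    func main() {
        fmt.Println(resolveHint("nfs", "mount.nfs: Connection timed out"))
    }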

rootfs commented 8 years ago

The hint idea is interesting. Do we build the diagnostic knowledge database in the kubelet or somewhere else?

thockin commented 8 years ago

I didn't quite follow everything here, but a couple thoughts.

Parsing logs is sketchy and doomed to fail eventually. Last resort.

The logic for this error message generation must live entirely within volume plugins. If that means we need to change the way the API works to make that happen, so be it (as long as it is reasonable).
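Just to illustrate that direction (this is not the actual volume plugin API; the interface and method names are hypothetical), keeping message generation inside each plugin could look roughly like this:

    package main

    import "fmt"

    // FriendlyErrorer is a hypothetical optional interface a volume plugin
    // could implement so error-message generation stays inside the plugin.
    type FriendlyErrorer interface {
        FriendlyError(raw error) string
    }

    type glusterfsPlugin struct{}

    func (glusterfsPlugin) FriendlyError(raw error) string {
        return fmt.Sprintf("glusterfs mount failed: %v (check that the endpoints object lists reachable gluster servers)", raw)
    }

    func main() {
        var plugin interface{} = glusterfsPlugin{}
        raw := fmt.Errorf("transport endpoint is not connected")
        // The kubelet side would only do a type assertion, leaving the
        // wording entirely to the plugin.
        if fe, ok := plugin.(FriendlyErrorer); ok {
            fmt.Println(fe.FriendlyError(raw))
        }
    }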


pmorie commented 8 years ago

@screeley44 and I spent some time talking about this one today. I suggested that he try slurping the last 10 lines or 1k out of the log file and tacking it onto the error message returned by mount.
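A minimal sketch of that suggestion, assuming a hypothetical tailLog helper and an illustrative log path: read roughly the last 1k of the glusterfs log and tack it onto the mount error.

    package main

    import (
        "fmt"
        "io"
        "os"
    )

    // tailLog returns up to the last maxBytes of the given log file, so the
    // real failure from the log can be appended to the event message.
    func tailLog(path string, maxBytes int64) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()

        info, err := f.Stat()
        if err != nil {
            return "", err
        }
        offset := info.Size() - maxBytes
        if offset < 0 {
            offset = 0
        }
        if _, err := f.Seek(offset, io.SeekStart); err != nil {
            return "", err
        }
        data, err := io.ReadAll(f)
        if err != nil {
            return "", err
        }
        return string(data), nil
    }

    func main() {
        // Path mirrors the log-file argument shown earlier; adjust as needed.
        tail, err := tailLog("/var/lib/kubelet/plugins/kubernetes.io/glusterfs/glusterfsvol/glusterfs.log", 1024)
        if err != nil {
            fmt.Println("could not read glusterfs log:", err)
            return
        }
        fmt.Printf("Mount failed; last log lines:\n%s\n", tail)
    }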

jeffvance commented 8 years ago

@matchstick I'd like to reassign this to @screeley44 but don't have permissions.

screeley44 commented 8 years ago

I think this can be closed; it is no longer relevant after the refactoring changes.

screeley44 commented 8 years ago

actually, let's reference #26786

jeffvance commented 8 years ago

@saad-ali Can you close this issue? According to @screeley44 it has been addressed in refactoring work.