timja / jenkins-gh-issues-poc-06-18


[JENKINS-26238] Files are deleted, seemingly randomly, outside the jobs workspace #5752

Open timja opened 9 years ago

timja commented 9 years ago

1. We discovered that files outside the workspace folders on the Linux Jenkins master were being deleted randomly.
2. We caught Jenkins red-handed using a low-level Linux kernel logging feature. I don't recall precisely which one ('strace'?).
3. We disabled workspace cleanup.
4. The server ran like a charm for almost a year.
5. Workspace cleanup was re-enabled by accident.
6. Files outside the jobs' workspace folders started to be deleted randomly again.
7. We remembered the symptoms and disabled workspace cleanup.
8. The server runs like a charm again.

Disabling the workspace cleanup was done by setting -Dhudson.model.WorkspaceCleanupThread.disabled=true
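
For readers unfamiliar with this kind of switch, here is a minimal sketch of how a boolean JVM system property like this one can be read from Java. It is an illustration only, not the Jenkins source: the class `CleanupFlagDemo` and the method `isCleanupDisabled` are hypothetical names, and only the property name is taken from the report.

```java
// Hypothetical demo class; only the property name comes from the report above.
class CleanupFlagDemo {
    static final String PROPERTY = "hudson.model.WorkspaceCleanupThread.disabled";

    // Boolean.getBoolean returns true only when the named system property
    // exists and equals "true" (case-insensitive); otherwise it returns false.
    static boolean isCleanupDisabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println("cleanup disabled: " + isCleanupDisabled());
    }
}
```

Started with `-Dhudson.model.WorkspaceCleanupThread.disabled=true` on the JVM command line, `isCleanupDisabled()` returns true; without the flag it returns false.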


Originally reported by oharboe, imported from: Files are deleted, seemingly randomly, outside the jobs workspace
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 9 years ago

danielbeck:

Well, this is weird. This report includes the version of Jenkins you're using (1.546), and the WorkspaceCleanupThread got a major rewrite in 1.551. So my advice would be to upgrade to a recent weekly (1.59x) or LTS release (1.580.x).

Still, here are a few things to investigate or clarify on your current system:


If the symlink detection fails, the cleanup may traverse symlinks to directories when recursively deleting workspaces. Do/did you have any symbolic links to any of the affected files from their workspaces? What JRE (vendor + version) is Jenkins running on?
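
To illustrate the failure mode being asked about, here is a minimal sketch of a recursive delete that does not follow symlinks. It assumes Java 7 or later (`java.nio.file`) and is not the Jenkins implementation; `SafeDeleteDemo` and `deleteRecursive` are hypothetical names. `Files.walkFileTree` does not follow symbolic links unless asked to, so a link to a directory is handed to `visitFile` and only the link itself is removed.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical demo class, not taken from Jenkins.
class SafeDeleteDemo {
    // Deletes root and everything below it without descending into symlinked directories.
    static void deleteRecursive(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                // For a symbolic link this removes the link, never its target.
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) throw exc;
                Files.delete(dir); // directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        // Throwaway layout: a "workspace" containing a symlink to a directory outside it.
        Path base = Files.createTempDirectory("symlink-demo");
        Path outside = Files.createDirectory(base.resolve("outside"));
        Path kept = Files.createFile(outside.resolve("config"));
        Path workspace = Files.createDirectory(base.resolve("workspace"));
        Files.createFile(workspace.resolve("build.log"));
        Files.createSymbolicLink(workspace.resolve("link"), outside);

        deleteRecursive(workspace);
        System.out.println("workspace exists: " + Files.exists(workspace));
        System.out.println("outside file exists: " + Files.exists(kept));
    }
}
```

A delete routine that instead treats the link as an ordinary directory and recurses into it would remove `config` as well, which is the kind of out-of-workspace deletion described in this report. Creating symbolic links may require extra privileges on some platforms.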


How is the option Workspace Root Directory configured? Manage Jenkins » Configure System » Advanced (near the top).

What is the output of the following script in Script Console:

jenkins.slaves.WorkspaceLocator.all()

Run the following script in Script Console:

Jenkins.instance.getAllItems(AbstractProject).each { println it.customWorkspace + ' ' + it.fullName }
return

Anything having a custom workspace (first part of output is not null)? If so, are any of these paths pointing to a folder that also exists on master, irrespective of whether the job ever runs there?
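
The overlap question can also be checked mechanically. The following is a minimal sketch, assuming a relative custom workspace is resolved against a base directory; that assumption, the class `WorkspacePathDemo`, its methods, and the sample values are all hypothetical and only serve to show the containment test.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; base directory and sample values are illustrative only.
class WorkspacePathDemo {
    // Resolves a possibly relative custom workspace setting against a base directory.
    static Path resolve(Path base, String customWorkspace) {
        return base.resolve(customWorkspace).normalize();
    }

    // True if candidate is root itself or lies beneath it, compared by path elements.
    static boolean isUnder(Path root, Path candidate) {
        return candidate.normalize().startsWith(root.normalize());
    }

    public static void main(String[] args) {
        Path home = Paths.get("/jenkins");            // assumed base directory
        Path workspaces = home.resolve("workspace");   // assumed workspace root
        String[] samples = {"workspace/foo", "bar", "workspace/../jobs/admin"};
        for (String sample : samples) {
            Path resolved = resolve(home, sample);
            System.out.println(resolved + " under " + workspaces + ": " + isUnder(workspaces, resolved));
        }
    }
}
```

Normalizing before the comparison matters: a value such as `workspace/../jobs/admin` looks like it sits under the workspace root but resolves outside it.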


Could you clarify the following:

> files outside of the workspace folders on the linux jenkins master being deleted randomly

In the PR discussion, you mentioned .git/config being deleted from a directory inside the job directory but outside the workspace. Were other files also affected? Other projects? If so, do these projects have something in common?


Do you still have the old logs (archived "jenkins.log" will do) from the time Jenkins deleted files? Could you provide them (after changing all project names, host names, etc.)?

timja commented 9 years ago

oharboe:

1. We shouldn't be using any symlinks, no.

2. JRE:

java.runtime.name OpenJDK Runtime Environment
java.runtime.version 1.6.0_24-b24

3. How is the option Workspace Root Directory configured? Manage Jenkins » Configure System » Advanced (near the top). =>

${ITEM_ROOTDIR}/workspace

4. jenkins.slaves.WorkspaceLocator.all() (when run on the master, running on slaves gives NPE) =>

Result: []

5. On the master:

Jenkins.instance.getAllItems(AbstractProject).each { println it.customWorkspace + ' ' + it.fullName }
return

=>

...
"workspace/disk_usage disk_usage" => 'disk_usage' does not exist on master under jenkins/, but workspace/disk_usage does exist
"gerritmirror gerritmirror" => this exists under jenkins/gerritmirror on the master but is empty. It also exists as jenkins/workspace/gerritmirror
"workspace/stability_ck release_stability_test" => I can't find this folder other than under jobs/release_stability_test.

6. "In the PR discussion, you mentioned .git/config being deleted from a directory inside the job directory but outside the workspace. Were other files also affected? Other projects? If so, do these projects have something in common?".

We couldn't really find anything that the deleted files had in common other than that it was in "jobs/admin", but not under the "jobs/admin/workspace" folder. The ".git/config" file is just an example of where we were able to trace back the error message to a file being deleted.

7. W.r.t. a log of an incident, the cribs below are the best I can do. What's interesting is that /var/log/jenkins/jenkins.log shows that the workspace clean-up was started around the time that the "release_stability_test" job failed:

INFO: Started Workspace clean-up
INFO: Workspace '/jenkins/workspace/gerritmirror' is being deleted; flushing workspace to revision 0.


timja commented 9 years ago

danielbeck:

> java.runtime.name OpenJDK Runtime Environment
> java.runtime.version 1.6.0_24-b24

I'd try to change to 1.7, preferably Oracle, ASAP: 1.7 because Jenkins has much nicer implementations of some features when running on 1.7, and Oracle because there are several known issues with other implementations (AFAIK none that would be relevant here, but still), and CloudBees says Oracle is the only one supported when you pay them for support.

> INFO: Started Workspace clean-up
> INFO: Workspace '/jenkins/workspace/gerritmirror' is being deleted; flushing workspace to revision 0.

Please provide a complete list of plugins (+versions) you're using. This is not a core message. What SCM is configured on the 'gerritmirror' project? What's the output of WorkspaceListener.all() in Manage Jenkins » Script Console?

> We couldn't really find anything that the deleted files had in common other than that it was in "jobs/admin", but not under the "jobs/admin/workspace" folder. The ".git/config" file is just an example of where we were able to trace back the error message to a file being deleted.

Is "admin" a job, or a folder? What kind of job is it?


Please confirm: there are no hidden files in folders you declare as being empty?

timja commented 9 years ago

oharboe:

1. I'd try to change to 1.7, preferably Oracle, ASAP.

We're migrating to a new server. The above is from the current production server that we're phasing out.

2. WorkspaceListener.all() => on the master:

Result: [org.jenkinsci.plugins.envinject.EnvInjectListener$JobSetupEnvironmentWorkspaceListener@382b237c]

3. What SCM is configured on the 'gerritmirror' project?

None. It runs some scripts that are copied to the slaves using the "Copy files into the job's workspace before building" plugin. The slaves then clone and update some git mirrors from a Gerrit server.

timja commented 9 years ago

oharboe:

4. Please provide a complete list of plugins (+versions) you're using.

analysis-core 1.54 true false
ant 1.2 false false
copy-to-slave 1.4.3 true false
credentials 1.10 true true
cvs 2.11 false false
elastic-axis 1.1 true false
email-ext 2.36 true false
envinject 1.89 true false
external-monitor-job 1.2 false false
gerrit-trigger 2.12.0 true false
git 2.0.3 true false
git-client 1.6.3 true false
git-parameter 0.2 true false
htmlpublisher 1.3 true false
javadoc 1.1 false false
ldap 1.4 true true
log-parser 1.0.8 true false
mailer 1.6 true false
matrix-auth 1.1 true false
matrixtieparent 1.2 true false
maven-plugin 2.1 false false
pam-auth 1.1 true false
parameterized-trigger 2.21 true false
perforce 1.3.26 true false
preSCMbuildstep 0.2 true false
python 1.2 true false
scm-api 0.2 true false
slave-squatter 1.2 true false
ssh-agent 1.4.1 true false
ssh-credentials 1.6.1 true true
ssh-slaves 1.5 true false
subversion 1.54 false false
token-macro 1.10 true false
translation 1.10 false false
warnings 4.37 true false
xunit 1.81 true false

5. Is "admin" a job, or a folder? What kind of job is it?

It's a job that runs some scripts to sync back and forth between Gerrit and Perforce. The scripts are stored under jenkins/jobs/admin/.sh. It syncs from Perforce using Perforce command line options. It does use the Jenkins P4 plugin, but that is purely historical, I think. We might be able to switch it to not use the Jenkins Perforce source control plugin.

6. Please confirm: there are no hidden files in folders you declare as being empty?

jenkins/gerritmirror is empty =>

-bash-4.1$ ls -la gerritmirror/
total 8
drwxr-xr-x 2 nosvg-jenkins odin 4096 Nov 25 11:31 .
drwxr-xr-x 17 nosvg-jenkins odin 4096 Dec 29 12:03 ..

timja commented 9 years ago

danielbeck:

> INFO: Started Workspace clean-up
> INFO: Workspace '/jenkins/workspace/gerritmirror' is being deleted; flushing workspace to revision 0.

For each of these lines, please provide the line just before it. It should contain a date and a logger name.
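
Pulling out the preceding line for every matching message can be scripted. Below is a minimal sketch, assuming Java 7 or later, that prints each line containing a search string together with the line just before it (similar in spirit to `grep -B1`). The class `PrecedingLineDemo` and the method `withPrecedingLine` are hypothetical names; the default log path is the one mentioned earlier in this thread, and the file is assumed to be readable as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for extracting log context; not part of Jenkins.
class PrecedingLineDemo {
    // For every line containing needle, returns the previous line (if any) followed by the match.
    static List<String> withPrecedingLine(List<String> lines, String needle) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(needle)) {
                if (i > 0) {
                    out.add(lines.get(i - 1)); // header line with date and logger name
                }
                out.add(lines.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/var/log/jenkins/jenkins.log";
        String needle = args.length > 1 ? args[1] : "Workspace clean-up";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        for (String line : withPrecedingLine(lines, needle)) {
            System.out.println(line);
        }
    }
}
```

The first argument overrides the log path and the second the search string, so the same sketch can be pointed at the "is being deleted" message as well.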

timja commented 9 years ago

oharboe:

The gerritmirror job has since had the P4 Plugin settings erased so that the random deletion happens even with "SCM: None" for the gerritmirror job.

NB! It was files inside jobs/admin/ we saw disappearing, not files inside the gerritmirror job.

There may have been cases where maintenance staff just wiped the slaves that failed the gerritmirror job because the data in the workspace was corrupted (partial deletion), but I can't be sure.

INFO: Started Workspace clean-up
May 20, 2014 11:03:45 PM hudson.plugins.perforce.PerforceSCM processWorkspaceBeforeDeletion
INFO: Workspace '/jenkins/workspace/gerritmirror' is being deleted; flushing workspace to revision 0.
May 20, 2014 11:03:45 PM hudson.plugins.perforce.PerforceSCM processWorkspaceBeforeDeletion
INFO: Using remote perforce client: jenkins_pkgs-1650401676
May 20, 2014 11:03:45 PM hudson.plugins.perforce.PerforceSCM processWorkspaceBeforeDeletion
INFO: [gerritmirror] $ /usr/local/bin/p4 -P 9BBF567EA44CBD7DA716B93452D05E1E workspace -o jenkins_pkgs-1650401676
May 20, 2014 11:03:45 PM hudson.plugins.perforce.PerforceSCM processWorkspaceBeforeDeletion
SEVERE: null
com.tek42.perforce.PerforceException: Connect to server failed; check $P4PORT
at com.tek42.perforce.parse.AbstractPerforceTemplate.getPerforceResponse(AbstractPerforceTemplate.java:406)
at com.tek42.perforce.parse.AbstractPerforceTemplate.getPerforceResponse(AbstractPerforceTemplate.java:301)
at com.tek42.perforce.parse.Workspaces.getWorkspace(Workspaces.java:61)
at hudson.plugins.perforce.PerforceSCM.getPerforceWorkspace(PerforceSCM.java:1545)
at hudson.plugins.perforce.PerforceSCM.processWorkspaceBeforeDeletion(PerforceSCM.java:3099)
at hudson.model.WorkspaceCleanupThread.shouldBeDeleted(WorkspaceCleanupThread.java:125)
at hudson.model.WorkspaceCleanupThread.process(WorkspaceCleanupThread.java:145)
at hudson.model.WorkspaceCleanupThread.execute(WorkspaceCleanupThread.java:72)
at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:53)
at java.lang.Thread.run(Thread.java:679)
May 20, 2014 11:07:22 PM hudson.slaves.SlaveComputer tryReconnect

timja commented 9 years ago

danielbeck:

I assume you'd have noticed Perforce deleting those files, even if it's a child process of Jenkins, so that's out.

I'm out of ideas. I think it would be interesting if you upgraded Jenkins and enabled the cleanup thread again, but I'd understand if you didn't want to do that on the production instance (although I suggest upgrading to a recent Jenkins LTS release anyway).

Any chance you can retain your old box for a bit for some experiments once the move you mention is complete?

timja commented 9 years ago

oharboe:

> Any chance you can retain your old box for a bit for some experiments once the move you mention is complete?

Yes. It's going to be a while though. I'm hoping that will be in the February time-frame.

timja commented 9 years ago

danielbeck:

Editing to test a Jira problem, please ignore