Closed ikalash closed 3 months ago
definitely something weird is going on. I can try to take a look.
Long shot: some tmp folders are mounted as a tmpfs filesystem, with limited size (for security reasons). I don't think this is the case, since df -h does not show /tmp being mounted as tmpfs. There are tmpfs mounted in /run/user/XXXXX, with different users having different XXXXX numbers, but I don't see those folders being symlinked to/from the folder where the error occurs.
Looking at the output of df -h, it seems like usage on all filesystems is quite low, so unless they cleaned up in the last few hours, the FS should have plenty of space.
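For anyone who wants to script the check described above, here is a minimal sketch in Python. It is not what was actually run on blake or CEE. The function names `find_mount`, `usage_report`, and `report` are made up for illustration, and reading `/proc/mounts` assumes a Linux node. The inode count is an extra check not discussed in this thread: "no space left on device" can also appear when a filesystem runs out of inodes while `df -h` still shows free blocks.

```python
import os
import shutil


def find_mount(resolved_path, mount_lines):
    """Return (mount_point, fs_type) for the mount that holds resolved_path.

    mount_lines follows the /proc/mounts layout: device, mount point, type, ...
    resolved_path should already be symlink-free (see os.path.realpath).
    Returns None when no entry matches.
    """
    best = None
    for line in mount_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if resolved_path == mount_point or resolved_path.startswith(prefix):
            # Longest mount point wins; on a tie the later entry shadows the earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def usage_report(path):
    """Summarize block and inode usage for the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    vfs = os.statvfs(path)  # Unix only
    return {
        "free_gib": usage.free / 2**30,
        # May differ slightly from df's Use% column, which excludes reserved blocks.
        "percent_used": 100.0 * usage.used / usage.total if usage.total else 0.0,
        # Inodes available to unprivileged users; some filesystems report 0 here.
        "free_inodes": vfs.f_favail,
    }


def report(path, mounts_file="/proc/mounts"):
    """Resolve symlinks in path, then return its mount entry and usage summary."""
    resolved = os.path.realpath(path)
    with open(mounts_file) as fh:  # assumption: Linux-style mounts file
        mount = find_mount(resolved, fh)
    return resolved, mount, usage_report(resolved)
```

Calling `report("/tmp")`, or `report` on the directory where the error occurs, on the affected node should show whether that path resolves onto a tmpfs mount and how much room is left there.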
I was looking at the same thing. Not sure how to proceed other than to try to reproduce the error.
I wasn't able to reproduce those errors on my home directory (trilinos-gcc builds fine and trilinos-intel configures fine). Maybe we can give it another day.
I run the gcc/intel builds in parallel (two separate jobs) so it might be that the two builds together are using too much tmp memory. I can try to stagger them tomorrow if we still see the issue.
It might be best to open an issue with blake-help. They may have some idea about what may be causing this kind of error.
blake looks clean today. As for the CEE builds, it looks like /tmp is full. Maybe CEE can purge? (I sent a ticket.)
Just FYI - I moved the cee-compute003 build to cee-compute030. It seems the problem is gone there. I've been waiting to do the move for a while, but the node was not configured to be able to push to the CDash site. The sysadmins finally fixed it this week. It seems the out of space issue is still showing up, now on cee-compute005.
Something weird is going on with compute004. tmp is full again (with old files) but I could have sworn that I verified last week that they cleaned it.
I think the tmp folder is routinely cleaned by the OS. I wonder if changes in trilinos (or compilers/tpl-packages, if we had any recent update) caused more temp files to be generated? Seems unlikely though...
@ikalash based on the ticket recommendation, I think if you set TMPDIR in your environment to something like /fgs/ikalash/tmp, you should have enough memory to build. You might want to also purge the tmp directory in your scripts after the build finishes.
@jewatkins yes, I will do this now. So far it's only needed on cee-compute004 and cee-compute005, correct?
yes, that's what it looks like.
I did it for all CEE builds. It appears /fgs is shared b/w nodes so I am not sure if there will be conflict issues with multiple nodes writing data there, but we will see.
I think it should be okay. each build should create a unique dir within tmp
although you might not want to delete until all builds are finished... so maybe purge the tmp at the start of your scripts.
@jewatkins : good idea. I have done this.
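The workaround discussed above (point TMPDIR at a roomier location, purge at the start of the script, give each build its own directory) could be sketched as follows. This is not the actual nightly script. The helper name `prepare_build_tmp`, the `build-` prefix, and the age cutoff are assumptions for illustration. The cutoff is there so that a purge on one node is less likely to delete the directory of a build still running on another node that shares the same base.

```python
import os
import shutil
import tempfile
import time


def prepare_build_tmp(base, max_age_hours=24.0):
    """Purge stale entries under base, then create and export a unique TMPDIR.

    max_age_hours is an assumed safety margin; it should exceed the longest
    build, since a directory's mtime does not track writes to nested files.
    """
    os.makedirs(base, exist_ok=True)
    cutoff = time.time() - max_age_hours * 3600

    # Purge at the start of the script, skipping anything touched recently.
    for entry in os.scandir(base):
        try:
            if entry.stat(follow_symlinks=False).st_mtime >= cutoff:
                continue
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path, ignore_errors=True)
            else:
                os.remove(entry.path)
        except OSError:
            pass  # another node may have removed the entry first

    # Unique per-build directory, so concurrent builds do not collide.
    build_tmp = tempfile.mkdtemp(prefix="build-", dir=base)

    # Child processes (compilers, CMake, ...) inherit the environment variable.
    os.environ["TMPDIR"] = build_tmp
    # Make Python's own tempfile module re-read TMPDIR on next use.
    tempfile.tempdir = None
    return build_tmp
```

A script would call something like `prepare_build_tmp("/fgs/ikalash/tmp")` before launching the configure and build steps, so that everything started afterwards writes its temporary files under the new directory.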
Issue has been fixed - closing.
It looks like the blake builds broke while I was on travel last week, along with one of the CEE builds: https://sems-cdash-son.sandia.gov/cdash//index.php?project=Albany . It would be helpful if someone could look into what is going on with blake. It appears some of the compilers are broken, maybe b/c the modules changed? It is strange that the "no space left on device" errors on cee-compute003 and blake are exactly the same. I can try to move the CEE build to another CEE compute node. I suspect that if there really is a blake space issue, it is due not to us but to other users of the /projects space, but I am not sure.
@jewatkins @mcarlson801