sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

broken builds #1069

Closed ikalash closed 3 months ago

ikalash commented 3 months ago

It looks like the blake builds, and one of the CEE builds, broke while I was traveling last week: https://sems-cdash-son.sandia.gov/cdash//index.php?project=Albany . It would be helpful if someone could look into what is going on with blake. It appears some of the compilers are broken, maybe because the modules changed? It is strange that the "no space left on device" error is exactly the same on cee-compute003 and blake. I can try to move the CEE build to another CEE compute node. If there really is a space issue on blake, I suspect it is caused by other users filling the /projects space rather than by us, but I am not sure.

@jewatkins @mcarlson801

jewatkins commented 3 months ago

definitely something weird is going on. I can try to take a look.

bartgol commented 3 months ago

Long shot: some tmp folders are mounted as a tmpfs filesystem, with limited size (for security reasons). I don't think this is the case, since df -h does not show /tmp being mounted as tmpfs. There are tmpfs mounted in /run/user/XXXXX, with different users having different XXXXX numbers, but I don't see those folders being symlinked to/from the folder where the error occurs.

Looking at the output of df -h, it seems like usage on all filesystems is quite low, so unless they cleaned up in the last few hours, the FS should have plenty of space.
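The check described above (is the failing directory on a tmpfs mount, and how much space is actually free there) can also be scripted so it runs on the build node at the moment the error occurs. This is only a minimal sketch, assuming a Linux node where the mount table is readable at /proc/mounts; the function names `fs_type_for` and `free_gib` are hypothetical helpers, not part of any Albany or Trilinos script.

```python
import shutil


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the deepest mount point containing path.

    mounts_text is assumed to have the /proc/mounts layout: one mount per
    line, whitespace-separated fields (device, mount point, fs type, ...).
    """
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount, fstype = fields[1], fields[2]
        # Compare with trailing slashes so "/run/user" does not match "/run/username".
        prefix = mount.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount) >= len(best_mount):
            best_mount, best_type = mount, fstype
    return best_type


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 2**30
```

On a build node, reading /proc/mounts into a string and calling `fs_type_for("/tmp", mounts)` together with `free_gib("/tmp")` from inside the nightly job would record what the job itself sees, which may differ from what an interactive `df -h` shows hours later.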

jewatkins commented 3 months ago

I was looking at the same thing. Not sure how to proceed other than to try to reproduce the error.

jewatkins commented 3 months ago

I wasn't able to reproduce those errors on my home directory (trilinos-gcc builds fine and trilinos-intel configures fine). Maybe we can give it another day.

I run the gcc/intel builds in parallel (two separate jobs), so it might be that the two builds together are using too much tmp space. I can try to stagger them tomorrow if we still see the issue.

ikalash commented 3 months ago

It might be best to open an issue with blake-help. They may have some idea about what may be causing this kind of error.

jewatkins commented 3 months ago

blake looks clean today. As for the CEE builds, it looks like /tmp is full. Maybe CEE can purge it? (I sent a ticket.)

ikalash commented 3 months ago

Just FYI - I moved the cee-compute003 build to cee-compute030, and the problem seems to be gone there. I had been waiting to make this move for a while, but the node was not configured to push to the CDash site; the sysadmins finally fixed that this week. The out-of-space issue is still showing up, though, now on cee-compute005.

jewatkins commented 3 months ago

Something weird is going on with compute004. tmp is full again (with old files) but I could have sworn that I verified last week that they cleaned it.

bartgol commented 3 months ago

I think the tmp folder is routinely cleaned by the OS. I wonder if changes in trilinos (or compilers/tpl-packages, if we had any recent update) caused more temp files to be generated? Seems unlikely though...

jewatkins commented 3 months ago

@ikalash based on the ticket recommendation, I think if you set TMPDIR in your environment to something like /fgs/ikalash/tmp, you should have enough space to build. You might also want to purge that tmp directory in your scripts after the build finishes.
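The TMPDIR suggestion above could look roughly like the following in a driver script. This is a sketch under stated assumptions, not the actual nightly script: `run_with_tmpdir` is a hypothetical helper, and it assumes the compiler and build tools honor the TMPDIR environment variable, as the ticket recommendation implies.

```python
import os
import shutil
import subprocess


def run_with_tmpdir(cmd, tmp_dir, purge_after=False, **run_kwargs):
    """Run cmd with TMPDIR pointing at tmp_dir.

    tmp_dir is created if missing. If purge_after is True, tmp_dir is
    deleted once the command finishes, whether it succeeded or not.
    Extra keyword arguments are passed through to subprocess.run.
    """
    os.makedirs(tmp_dir, exist_ok=True)
    env = dict(os.environ, TMPDIR=tmp_dir)  # copy of the environment with TMPDIR overridden
    try:
        return subprocess.run(cmd, env=env, check=True, **run_kwargs)
    finally:
        if purge_after:
            shutil.rmtree(tmp_dir, ignore_errors=True)
```

A call such as `run_with_tmpdir(["make", "-j8"], "/fgs/ikalash/tmp")` would then keep the build's temporary files off the node-local /tmp; the `make` command here is only a placeholder for whatever the build job actually runs.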

ikalash commented 3 months ago

@jewatkins yes, I will do this now. So far it's only needed on cee-compute004 and cee-compute005, correct?

jewatkins commented 3 months ago

> @jewatkins yes, I will do this now. So far it's only needed on cee-compute004 and cee-compute005, correct?

yes, that's what it looks like.

ikalash commented 3 months ago

I did it for all CEE builds. It appears /fgs is shared b/w nodes so I am not sure if there will be conflict issues with multiple nodes writing data there, but we will see.

jewatkins commented 3 months ago

I think it should be okay. Each build should create a unique dir within tmp.

jewatkins commented 3 months ago

Although you might not want to delete until all builds are finished... so maybe purge the tmp at the start of your scripts.
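One way to combine the two points above (a unique directory per build, and purging at the start rather than the end) is sketched below. The age threshold is an assumption added here, not something proposed in the thread: because the tmp area is shared between nodes, deleting only entries older than the longest expected build avoids removing files that another node's build may still be using. `purge_stale` and `start_build_tmp` are hypothetical names.

```python
import os
import shutil
import tempfile
import time


def purge_stale(tmp_root, max_age_seconds):
    """Delete top-level entries in tmp_root last modified more than max_age_seconds ago."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for entry in list(os.scandir(tmp_root)):  # snapshot first, then delete
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path, ignore_errors=True)
            else:
                os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


def start_build_tmp(tmp_root, max_age_seconds=24 * 3600, prefix="build-"):
    """Purge stale entries, then create and return a unique tmp dir for this build."""
    os.makedirs(tmp_root, exist_ok=True)
    purge_stale(tmp_root, max_age_seconds)
    return tempfile.mkdtemp(prefix=prefix, dir=tmp_root)
```

The returned path can be exported as TMPDIR for that build only. Note that a directory's modification time changes only when entries directly inside it are added or removed, so the threshold should comfortably exceed the longest build.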

ikalash commented 3 months ago

@jewatkins : good idea. I have done this.

ikalash commented 3 months ago

Issue has been fixed - closing.