openhpc / ohpc

OpenHPC Integration, Packaging, and Test Repo
http://openhpc.community
Apache License 2.0

module try-add ohpc hangs, preventing login #1150

Open dhilst opened 4 years ago

dhilst commented 4 years ago

/etc/profile.d/lmod.sh contains the line module try-add ohpc. When the shell executes this line, it hangs.

If I hit Ctrl+C, I get this stack trace. Any ideas?

^C/bin/lua: interrupted!
stack traceback:
    [C]: in function 'dir'
    /opt/ohpc/admin/lmod/lmod/libexec/DirTree.lua:183: in function 'walk'
    /opt/ohpc/admin/lmod/lmod/libexec/DirTree.lua:230: in function 'walk_tree'
    /opt/ohpc/admin/lmod/lmod/libexec/DirTree.lua:260: in function 'build'
    /opt/ohpc/admin/lmod/lmod/libexec/DirTree.lua:271: in function 'new'
    /opt/ohpc/admin/lmod/lmod/libexec/ModuleA.lua:606: in function '__new'
    /opt/ohpc/admin/lmod/lmod/libexec/ModuleA.lua:676: in function 'singleton'
    /opt/ohpc/admin/lmod/lmod/libexec/MName.lua:166: in function 'lazyEval'
    /opt/ohpc/admin/lmod/lmod/libexec/MName.lua:241: in function 'sn'
    /opt/ohpc/admin/lmod/lmod/libexec/Master.lua:296: in function 'load'
    /opt/ohpc/admin/lmod/lmod/libexec/MasterControl.lua:1008: in function 'load'
    /opt/ohpc/admin/lmod/lmod/libexec/MasterControl.lua:984: in function 'load_usr'
    /opt/ohpc/admin/lmod/lmod/libexec/cmdfuncs.lua:455: in function 'Load_Usr'
    /opt/ohpc/admin/lmod/lmod/libexec/cmdfuncs.lua:402: in function 'cmd'
    /opt/ohpc/admin/lmod/lmod/libexec/lmod:512: in function 'main'
    /opt/ohpc/admin/lmod/lmod/libexec/lmod:570: in main chunk

koomie commented 4 years ago

I don't believe I've encountered this before. Am I understanding correctly that if you comment out the module try-add ohpc line from the startup script, all is well?

If that's the case, does module show ohpc error out?

dhilst commented 4 years ago

Hi @koomie, thanks for the reply.

I really can't reproduce this; I only hit it sometimes. I don't know whether it happens when something goes wrong during the installation or what. I thought it was related to NFS, but it happened on the head node too, so it seems not to be.

From the stack trace, it seems to be trying to traverse a directory when it hangs. Maybe a loop in the filesystem? (Caused by a symlink; it's just a guess.)
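
One way to test the symlink guess would be a small script along these lines. It is only a minimal sketch, assuming the module directories are the colon-separated entries of MODULEPATH (or are passed as arguments); the function name find_symlink_loops is made up for illustration and is not part of Lmod. It reports directory symlinks that resolve to one of their own ancestors, which is the kind of link that could trap a recursive directory walk.

```python
import os
import sys


def find_symlink_loops(root):
    """Return (link, target) pairs for directory symlinks under root that
    resolve to one of their own ancestor directories."""
    loops = []
    root = os.path.realpath(root)
    # followlinks=False, so this walk itself cannot get trapped in a loop.
    for dirpath, dirnames, _filenames in os.walk(root, followlinks=False):
        real_dir = os.path.realpath(dirpath)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            # A loop exists when the link resolves to the directory that
            # contains it, or to any directory above that one.
            prefix = target.rstrip(os.sep) + os.sep
            if real_dir == target or real_dir.startswith(prefix):
                loops.append((path, target))
    return loops


if __name__ == "__main__":
    # Assumption: MODULEPATH is a colon-separated list of module directories.
    env_roots = [p for p in os.environ.get("MODULEPATH", "").split(":") if p]
    for root in sys.argv[1:] or env_roots:
        for link, target in find_symlink_loops(root):
            print(f"{link} -> {target}")
```

If this prints nothing, a symlink loop of this kind is probably not the cause, and a slow or unresponsive filesystem underneath one of the module directories would be the next thing to check.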