Is your feature request related to a problem? Please describe.
As currently configured, multi-node glideins, such as the ones HEPCloud runs at NERSC and TACC Frontera, have the problem that the _condor_stdout and _condor_stderr of all their nodes get written on top of each other in a single file. This makes debugging difficult, considering that we go up to as many as 100 nodes in a single glidein.
Describe the solution you'd like
We would like _condor_stdout and _condor_stderr to be tagged individually by each node. In such a configuration, the condor output transfer would pull multiple such files back to the factory rather than just one.
Describe alternatives you've considered
This will be somewhat tricky, as different multi-node glidein setups are used on each host. The NERSC SLURM setup executes an srun command on each of the 100 nodes. TACC Frontera, by contrast, starts on one node and uses TACC's built-in launching mechanism to run glidein_startup.sh on each of the 28 nodes in our jobs there.
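To illustrate the kind of tagging we have in mind, here is a rough sketch, not a proposed implementation. The helper names `node_tag` and `run_tagged` are hypothetical, as is the `<file>.<tag>` naming scheme. It assumes that `SLURM_NODEID` is available in the environment of each task launched by srun, and it falls back to the short hostname elsewhere (for example under the TACC launcher, whose environment we have not checked).

```shell
#!/bin/sh
# Hypothetical helper: derive a per-node tag for the log file names.
# Assumption: srun exports SLURM_NODEID (relative node ID) to each task.
# Where it is not set, fall back to the short hostname.
node_tag() {
    if [ -n "${SLURM_NODEID:-}" ]; then
        printf 'node%s' "$SLURM_NODEID"
    else
        uname -n | cut -d. -f1
    fi
}

# Hypothetical wrapper: run the given command with stdout and stderr
# redirected to per-node files in the current directory, so that nodes
# sharing a working directory do not overwrite each other's output.
run_tagged() {
    tag=$(node_tag)
    "$@" >"_condor_stdout.${tag}" 2>"_condor_stderr.${tag}"
}
```

A wrapper script around this could be started on every node with something like `run_tagged ./glidein_startup.sh "$@"`. The per-node files would still have to be added to whatever list of output files gets transferred back to the factory, which this sketch does not address.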
Info (please complete the following information):
Priority: Medium
Stakeholders: HEPCloud, CMS
Components: Glidein, possibly factory