star-bnl / star-sw

Core software for STAR experiment

Batch and interactive results differ #660

Open genevb opened 2 months ago

genevb commented 2 months ago

I've found that running a test job in batch (whether via condor directly, condor through the STAR Scheduler, or through CRS, the latter for starreco only) gives different results than running the same job interactively, for quantities printed in the log files such as track counts. The implication is that something is different in the batch environment. I should note that the batch jobs all get the same results as each other.

Things I've tested:

  1. Is this true in both 32-bit and 64-bit modes? Yes, I see the same pattern: batch is different from interactive, and the various batch runs are consistent with each other... though we've known for a long time that 32-bit and 64-bit results are not identical to each other due to rounding differences.
  2. Is this true in various libraries? Yes, I tried both SL23e and SL23f, which are different primarily in the use of the spack environment for the latter. I see the same patterns regardless, and SL23e and SL23f are identical to each other.
  3. Is this true for different users? Yes, I tried both the genevb and starreco accounts using SL23e optimized. I see the same patterns regardless, and genevb and starreco are identical to each other.
  4. Is this true in both unoptimized and optimized? Here it gets a bit confusing... While 32-bit unoptimized batch results differ from 32-bit unoptimized interactive results, this is not true in 64-bit unoptimized: the unoptimized batch results do match the unoptimized interactive results, and both of these match the 64-bit optimized interactive results (but not the 64-bit optimized batch results, so 3 out of 4 of the 64-bit results match each other). This is the only situation where I saw batch and interactive match each other. Also, 32-bit optimized and unoptimized are not identical to each other.

Summarizing these observations: there are various comparisons for which rounding differences cause some slight differences in results, but batch vs. interactive execution should not lead to such differences. Yet in only one comparison test did I see batch and interactive identical to each other (64-bit unoptimized).

Test job:

setenv NODEBUG yes
starver SL23e
root4star -b -q -l 'bfc.C(1,"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt","/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq")'

A simple thing to check:

> grep -a StiStEvent log | head -1

...gives these results:

interactive:

StiStEventFiller::fillEvent() -I- Number of filled as global(1):9801

condor:

StiStEventFiller::fillEvent() -I- Number of filled as global(1):9780

STAR scheduler:

StiStEventFiller::fillEvent() -I- Number of filled as global(1):9780

plexoos commented 2 months ago

Thank you for the report, Gene. I can reproduce the output number 9801 when I run your test job on an interactive node, and I get 9809 with the 64b env:

$ setenv NODEBUG yes
$ setup 64b
$ starver SL23e
$ root4star -b -q -l 'bfc.C(1,"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt","/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq")' | & grep -a StiStEvent | head -3
StiStEventFiller::fillEvent() -I- Number of filled as global(1):9809
StiStEventFiller::fillEvent() -I- Number of filled as global(2):6035
StiStEventFiller::fillEvent() -I- Number of filled GOOD globals:4238

Unless you have already verified this, I'd make sure that the environments are indeed the same by printing them with printenv. Can you share the script and the command you used to submit the test job to condor?
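For example, one way to capture and compare the two environments (a minimal sketch; the dump file names and the /star/scratch location are assumptions, adjust as needed):

# on the interactive node, just before running the test job
printenv | sort > /star/scratch/$USER/env.interactive

# in the batch job, added just before the root4star line
printenv | sort > /star/scratch/$USER/env.batch

# then compare the two dumps
diff /star/scratch/$USER/env.interactive /star/scratch/$USER/env.batch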

genevb commented 2 months ago

Environments: I did check this earlier today, both for "set" and "setenv", and there are a lot of differences. I tried looking for things that were common across my batch tests, but different for my interactive tests. Here are the environment variables that match that pattern from what I saw:

  1. These are set in interactive, but not batch: KRB5CCNAME, DISPLAY, several SSH_* variables
  2. These are set in batch, but not interactive: MKL_NUM_THREADS, NUMEXPR_NUM_THREAD, OMP_NUM_THREAD, OMP_THREAD_LIMIT, OPENBLAS_NUM_THREADS, _CHIRP_DELAYED_UPDATE_PREFIX, GO_MAX_PROCS, TEMP, TMP, TMPDIR, TF_NUM_THREAD, TF_LOOP_PARALLEL_ITERATIONS, BATCH_SYSTEM and a whole bunch of CONDOR* variables
  3. INTERACTIVE=1 in interactive, and 0 in batch
  4. DOMAINNAME=rcf.bnl.gov in interactive, and sdcc.bnl.gov in batch

An example of something that didn't match the pattern of interest: PATH is different for the CRS job, but is consistent between the condor, STAR scheduler, and interactive jobs.

Nothing from "set" matched the pattern of interest.


When running the job in condor directly, there is a block of code that I found in the .csh files generated by the STAR Scheduler which properly sets up the $HOME environment variable. So you have options for running in batch:

  1. Use the STAR Scheduler to submit the job, letting it automatically take care of $HOME.
  2. Use the STAR Scheduler with simulateSubmission="true" just to generate a csh file with the code that sets $HOME, and then execute that csh file from a condor submit file (a sketch of such a submit file follows the XML below).

The STAR Scheduler places a whole bunch of other stuff in the csh file that you don't need for this test. So for option 2, you could chop off everything except the $HOME setup and the few lines of user code (that's what I did when I submitted directly to condor). Regardless, here's a STAR Scheduler submission XML file, my.xml:

<?xml version="1.0" encoding="utf-8" ?>

<!DOCTYPE note [
]>
<job maxFilesPerProcess="1" minFilesPerProcess="1" filesPerHour="10" name="batchTest" simulateSubmission="false" fileListSyntax="paths" >
    <command>
setenv NODEBUG yes
starver SL23e
root4star -b -q -l 'bfc.C(1,"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt","/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq")'
    </command>

<input URL="file:/star/scratch/genevb/dummy"/>
<stdout URL="file:/star/scratch/genevb/out" />
</job>

Then simply execute star-submit my.xml
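For option 2, a bare-bones condor submit description that just runs the scheduler-generated (and trimmed) csh file could look something like the sketch below; the csh file name and the output paths are hypothetical and depend on what star-submit generated for you:

# my.condor -- hypothetical file and path names, adjust to your setup
universe   = vanilla
executable = /star/scratch/genevb/sched_batchTest_0.csh
output     = /star/scratch/genevb/batchTest.out
error      = /star/scratch/genevb/batchTest.err
log        = /star/scratch/genevb/batchTest.log
queue

Then submit it with condor_submit my.condor.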

plexoos commented 2 months ago

Thank you for the additional information Gene. Could you remind me if we should expect numerical differences between optimized and non-optimized libraries?

genevb commented 2 months ago

Hi, Dmitri

> On Feb 27, 2024, at 1:00 PM, Dmitri Smirnov wrote: Could you remind me if we should expect numerical differences between optimized and non-optimized libraries?

I can't remind you because I can't remember that detail. But it seems plausible that optimization could lead to different rounding errors in the least significant bits by performing the math differently (more efficiently), and that could cause some value to land on one side or the other of a threshold. I'm not sure to what degree we should see that in track counts.

-Gene

plexoos commented 2 months ago

I do see some significant difference in the counts when I set/unset NODEBUG. Specifically, using the same test as above, I get 9801 and 9812, respectively. In both cases I run on an interactive node, and the NODEBUG variable is the only switch I toggled. According to the logs, the libraries are picked up from the expected STAR_LIB/STAR_lib locations. STAR_BIN and LD_LIBRARY_PATH also look as expected in both tests.

One difference in the logs which I wouldn't expect is in these lines:

StMagUtilities::deltaVGG      =  0.0235596 V (east) : 0.0235596 V (west)
StMagUtilities::deltaVGG      =  0.0232773 V (east) : 0.0232773 V (west)

Could this be the reason for the difference in observed counts?
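A quick way to confirm that these lines (and not others) differ between the two runs is to pull them out of both logs and diff them (a minimal sketch; log.debug and log.nodebug are hypothetical names for the two log files):

grep -a 'StMagUtilities::deltaVGG' log.debug   > deltaVGG.debug
grep -a 'StMagUtilities::deltaVGG' log.nodebug > deltaVGG.nodebug
diff deltaVGG.debug deltaVGG.nodebug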

genevb commented 2 months ago

Following the observation from @plexoos , I checked on the differences in StMagUtilities::deltaVGG between unoptimized and optimized 32-bit. It turns out to be an artifact of a very simple calculation performed in StMagUtilities::GetE() that probably gets optimized by the compiler. The calculation is normally performed using Float_t data members of the StMagUtilities class. When I change those class members to Double_t, the numbers change slightly, but still don't match precisely between unoptimized and optimized. When I change to performing the calculation with doubles local to the function, the numbers do match precisely. Therefore, the differences in StMagUtilities::deltaVGG are definitively due to the compiler optimizing some very simple math.

However, this modification has no impact on the track counts.

My investigation of StMagUtilities::deltaVGG demonstrates that optimization can lead to some small numerical differences (there may be a large number of such little differences), but unfortunately provides no smoking gun for the difference between running in batch vs. interactive.

veprbl commented 2 months ago

Other than the environment variables, I would check the contents of /proc/cpuinfo.

plexoos commented 2 months ago

> I would check the contents of /proc/cpuinfo.

Yes, but what difference do you expect? A different architecture? 🙂 Also, I think it would be a bad joke if SDCC provided "incompatible" (in whatever sense...) machines for interactive and farm nodes.

I already checked the system lib versions; libc and the C++ libs all appear to be identical...
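For the record, the kind of check I mean, run on both an interactive and a batch node (a sketch, assuming RPM-based nodes; compare the output from the two node types):

rpm -q glibc libstdc++
ldd --version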

Here are a couple of other unlikely things I can think of...

veprbl commented 2 months ago

That is to check if a different microarchitecture is used, yes.

genevb commented 2 months ago

Two things....

First, the /proc/cpuinfo contents are available here if you want to look - there are a lot of differences:

~genevb/public/Issue660/cpuinfo.batch
~genevb/public/Issue660/cpuinfo.interactive
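For a quick look, the vendor, model, and flags fields show the Intel vs. AMD and instruction-set differences directly (a minimal check against the two dumps above; the flags.* scratch files are just hypothetical names):

grep -E 'vendor_id|model name' ~genevb/public/Issue660/cpuinfo.batch | sort -u
grep -E 'vendor_id|model name' ~genevb/public/Issue660/cpuinfo.interactive | sort -u

# instruction-set extensions present on one node type but not the other
grep -m1 '^flags' ~genevb/public/Issue660/cpuinfo.batch | tr ' ' '\n' | sort > flags.batch
grep -m1 '^flags' ~genevb/public/Issue660/cpuinfo.interactive | tr ' ' '\n' | sort > flags.interactive
diff flags.batch flags.interactive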

Second, I ran the chain with "debug2" and found that there are no differences in the TPC hits, but there are differences beginning in the CA track seed finding. By putting in a few print statements, I was able to conclude that...

  1. Quite a few tracks (hundreds) either pick up or lose a hit
  2. The tracks which appear in only one sample or the other tend to be very short (e.g. 4 hits), so they probably fell just short of qualifying as a track in the other sample.
  3. The order of seeds is somewhat re-arranged between the two samples, even within TPC sectors. I don't understand CA well enough to be certain, but perhaps the order can be impacted by which hits are already on another track seed.

The code inside TPCCATracker is awash with ifdef statements, which makes it difficult for a novice like me to find which lines of code are important and where to put some informative print statements to dig further. So that's about as far as I can go without investing much more time; my conclusion is that the CA code includes/excludes hits differently between the two. Perhaps the vectorization is different.
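For reference, the debug2 pass was just the test job from the top of this issue with the extra option appended to the chain string (a sketch; adjust if you prefer to turn on debugging another way):

root4star -b -q -l 'bfc.C(1,"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt,debug2","/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq")'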

plexoos commented 2 months ago

I've conducted a brief review of the Vc library code and observed that it includes checks for CPU vectorization capabilities. Additionally, it appears that the code can distinguish between CPUs manufactured by AMD and Intel. I'm uncertain whether this information is actually used to generate distinct code at runtime, or what the rationale behind it would be, but there is a possibility it matters: in an attempt to eliminate other variables, I ran the test job in a container, and my tests consistently yield differing results when executing it within /cvmfs/singularity.opensciencegrid.org/star-bnl/star-sw:SL23d on Intel and AMD CPUs, respectively. It may be worth mentioning that the Vc code we're currently using was released at least a couple of years before the CPU models used in the test.

plexoos commented 2 months ago

For the record, the command executing the test in the container:

singularity exec -B /star/data03/daq -e /cvmfs/singularity.opensciencegrid.org/star-bnl/star-sw:SL23d bash -l -c 'cp /star-sw/StRoot/macros/.rootrc ./ && root4star -l -b -q "bfc.C(1,\"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt\",\"/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq\")" >& log'

genevb commented 2 months ago

I discussed running interactively on a node reserved for batch (the "spool" nodes) with some SDCC folks and we got this done. Conclusion:

Running interactively on that node gave the same result as running in condor.

That fits with the idea that vectorization is performed slightly differently for the Intel processors on the batch nodes than for the AMD processors on the interactive nodes. In that case, this is probably not worth pursuing much further, and I'll close the issue in a few days if no one has any further ideas/comments.

Thanks, @plexoos , for spending some time on this too.