genevb opened 2 months ago
Thank you for the report Gene.
I can reproduce the output number 9801
when I run your test job on an interactive node, and I get 9809
with 64b env:
$ setenv NODEBUG yes
$ setup 64b
$ starver SL23e
$ root4star -b -q -l 'bfc.C(1,"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt","/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq")' | & grep -a StiStEvent | head -3
StiStEventFiller::fillEvent() -I- Number of filled as global(1):9809
StiStEventFiller::fillEvent() -I- Number of filled as global(2):6035
StiStEventFiller::fillEvent() -I- Number of filled GOOD globals:4238
Unless you already verified, I'd make sure that the environments are indeed the same by printing them with printenv. Can you share the script and the command you used to submit the test job to condor?
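A concrete way to follow the printenv suggestion above is to sort each dump and diff them. This is only a sketch: the file names and the sample values written below are fabricated for illustration; in practice the first file would come from an interactive session and the second from inside the batch job.

```shell
# Fabricated sample dumps; in reality each comes from running
#   printenv | sort > env.<context>.txt
# once per context.
printf 'HOME=/star/u/genevb\nSHELL=/bin/tcsh\n' > env.interactive.txt
printf 'HOME=/\nSHELL=/bin/tcsh\n' > env.batch.txt
# Lines starting with < are interactive-only, > are batch-only
diff env.interactive.txt env.batch.txt | grep '^[<>]'
```

Sorting first keeps the diff stable even when the two contexts export variables in a different order.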
Environments: I did check this earlier today, both for "set" and "setenv", and there are a lot of differences. I tried looking for things that were common across my batch tests, but different for my interactive tests. Here are the environment variables that match that pattern from what I saw:
An example of something that didn't match the pattern of interest: PATH differs for the CRS job, but is consistent across the condor, STAR Scheduler, and interactive jobs.
Nothing from "set" matched the pattern of interest.
When running the job in condor, there is a patch of code that I found in the .csh files that the STAR Scheduler generates which properly sets up the $HOME environment variable. So you have two options for running batch: either submit through the STAR Scheduler as usual, or set simulateSubmission="true" just to generate a csh file with the code that sets $HOME, and then execute that csh file from a condor submission file. The STAR Scheduler places a whole bunch of other stuff in the csh file that you don't need for this test, so for option 2 you could chop off everything except the $HOME stuff and the few lines of user code (that's what I did when I submitted directly to condor). Regardless, here's a STAR-scheduler submission xml file, my.xml:
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE note [
]>
<job maxFilesPerProcess="1" minFilesPerProcess="1" filesPerHour="10" name="batchTest" simulateSubmission="false" fileListSyntax="paths" >
<command>
setenv NODEBUG yes
starver SL23e
root4star -b -q -l 'bfc.C(1,"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt","/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq")'
</command>
<input URL="file:/star/scratch/genevb/dummy"/>
<stdout URL="file:/star/scratch/genevb/out" />
</job>
Then simply execute star-submit my.xml
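For option 2, the $HOME setup can be sketched as a guard that assigns the variable only when the batch system didn't provide it. This is an illustration, not the scheduler's actual code: the csh form in the comment mirrors the style of the generated scripts, the sh form below it is runnable here, and both the user name and path are hypothetical.

```shell
# In csh (the scheduler-generated scripts' language) the guard would
# look something like (illustrative, not the scheduler's exact code):
#   if ( ! $?HOME ) setenv HOME /star/u/$USER
# The same idea in sh, with a simulated batch environment:
unset HOME                    # pretend condor did not set HOME
USER=genevb                   # hypothetical user name
: "${HOME:=/star/u/$USER}"    # assign HOME only if it is unset or empty
echo "$HOME"                  # prints /star/u/genevb
```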
Thank you for the additional information Gene. Could you remind me if we should expect numerical differences between optimized and non-optimized libraries?
Hi, Dmitri
On Feb 27, 2024, at 1:00 PM, Dmitri Smirnov @.***> wrote: Could you remind me if we should expect numerical differences between optimized and non-optimized libraries?
I can't remind you because I can't remember that detail. But it seems plausible that optimization could lead to different rounding errors at the least significant bits from performing math differently (more efficiently), and that could cause some value to be on one side or the other of a threshold - I'm not sure to what degree we should see that in track counts.
-Gene
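The kind of last-bit drift Gene describes is easy to demonstrate: reassociating a floating-point sum, as an optimizing compiler may do when it reorders math for efficiency, changes the result. A minimal demonstration using awk (which computes in IEEE-754 doubles):

```shell
# The same three terms summed in two orders; the two printed values
# differ in their last digits because rounding happens at each step.
awk 'BEGIN { printf "%.17g\n%.17g\n", (0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3) }'
```

A difference this small is harmless on its own, but as noted above it can push a value across a cut threshold and change an integer count downstream.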
I do see some significant difference in the counts when I set/unset NODEBUG. Specifically, using the same test as above, I get 9801 and 9812 respectively. In both cases I run on an interactive node and the NODEBUG variable is the only switch I toggled. According to the logs, the libraries are picked up from the expected STAR_LIB/STAR_lib locations. STAR_BIN and LD_LIBRARY_PATH also look as expected in both tests.
One difference in the logs which I wouldn't expect is these lines:
StMagUtilities::deltaVGG = 0.0235596 V (east) : 0.0235596 V (west)
StMagUtilities::deltaVGG = 0.0232773 V (east) : 0.0232773 V (west)
Could this be the reason for the difference in observed counts?
Following the observation from @plexoos , I checked on the differences in StMagUtilities::deltaVGG between unoptimized and optimized 32-bit. It turns out to be an artifact of a very simple calculation performed in StMagUtilities::GetE() that probably gets optimized by the compiler. The calculation is normally performed using Float_t data members of the StMagUtilities class. When I change those class variables to Double_t, the numbers change slightly, but still don't match precisely between unoptimized and optimized. When I instead perform the calculation with local-to-the-function double variables, the numbers do match precisely. Therefore, the differences in StMagUtilities::deltaVGG are definitively due to some optimization of very simple math.
However, this modification has no impact on the track counts.
My investigation of StMagUtilities::deltaVGG demonstrates that optimization can lead to small numerical differences (there may be a large number of such little differences), but unfortunately provides no smoking gun for the difference between running in batch vs. interactive.
Other than the environment variables, I would check contents of /proc/cpuinfo.
I would check contents of /proc/cpuinfo.
Yes, but what difference do you expect? A different architecture? 🙂 Also, I think it would be a bad joke if SDCC provided "incompatible" (in whatever sense...) machines for interactive and farm nodes.
I already checked the system libs versions. libc c++ all appear to be identical...
Here are a couple of other unlikely things I can think of...
That is to check if different microarchitecture is used, yes.
Two things....
First, the /proc/cpuinfo contents are available here if you want to look - there are a lot of differences:
~genevb/public/Issue660/cpuinfo.batch
~genevb/public/Issue660/cpuinfo.interactive
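One way to compare those two dumps is to isolate the instruction-set flags, since those are what a vectorizing library would care about. The sample files below are fabricated stand-ins for the real dumps (substitute the paths above); avx512f is just an example of a flag that might differ.

```shell
# Fabricated stand-ins for the real cpuinfo dumps
printf 'flags\t\t: fpu sse sse2 avx avx2\n' > cpuinfo.interactive
printf 'flags\t\t: fpu sse sse2 avx avx2 avx512f\n' > cpuinfo.batch
# One flag per line, sorted, so the two sets can be compared with comm
for f in cpuinfo.interactive cpuinfo.batch; do
  grep -m1 '^flags' "$f" | tr -s ' \t' '\n' | sort > "$f.flags"
done
# Print only the flags present on one CPU but not the other
comm -3 cpuinfo.interactive.flags cpuinfo.batch.flags
```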
Second, I ran the chain with "debug2" and found that there are no differences in the TPC hits, but there are differences beginning in the CA track seed finding. By putting in a few print statements, I was able to conclude that...
The code inside TPCCATracker is awash with ifdef statements, which makes it difficult for a novice like me to find which lines of code are important and where to put informative print statements to dig further. So that's about as far as I can go without investing far more time; my conclusion is that the CA code includes/excludes hits differently between the two. Perhaps the vectorization is different.
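For what it's worth, a quick census of the gating symbols can help map such a tree before adding prints. The sample header below is a fabricated stand-in for the real TPCCATracker sources, and USE_VC is a hypothetical symbol used only for the demonstration.

```shell
# Fabricated stand-in for the real source tree
mkdir -p TPCCATracker
printf '#ifdef USE_VC\n#endif\n#ifdef USE_VC\n#endif\n#ifndef NDEBUG\n#endif\n' \
  > TPCCATracker/sample.h
# Count how often each #if/#ifdef/#ifndef symbol gates code
grep -rhoE '#if(n?def)?[[:space:]]+[A-Za-z_]+' TPCCATracker \
  | awk '{ print $2 }' | sort | uniq -c | sort -rn
```

Running the same grep over the real directory shows at a glance which symbols dominate, and therefore which build configuration the compiled code actually took.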
I've conducted a brief review of the Vc library code and observed that it includes checks for CPU vectorization capabilities. It also appears that the code can distinguish between CPUs manufactured by AMD and Intel. I'm uncertain whether this information is actually used to select different code paths at runtime, or what the rationale would be, but it's a real possibility given my attempt to eliminate other variables by running in a container. Specifically, my tests consistently yield differing results when executing the test job within the container /cvmfs/singularity.opensciencegrid.org/star-bnl/star-sw:SL23d on Intel and AMD CPUs, respectively. It may be worth mentioning that the Vc code we're currently using was released at least a couple of years before the CPU models used in the test.
For the record, the command executing the test in the container:
singularity exec -B /star/data03/daq -e /cvmfs/singularity.opensciencegrid.org/star-bnl/star-sw:SL23d bash -l -c 'cp /star-sw/StRoot/macros/.rootrc ./ && root4star -l -b -q "bfc.C(1,\"pp2022a,StiCA,BEmcChkStat,-mtd,-btof,-etofA,-picoWrite,-dedxy2,-hitfilt\",\"/star/data03/daq/2022/036/23036038/st_physics_23036038_raw_3500011.daq\")" >& log'
I discussed running interactively on a node reserved for batch (the "spool" nodes) with some SDCC folks and we got this done. Conclusion:
Running interactively on that node gave the same result as running in condor.
That fits with the idea that vectorization is performed slightly differently for the Intel processors on the batch nodes than for the AMD processors on the interactive nodes. In that case, this is probably not worth pursuing much further, and I'll close the issue in a few days if no one has any further ideas/comments.
Thanks, @plexoos , for spending some time on this too.
I've found that running a test job in batch (which can be condor directly, condor through the STAR Scheduler, or CRS (for starreco only)) gets different results than running the job interactively, for things printed in the log files like track counts. The implication is that something is different in the batch environment. I should note that the batch jobs all get the same results as each other.
Things I've tested:
genevb and starreco accounts using SL23e optimized. I see the same patterns regardless, and genevb and starreco are identical to each other.

Summarizing these observations: there are various comparisons for which rounding differences cause some slight differences in results, but batch vs. interactive execution should not lead to such differences. Yet in only one comparison test did I see batch and interactive identical to each other (64-bit unoptimized).
Test job:
A simple thing to check:
...gives these results: interactive:
condor:
STAR scheduler: