simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Illegal instruction #17

Closed ickc closed 9 months ago

ickc commented 9 months ago

Migrated from Slack

People involved: @earosenberg


Erik Rosenberg 3:57 PM Hi Kolen, new question: I am trying to run toast_ground_schedule on Blackett. I have a script to do this that seems to run fine on the interactive node but fails when I submit it as a job to the worker node. The error is on the toast_ground_schedule line with the error message "Illegal instruction". Is this something you've seen before? This is with your anaconda tarball btw

Kolen Cheung 8:34 PM Oh, wow. I need a reproducible working example for this. I bet if you rerun it you might not be able to get the same error right away. My bet is that it lands on a node with very old processors that lacks some newer instructions? Probably related to AVX512.

8:35 If you still have the log, could you please send me the log? Specifically I want to have the address of the node. I want to see what that node is capable of.


Erik Rosenberg 8:47 PM I did submit it a couple times with the same result. Here is the address line (I think) Transferring to host: <195.194.109.209:9618?addrs=195.194.109.209-9618+[2001-630-22-d0ff-5054-ff-fee9-c3d]-9618&alias=vm75.in.tier2.hep.manchester.ac.uk&noUDP&sock=slot1_4_1883_7e66_92777>

8:48 (the full log file and a bit more info here if helpful)

preprocess_schedule_info.tar.gz

ickc commented 9 months ago

Hi @earosenberg, I also need the other files involved in that job, including get_schedule.sh.

earosenberg commented 9 months ago

Sure, here are all the files; apologies for not providing them earlier. preprocess_schedule_full.tar.gz I actually ran it again and it seemed to work this time, but hopefully you can still reproduce the error -- the new log file for the run that worked is in this tarball.

rwf14f commented 9 months ago

As @ickc suspected, this appears to be an AVX problem. vm75 only has AVX, but your packaged conda environment seems to require at least AVX2. All other standard nodes on the testbed have at least AVX2, but not necessarily AVX512. I've disabled vm75 for now and will remove it from the testbed later (this was planned anyway).
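
For reference, one way to check which vector extensions a machine supports is to read the CPU flags Linux reports (a sketch; run it on the node in question, e.g. from an interactive job):

    # Print the unique AVX-family flags advertised in /proc/cpuinfo.
    # A node like vm75 would print only "avx"; AVX2 nodes add "avx2",
    # and AVX-512-capable nodes add "avx512f" plus other avx512* extensions.
    grep -o 'avx[a-z0-9_]*' /proc/cpuinfo | sort -u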

ickc commented 9 months ago

The current software environment I provided is compiled with AVX-512; at the time I prepared it, all nodes available to us were AVX-512 capable. That has recently changed as more nodes are made available to us.

There are two ways to tackle this in my mind, but I haven't had time to reach this stage yet:

  1. Use has_avx512f from Machine ClassAd Attributes — HTCondor Manual 23.1.0 documentation to request only AVX-512-capable nodes (see the submit-file sketch after this list).
  2. Prepare both AVX2 and AVX-512 software environments and decide at run time which one to load.
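
As a sketch of (1), a single requirements line in the HTCondor submit description file is enough; has_avx512f is the machine ClassAd attribute linked above (if the job already sets requirements, AND this onto the existing expression):

    # Only match machines whose CPUs advertise AVX-512 Foundation instructions.
    requirements = (has_avx512f == true)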

(1) can be used in the interim; I plan to do (2) in the future after testing. We will have more concrete plans once our machines arrive.

I will try to document (1) soon.
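
For (2), the run-time dispatch could be a few lines in the job's wrapper script; a minimal sketch, assuming two packed conda environments are shipped with the job (the tarball names and activation path here are hypothetical):

    #!/bin/bash
    set -e
    # Choose the most capable environment this node can actually execute.
    if grep -q avx512f /proc/cpuinfo; then
        env_tarball=conda-avx512.tar.gz   # hypothetical AVX-512 build
    else
        env_tarball=conda-avx2.tar.gz     # hypothetical AVX2 fallback build
    fi
    mkdir -p conda
    tar -xzf "$env_tarball" -C conda
    source conda/bin/activate             # conda-pack style activation (assumed layout)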

rwf14f commented 9 months ago

Please be aware that the majority of cores (~80%) in the current production cluster are on AMD nodes (Zen 2 and Zen 3), which only support AVX2 and will be unavailable if you use (1). This percentage will go up as we remove more nodes with Intel CPUs that are going out of warranty.
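
If excluding those nodes is a concern, one possible middle ground (a sketch, not something discussed above) is to make AVX2 the hard requirement and express AVX-512 only as a preference via the submit file's rank expression, so the AMD nodes remain matchable:

    # Floor: any AVX2-capable machine may run the job.
    requirements = (has_avx2 == true)
    # Preference: among matching machines, prefer AVX-512-capable ones.
    rank = (has_avx512f == true)

Note this only makes sense together with (2), since the executed binaries must still run on whichever node is matched.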

ickc commented 9 months ago

While I didn't know the exact percentage before, I was aware it probably wouldn't be the majority of the nodes. My plan in the long run is to auto-dispatch at run time so that the optimal compilation is selected automatically. Our procurement is still in progress, and this will also depend on what kind of machines we buy; there's more incentive to do this if we are buying Intel nodes. It will also depend on benchmarks once our nodes arrive, but from prior experience I anticipate AVX-512 would give a significant performance boost in our stack.

In a sense it also depends on how the fair-share system works in practice. If I have access to a smaller portion of the machines with longer queue times, but in exchange get better performance and hence fewer compute hours per job, meaning I can get more jobs done within the same fair-share, then it would be worth it. At NERSC there are the concepts of NERSC hours and the allocation year, so minimizing cost per job is objective and easy to understand. But at Blackett it is not clear how that cost model works out in practice, making this optimization (more jobs for the same fair-share) more difficult to reason about.
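
(Toy numbers for illustration only: if a job takes 10 node-hours on an AVX2 node but 6 node-hours on an AVX-512 node, and fair-share is charged in node-hours consumed, then restricting to AVX-512 nodes gets the same job done for 60% of the charge, traded against a longer wait in a smaller pool.)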

ickc commented 9 months ago

Closing; remaining items are tracked in #7.