skyfielders / python-skyfield

Elegant astronomy for Python
MIT License

Multiprocessing in Skyfield #919

Open phantomvlad opened 7 months ago

phantomvlad commented 7 months ago

Hi, I am using the library to forecast the positions of navigation satellites over the course of a day. My code runs sequentially and computes a full day of data for all satellites in about 3 minutes. While the script is running, all processor cores (4) are loaded at one hundred percent. I also run the script on a computer with 16 cores, but the same thing happens: it still takes about 3 minutes. Please tell me whether multiprocessing is built in; if so, why did the calculation speed not increase on 16 cores? And if not, why does the script load all 16 cores even though I don't use multiprocessing myself?

from skyfield.api import EarthSatellite, load

def forecast_sat_in_timestamp(one_sat, timestamp):
    # A new satellite object and a new timescale are created on every call.
    satellite = EarthSatellite(one_sat.line1, one_sat.line2)
    ts = load.timescale()
    date_for_calculate = ts.utc(timestamp.year, timestamp.month, timestamp.day,
                                timestamp.hour, timestamp.minute, timestamp.second)
    position = satellite.at(date_for_calculate)
    subpoint = position.subpoint()
    # Convert the ITRS coordinates from kilometres to metres.
    x, y, z = (subpoint.itrs_xyz.km[0] * 1000,
               subpoint.itrs_xyz.km[1] * 1000,
               subpoint.itrs_xyz.km[2] * 1000)
    return x, y, z
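
For comparison, a hedged sketch of how the same calculation might be vectorized over a whole day, so that the timescale and the Time object are built once rather than once per timestamp; the function name forecast_sat_for_day and the one-second step are illustrative and not part of the original report:

import numpy as np
from skyfield.api import EarthSatellite, load

def forecast_sat_for_day(one_sat, day_start):
    # Build the timescale and satellite once for the whole day.
    ts = load.timescale()
    satellite = EarthSatellite(one_sat.line1, one_sat.line2)
    # One Time object covering 24 hours at one-second steps.
    seconds = np.arange(0, 86400)
    t = ts.utc(day_start.year, day_start.month, day_start.day,
               day_start.hour, day_start.minute, seconds)
    position = satellite.at(t)
    subpoint = position.subpoint()
    # itrs_xyz.km is now a (3, N) array; convert kilometres to metres.
    x, y, z = subpoint.itrs_xyz.km * 1000
    return x, y, z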
phantomvlad commented 7 months ago

I found out that it is the ts.utc() function that puts this load on the cores. Why and how, I have no idea yet.

mikeawalker commented 7 months ago

I found this a while ago but didn't report it because I was in a crunch at work. We see it in a scenario similar to yours: whenever we start using ITRS positions in Skyfield, the cores get slammed.

I've been looking into it as well but no root cause yet

brandon-rhodes commented 7 months ago

@phantomvlad — Computing a very exact ITRS position requires computing precession and nutation for the time and date, which is an expensive calculation. I should give Skyfield users a way to skip the high precision nutation routine and use a simpler one for cases where, instead of doing radio astronomy that requires sub-arcsecond precision, they are doing something more straightforward like satellite work.

Note that if you ever want to ask about several satellites at the same timestamp, and can build a single date_for_calculate object, it will cache the nutation result the first time it is needed, and every further use of that date_for_calculate gets the value for free.
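
A minimal sketch of that pattern, assuming sat_records is a list of objects carrying line1 and line2 as in the snippet above (the names here are illustrative):

from skyfield.api import EarthSatellite, load

ts = load.timescale()
t = ts.utc(2024, 1, 1, 12, 0, 0)   # one shared Time object

satellites = [EarthSatellite(r.line1, r.line2) for r in sat_records]

subpoints_m = []
for satellite in satellites:
    # The first .at(t) computes and caches precession/nutation on t;
    # every later satellite reuses those cached matrices for free.
    subpoints_m.append(satellite.at(t).subpoint().itrs_xyz.km * 1000)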

As for the high CPU load: that might be a fault in NumPy and how it's trying to accelerate the matrix multiplies. Here's a setting that some folks have found useful:

https://rhodesmill.org/skyfield/accuracy-efficiency.html#using-too-many-cpu-cores

Soon the docs will be updated with an additional setting, which a user offered here:

https://github.com/skyfielders/python-skyfield/issues/916#issuecomment-1812445190

Let me know if one of those works for you!

tutucea commented 7 months ago

For me, when processing lots of data and printing the output, it feels like an old laser printer, while other software is instant. (By the way, this isn't a put-down, just sharing; it may also not necessarily be Skyfield but the other modules being used.)

mikeawalker commented 7 months ago

Confirming that OpenBLAS threading, controlled by the OPENBLAS_NUM_THREADS environment variable, is the root cause of the issue that I'm seeing.

Here is a minimal example that triggers the effect:

import datetime
import time

import pytz
import skyfield.api as sf


def main():
    # Look up the ISS in the CelesTrak "stations" TLE set.
    name = "ISS (ZARYA)"
    n = 100

    ts = sf.load.timescale()
    satellites = sf.load.tle_file("https://celestrak.org/NORAD/elements/gp.php?GROUP=stations&FORMAT=tle")
    by_name = {sat.name: sat for sat in satellites}
    sat = by_name[name]

    start = datetime.datetime.utcnow().replace(tzinfo=pytz.utc)

    # One geodetic position per second for n seconds.
    for k in range(n):
        t = start + datetime.timedelta(seconds=k)
        t_sf = ts.from_datetime(t)
        state = sat.at(t_sf)

        lat, lon = sf.wgs84.latlon_of(state)
        alt = sf.wgs84.height_of(state)
        print(f"{lat},{lon},{alt}")
        time.sleep(1)


if __name__ == "__main__":
    main()

If I run this example with:

export OPENBLAS_NUM_THREADS=0

I get the following excessive usage:

[screenshot: htop output with OPENBLAS_NUM_THREADS=0]

If I change the OpenBLAS setting, then it's fixed:

export OPENBLAS_NUM_THREADS=1

[screenshot: htop output with OPENBLAS_NUM_THREADS=1]

Usage is extremely low, as expected.
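
For anyone who cannot export the variable in the shell, here is a hedged sketch of setting it from inside the script instead; it has to happen before NumPy (and therefore Skyfield) is imported, because OpenBLAS reads the variable when the library is first loaded:

import os

# Must come before any import of numpy or skyfield.api,
# since OpenBLAS reads OPENBLAS_NUM_THREADS only at load time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import skyfield.api as sf  # imported after the variable is set, on purpose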

brandon-rhodes commented 7 months ago

@mikeawalker — Thank you for the working example, and for the screenshots of your system. That is a startling amount of load, and very poor behavior on the part of NumPy (or maybe an underlying library it's not in sufficient control of?).

One thing confuses me in your first screenshot: the top process shows 97.9% CPU, yet none of the CPUs on the system look that heavily loaded. Up in the bar chart above, no single CPU is at more than 13.9% load, and most of them are at only half that, or less. Could your load monitor be misreporting things, with process 246873 somehow being the sum of the other seven python3 uut.py threads, in which case the process/thread list would be showing twice the load that the system is actually suffering?

mikeawalker commented 7 months ago

A little more data about my system.

I'm running VMware with 8 processors and 1 core per processor. The OS is Ubuntu 22.04.

I also have another data point. In late 2023 we ran about 40 services, each running a slightly more complex script, as a scaled simulation on a beefy AWS VM. The thread contention got considerably worse when we did this. Knowing what we know now, I'd speculate that OpenBLAS was trying to use as many cores as it could for all 40 services.

The load measurement is from htop. I got unlucky with the screenshot: it looks like htop updates the bar chart before the process list, but I can confirm (just verified) that when the process list maxes out, the bar charts hit the same number. The load is also quite inconsistent and varies between 1% and 16% as I run my test.

I also verified that the number of instances of uut.py equals the value set with export OPENBLAS_NUM_THREADS=4. Here's another screenshot of me setting the value to 4 and watching the processes.

[screenshot: htop output with OPENBLAS_NUM_THREADS=4]

brandon-rhodes commented 7 months ago

Thanks for the additional information and diagram! So, just to see if I have the concept clear in my mind: is it because the bar graphs are always a little out of sync with the process list that I don't see any CPU at 22.6%?

mikeawalker commented 7 months ago

If the bars and process list are out of sync that's just htop being a bit delayed.

The total usage percentage is measured against a single core, with 1 core = 100%, so on multi-core systems it can be greater than 100%. You should see the sum of all the bars equal the total usage.

I get confused about this in htop so I double checked stack overflow https://unix.stackexchange.com/questions/146085/top-command-on-multi-core-processor/146090#146090

mikeawalker commented 7 months ago

Just a quick update with more info: I found a relevant thread in NumPy's repo from way back on this.

https://github.com/numpy/numpy/issues/8120