skyfielders / python-skyfield

Elegant astronomy for Python

Improving Execution Time for almanac.find_discrete() #612

Open aendie opened 3 years ago

aendie commented 3 years ago

I have long been aware of the relatively long execution time of the almanac.find_discrete() function. Compared to Ephem (which obviously uses different algorithms), it is highly accurate and understandably slower. My ‘use case’ is building the daily pages of a Nautical Almanac, and Event Time Tables that print event times in hh:mm:ss format (more out of curiosity, as event times are always published in hh:mm format in almanacs). In switching from Ephem to Skyfield, creation of these pages became typically four times slower.

Creation of 4 Nautical Almanac pages (spanning 6 days of data) takes 11.35 seconds (on my PC, in Windows 10), of which 9.11 seconds (80.3%) are spent in the almanac.find_discrete() function.

Creation of 3 Event Time Table pages (spanning 6 days of data) takes 18.12 seconds (on my PC, in Windows 10), of which 16.5 seconds (91.1%) are spent in the almanac.find_discrete() function.

Some compute-intensive tasks in the Nautical Almanac are outlined in the per-page breakdown in my next comment.

The compute-intensive tasks in the Event Time Tables are basically the same, except that all times are rounded and printed to the second (in hh:mm:ss format). Also, “Sunset, Twilight (Civil and Nautical) begin and end, Sunrise” is calculated for every day (not just one day of the three).

How I approached this issue, and the resulting metrics, follow…

aendie commented 3 years ago

The PC hardware I use is based on a Gigabyte X570 AORUS ULTRA motherboard with an AMD Ryzen 3700X 8-core processor (two threads per CPU core) and 32 GB RAM, so I effectively have 16 logical CPUs. A good solution for compute-intensive tasks is multiprocessing (assuming the tasks are thread-safe). Moreover, it would be good to cater for the default “spawn”-process mode of Windows 10 as well as the default “fork”-process mode of Linux. A hardware prerequisite is a dual-boot PC with Windows 10 Pro and Ubuntu 20.04 LTS. Unfortunately, I cannot perform any tests on Mac OS.

I devised a multiprocessing solution worthy of mention. (The strategy is clearer when you view some sample output pages). I focused on a piecemeal approach for suitable processing steps. These are executed in sequence per “logical unit of work” (half a page of Event Tables or per Nautical Almanac page).

Prepare Nautical Almanac for parallel execution

A Nautical Almanac contains 3 days of information on two pages. Here are some “per page” tasks that are executed serially for the LEFTHAND page:

  1. Split the “planets” data calculation into five parallel tasks: Aries, Venus, Mars, Jupiter, Saturn. Each of the 5 parallel tasks calculates, for each hour on ‘date’, ‘date+1’ and ‘date+2’:
     a. the GHA
     b. the Declination (for all except Aries)
     (A sketch of such a per-planet task follows this list.)
  2. Split the “planets” Meridian Passage calculation into four parallel tasks: Venus, Mars, Jupiter, Saturn. Each of the 4 parallel tasks calculates, for ‘date’, ‘date+1’ and ‘date+2’:
     a. the SHA
     b. the planet transit time
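For illustration, here is a minimal sketch (not SFalmanac's actual code; the ephemeris file and planet are arbitrary) of what one such per-planet task computes: hourly GHA and Declination derived from apparent RA/Dec and Greenwich Apparent Sidereal Time, using GHA = (GAST - RA) * 15 degrees, modulo 360, with GAST and RA in hours.

```python
from skyfield.api import load

ts = load.timescale()
eph = load('de421.bsp')            # any planetary ephemeris will do
earth, venus = eph['earth'], eph['venus']

t = ts.utc(2023, 1, 1, range(72))  # every hour over three days

# Apparent equinox-of-date coordinates, as a nautical almanac needs.
ra, dec, _ = earth.at(t).observe(venus).apparent().radec(epoch='date')

gha = (t.gast - ra.hours) * 15.0 % 360.0  # Greenwich Hour Angle, degrees
gha_aries = t.gast * 15.0 % 360.0         # GHA of Aries (no body needed)
declination = dec.degrees
```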

Here are some “per page” tasks that are executed serially for the RIGHTHAND page:

  1. Split the Sun-Moon data calculations into three parallel tasks, one per day (all three days are printed on a single page). Each of the 3 parallel tasks calculates, for each hour of its day (‘date’, ‘date+1’ or ‘date+2’):
     a. Sun GHA
     b. Sun Declination
     c. Moon GHA
     d. Moon Declination
     e. Moon Horizontal Parallax
     f. Moon v and d rates
  2. Split the event time calculations into 31 parallel tasks by latitude (31 latitudes between 72° N and 60° S). Each of the 31 parallel tasks calculates, for ‘date+1’ only:
     a. Sunrise & Sunset time
     b. Civil twilight start/end time
     c. Nautical twilight start/end time
  3. Split the event time calculations into 31 parallel tasks by latitude (31 latitudes between 72° N and 60° S). Each of the 31 parallel tasks calculates, for ‘date’, ‘date+1’ and ‘date+2’:
     a. Moonrise & Moonset time

Finally, the Equation of Time data is appended serially.

All measurements are made employing Process Pools (using Pool.map). The 31 parallel tasks contribute the most to the performance gain.

The first and second LEFTHAND page splits (“planets” data; “planets” Meridian Passage), as well as the first RIGHTHAND page split (Sun-Moon data), only made the Linux timing worse and are thus only included in the Windows 10 timing results. (I only include parallel processes if they actually save on overall execution time.)

All worker processes are created once, at the start, and only closed when all computation has completed (8 maximum for Windows; 12 maximum in Linux).
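To make the setup concrete, here is a minimal sketch of the 31-latitude split (not SFalmanac's actual code; the helper names, the ephemeris file, and the latitude list are illustrative): one persistent Pool whose workers load their own timescale and ephemeris once via an initializer, with Pool.map fanning the latitudes out to the workers.

```python
from multiprocessing import Pool

from skyfield import almanac
from skyfield.api import load, wgs84

ts = eph = None

def worker_init():
    # Runs once per worker process, so the expensive loads are not
    # repeated on every Pool.map task.
    global ts, eph
    ts = load.timescale()
    eph = load('de421.bsp')

def twilight_events(lat):
    # One parallel task: sunrise/sunset and twilights for one latitude.
    t0, t1 = ts.utc(2023, 1, 1), ts.utc(2023, 1, 2)
    observer = wgs84.latlon(lat, 0.0)
    f = almanac.dark_twilight_day(eph, observer)
    times, events = almanac.find_discrete(t0, t1, f)
    return lat, list(zip(times.utc_iso(), events))

if __name__ == '__main__':
    # A guess at the 31 almanac latitudes between 72°N and 60°S.
    latitudes = [72, 70, 68, 66, 64, 62, 60, 58, 56, 54, 52, 50,
                 45, 40, 35, 30, 20, 10, 0, -10, -20, -30, -35,
                 -40, -45, -50, -52, -54, -56, -58, -60]
    # In SFalmanac the pool is created once and reused across pages.
    with Pool(processes=8, initializer=worker_init) as pool:
        results = pool.map(twilight_events, latitudes)
```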

Prepare Event Time Tables for parallel execution

An Event Time Tables page contains 2 days of information. Here are the tasks that are executed serially:

  1. Split the event time calculations into 31 parallel tasks by latitude (31 latitudes between 72° N and 60° S). Each of the 31 parallel tasks calculates, for one date:
     a. Sunrise & Sunset time
     b. Civil twilight start/end time
     c. Nautical twilight start/end time
  2. Split the event time calculations into 31 parallel tasks by latitude (31 latitudes between 72° N and 60° S). Each of the 31 parallel tasks calculates, for one date:
     a. Moonrise & Moonset time
  3. Split the “planets” Meridian Passage calculation into four parallel tasks: Venus, Mars, Jupiter, Saturn. Each of the 4 parallel tasks calculates, for one date:
     a. the SHA
     b. the planet transit time

The above sequence is repeated for the second day data of a page, and finally the Equation of Time data (for two days per page) is appended serially.

Execution time measurements were made both with Process Pools (Pool.map) and with a Process Pool Executor (ProcessPoolExecutor.map). The former was a little faster and is thus used in the timing measurements; a sketch of both variants follows. The 31 parallel tasks contribute the most to the performance gain.
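For reference, the two APIs compared, in sketch form (the work function and task list are placeholders):

```python
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

def work(lat):
    return lat                      # stand-in for the real per-latitude task

latitudes = list(range(31))         # placeholder task list

if __name__ == '__main__':
    # Variant 1: multiprocessing.Pool (slightly faster in these tests)
    with Pool(processes=8) as pool:
        r1 = pool.map(work, latitudes)

    # Variant 2: concurrent.futures.ProcessPoolExecutor
    with ProcessPoolExecutor(max_workers=8) as ex:
        r2 = list(ex.map(work, latitudes))
```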

The third split above (“planets” Meridian Passage) only made the Linux timing worse and is thus only included in the Windows 10 timing results. (I only include parallel processes if they actually save on overall execution time.)

All worker processes are created once, at the start, and only closed when all computation has completed (8 maximum for Windows; 12 maximum in Linux).

Measured Performance on Ubuntu 20.04 LTS

The following execution times are from a single sample run and ignore typical run-to-run variance. The results on Ubuntu 20.04 LTS are not overwhelming. Based on this, a maximum of 12 logical processors (CPUs) was chosen for SFalmanac when running on Linux.

[Screenshot: TIMING Ubuntu (timing results)]

When multiprocessing (using 12 CPUs) …

The Docker container (running Debian with 12 CPUs) in a VM on Ubuntu 20.04 is around 10% slower than running on a native (Ubuntu 20.04) operating system. Also, the MP stopwatch times are “fictive” (see the UPDATE below).

Measured Performance on Windows 10 Pro

The following execution times are from a single sample run and ignore typical run-to-run variance. The results in Windows 10 Pro are very satisfactory. Based on this, a maximum of 8 logical processors (CPUs) was chosen for SFalmanac when running on Windows 10.

[Screenshot: TIMING Win 10 (timing results)]

When multiprocessing (using 8 CPUs) …

The Docker Container (running Debian with 12 CPUs) in a VM on Windows 10 Pro is around 7% to 10% slower than running on a native (Ubuntu 20.04) operating system. Also, the MP stopwatch times are “fictive”.

Conclusion and Outlook

The results in Windows 10 are totally satisfactory and fully make up for the original observation “that in switching from Ephem to Skyfield, creation of my pages was typically four times slower”. This is a remarkably good result, seeing that I originally made just a wild guess at which pieces of code could possibly be run in parallel. Furthermore, I “played it safe” by placing all real-time data into arguments, passing them on from function to function. This has the positive side-effect that the same code runs in “spawn”-process mode (the default on Windows and Mac OS) as well as in “fork”-process mode (the default on Linux). The negative side-effect is that the functions look bloated (and ugly) with arguments galore. But hey, it works like magic!

Well, almost. Every multiprocessing example on the Internet states that the (Linux-default) “fork”-process mode is more efficient than the “spawn”-process mode. But I also note that such example code is often packed into a single “main” module, which I think is unrealistic for complex real-life multiprocessing scenarios. Whether this contributes to the problem, I don’t know. However, I am aware that I am passing (“pickle/unpickle”) this timescale object to each worker process with every Pool.map iteration:

[Screenshot: the pickled timescale object]

It is 293,966 bytes large (as a pickled object) and its value never changes within a single execution run. It is therefore a perfect candidate to be manually pickled just once and placed in a shared memory buffer. This is my next technological tack for investigation (and I hope the performance in Linux will improve).
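A minimal sketch of that idea, assuming Python 3.8+ for multiprocessing.shared_memory (this is the planned investigation, not working SFalmanac code): pickle the timescale once, place the bytes in a shared memory block, and let each worker unpickle from the block instead of receiving the object with every Pool.map call.

```python
import pickle
from multiprocessing import shared_memory

from skyfield.api import load

ts = load.timescale()
payload = pickle.dumps(ts)

# Parent: create the shared block once and copy the pickled bytes in.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Worker: attach by name and unpickle.  The block's name and the payload
# size would be handed to each worker, e.g. via the Pool initializer.
worker_shm = shared_memory.SharedMemory(name=shm.name)
ts_in_worker = pickle.loads(bytes(worker_shm.buf[:len(payload)]))
worker_shm.close()

# Parent, after all computation has completed:
shm.close()
shm.unlink()
```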

I have not mentioned whether Skyfield is thread-safe. I simply don’t know, but up to now I have had no problems at all (e.g. calculating yearly almanacs and yearly Event Time Tables).

Links to SFalmanac (multiprocessing version)

Feel free to experience the thrill of exhilarating multiprocessing execution speeds by downloading any of the following…

Download as Python code modules: https://github.com/aendie/SFalmanac-Py3

Download as a Python Package: https://pypi.org/project/sfalmanac/

Download as a Docker Image: https://hub.docker.com/repository/docker/aendie/sfalmanac

NOTE: all of these are also capable of running in single-processing mode; the ‘MULTIpr’ variable in config.py controls this.

UPDATE:

I forgot to mention that I also measured Linux timings using the non-default "spawn"-process methodology. This is not the bugbear I'm seeking ... the performance on Linux didn't improve at all.

Another point: the "stopwatch" multiprocessing times, which I said are "fictive", increase with more CPUs. Shouldn't CPUs process in parallel, with the time taken by almanac.find_discrete() remaining roughly constant? Well no, not if ONE CPU core has TWO "threads" (or "logical processors", as Microsoft correctly names them). Logical processors are great for multi-threading but pointless for multiprocessing. This shows that one physical CPU core (I have 8 in total) can only process one "thread" at a time, i.e. the two threads interrupt each other.

So it is logical to assume (with two threads per CPU core) that the single-processing time spent in almanac.find_discrete() can at most double. Windows processing for a Nautical Almanac would then consume a maximum of 2 x 9 = 18 seconds. True! Windows processing of Event Time Tables would then consume a maximum of 2 x 16.5 = 33 seconds. True!

But if you look at the Ubuntu times, this is NOT the case... which implies that something else is going on, interrupting the supposedly parallel processes. Very strange. Okay, think logically. This effect (that Linux only achieves a factor 2 or 3 speedup) is common to native Ubuntu as well as to the Debian operating system within the Docker image. It is unlikely to be a Python problem because Windows 10 runs perfectly. However, as Ubuntu and Debian are closely related, my next tack (to use a sailing term) will be to build a Docker image based on another Linux variant.
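For anyone reproducing this, the logical/physical distinction can be checked from Python (psutil is a third-party package; os.cpu_count() reports logical processors only):

```python
import os
import psutil  # third-party: pip install psutil

print(os.cpu_count())                    # 16 on the Ryzen 3700X above
print(psutil.cpu_count(logical=False))   # 8 physical cores
```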

brandon-rhodes commented 3 years ago

That's an interesting project report! So far as I know Skyfield is thread-safe since Python itself is a thread-safe language — though if you are using multiprocessing then thread safety won't even arise, since it uses processes instead. You could look into whether NumPy might be willing to do things in parallel. I think there was even an issue the other day opened by someone for whom NumPy, without asking, used all of their cores for a matrix operation?

Let me know if there are action items here for Skyfield before the issue can be closed, or if this issue will happily serve its purpose simply by existing as a place for you to make further comments and collect notes about your effort for Skyfield users who might search and find it later.

aendie commented 3 years ago

Thanks for your comment! I am progressing well with further ideas on multiprocessing, just to see how they compare. I now have a third option: multiprocessing with separate input and output queues per process (sketched below). It took me a while to get it working (e.g. with only 2 available CPUs as well as with the full 16, when I require only four parallel processes for some tasks and all I can get for others). I'm rather amazed that, once I had learned how to create the parallel processes, they run trouble-free.
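A minimal sketch of that third option (not the actual SFalmanac code; compute() is a placeholder for the real almanac work): each worker owns its own input and output queue and loops until it receives a sentinel.

```python
from multiprocessing import Process, Queue

def compute(task):
    return task * 2                   # placeholder for the real work

def worker(in_q, out_q):
    while True:
        task = in_q.get()
        if task is None:              # sentinel: shut the worker down
            break
        out_q.put(compute(task))

if __name__ == '__main__':
    workers = []
    for _ in range(4):
        in_q, out_q = Queue(), Queue()
        p = Process(target=worker, args=(in_q, out_q))
        p.start()
        workers.append((p, in_q, out_q))

    for i, (p, in_q, out_q) in enumerate(workers):
        in_q.put(i)                   # hand one task to each worker

    results = [out_q.get() for p, in_q, out_q in workers]

    for p, in_q, out_q in workers:
        in_q.put(None)                # stop the workers
        p.join()
```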

My fundamental feedback is that Windows 10 performs to full expectations in “pool.map” mode. However, I am still determined to see whether I can get similar performance from Linux, so I will add updates when I get more results. I don't mind if you close this in the meantime; I wasn't expecting to keep it open for long.

I appreciate having a dual-boot PC system as it allows me to compare performance on practically the same hardware. I even wonder if having Windows 10 on an internal NVMe SSD stick and Linux on an internal SATA-connected SSD drive might affect performance. I could purchase a small second NVMe stick and put that on the motherboard too (for a Linux installation). Anyway, thanks for permitting me to host my research on multiprocessing here ... maybe it will be useful to someone eventually.

quantenschaum commented 1 year ago

I would like to comment on this, because I have been looking into it too. I took a different approach to distributing the work over the processes: I simply split the range of days/pages into n (almost) equally sized parts, compute the data for them in parallel, and merge the data together at the end (sketched below). This way it is only necessary to transfer (pickle) the computed values once per process.
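A minimal sketch of this strategy (compute_days() is a placeholder for the real page-generation function): split the day range into n nearly equal chunks, compute each chunk in its own process, then merge in order.

```python
from multiprocessing import Pool

def compute_days(day_range):
    # Placeholder: the real function would build the almanac data here.
    return [f'data for day {d}' for d in day_range]

def chunks(seq, n):
    # Split seq into n contiguous parts whose sizes differ by at most 1.
    k, m = divmod(len(seq), n)
    start = 0
    for i in range(n):
        end = start + k + (1 if i < m else 0)
        yield seq[start:end]
        start = end

if __name__ == '__main__':
    days = list(range(66))            # the 66 days from the test below
    with Pool(processes=4) as pool:
        parts = pool.map(compute_days, list(chunks(days, 4)))
    data = [row for part in parts for row in part]   # merge in order
```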

I use Ubuntu on an i7-1185G7, 4 cores with 2 threads each

When running a single process, it uses about 50% of the CPUs. I suspect this is due to numpy using multiple threads, as @brandon-rhodes already said. Whether that is the case on your system depends on your installation. I think numpy uses a BLAS/LAPACK backend, and there are many of them. If you happen to have one installed that uses multi-threading, numpy and Skyfield will use it and benefit from it (or not). I am using OpenBLAS with pthreads. Setting the environment variable OMP_NUM_THREADS=1 reduces the CPU usage to a single core, as expected (see the snippet below). Increased CPU usage does not always mean that the computation is actually done faster; multi-threading may just introduce overhead, resulting in higher load but no significant gain in speed.
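A sketch of the thread-pinning options (threadpoolctl is a third-party package; OpenBLAS also honours OPENBLAS_NUM_THREADS):

```python
import os
os.environ['OMP_NUM_THREADS'] = '1'   # must be set before numpy is imported

import numpy as np

# Runtime alternative, scoped to a block of work:
# from threadpoolctl import threadpool_limits
# with threadpool_limits(limits=1):
#     ...numpy-heavy Skyfield calls...
```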

I could stop here, because this no longer has anything directly to do with Skyfield: Skyfield just behaves as the underlying numpy does. But I want to share what I have found out.

Some quick tests show the following results (generation of 66 days of data):

| processes | threads | time (mm:ss) | load |
|----------:|--------:|-------------:|-----:|
| 1 | 1 | 01:38 | 18 |
| 1 | 4 | 01:52 | 25 |
| 1 | 8 | 02:14 | 49 |
| 4 | 1 | 00:37 | 43 |
| 4 | 2 | 00:41 | 57 |
| 4 | 8 | 01:23 | 82 |
| 8 | 1 | 00:35 | 81 |
| 8 | 8 | 01:25 | 92 |
| 12 | 1 | 00:37 | 88 |
Using multiple threads does not make it faster; it slows it down. This may be because the arrays processed by numpy are too small in this case to benefit from multi-threading. If multi-processing and multi-threading are used simultaneously, the processes do not benefit, because with their multiple threads they steal the available cores from each other. It performs best, with the least CPU usage, when using 4 processes (on 4 cores) with single-threaded numpy.

Gregory-N-able commented 1 year ago

Another thing to consider when running these tests, especially on desktop CPUs: although they may have 2 threads per core, the two threads are not equivalent. Usually the first thread is a fully functional thread, while the second is a reduced-capability thread that cannot perform all of the functions the first thread can, or receives a much lower priority than the first thread.

You can see this even in Windows by running a multithreaded application with heavy CPU utilization (for instance games like Cities: Skylines or Cyberpunk 2077, or 3D CAD tools, 3D printer slicers, etc.) and watching the CPU utilization across all cores (Windows Task Manager can show this: select the 'Performance' tab -> CPU -> right-click on the graph -> 'Change graph to' -> 'Logical processors'). Usually you will see the first thread running near its limits while the second thread sits at a much lower level of utilization. Some of this is due to CPU scheduling, but some is due to the limitations of running multiple threads on a single core.

Taking this further, the more recent Desktop CPUs have both Performance cores and Efficiency cores. This also results in a difference in multithreaded performance across the CPU cores and threads.