skyfielders / python-skyfield

Elegant astronomy for Python

Strange traceback from `.apparent()` when computing sun angle. #914

Closed: meawoppl closed this issue 10 months ago

meawoppl commented 11 months ago

About once every 10,000,000 invocations I see a traceback like this:

ValueError: cannot reshape array of size 45760 into shape (1760,23)

at ._data ( /usr/local/lib/python3.9/site-packages/jplephem/spk.py:177 )
at .__get__ ( /usr/local/lib/python3.9/site-packages/jplephem/descriptorlib.py:12 )
at .generate ( /usr/local/lib/python3.9/site-packages/jplephem/spk.py:209 )
at .compute_and_differentiate ( /usr/local/lib/python3.9/site-packages/jplephem/spk.py:159 )
at ._at ( /usr/local/lib/python3.9/site-packages/skyfield/jpllib.py:218 )
at .at ( /usr/local/lib/python3.9/site-packages/skyfield/vectorlib.py:92 )
at .add_deflection ( /usr/local/lib/python3.9/site-packages/skyfield/relativity.py:52 )
at .apparent ( /usr/local/lib/python3.9/site-packages/skyfield/positionlib.py:781 )
at .is_sun_up ( /app/teletom_shared/src/teletom_shared/math/terrestrial.py:32 )

[truncated trace from our codebase]

The calling method does something like this:

def is_sun_up(location: GeographicPosition, time: Time) -> bool:
    sun = planets["sun"]
    location_ssb = planets["earth"] + location
    # The comparison line below is the one that raises; -0.8333 degrees is
    # the conventional sunrise/sunset altitude of the sun's center.
    return location_ssb.at(time).observe(sun).apparent().altaz()[0].degrees >= -0.8333

which is basically boilerplate from the examples...

Curiously, 45760 does divide evenly by 1760 (45760 / 1760 = 26), but the attempted reshape above asks for 23 columns, which would need only 1760 × 23 = 40480 values. Maybe this is an error in jplephem?

Any wisdom appreciated, and thanks.

brandon-rhodes commented 11 months ago

I can't think of anything offhand that could cause that symptom, so I'll ask a few questions that might help us circle in. First: is this happening after a file has been open for a very long time, or when an ephemeris is first being accessed? The _data() method should only ever be called once per segment, on the very first occasion its data is accessed, and the memory map is re-used from then on.

In the code, I don't see any numbers coming in from Skyfield. These quantities all come from the ephemeris itself, right?

        # The four trailing doubles of the segment: initial epoch, interval
        # length, record size, and record count.
        init, intlen, rsize, n = self.daf.read_array(self.end_i - 3, self.end_i)
        coefficient_count = int(rsize - 2) // component_count
        # Memory-map the segment's coefficient records.
        coefficients = self.daf.map_array(self.start_i, self.end_i - 4)

        # This is the reshape that raises the ValueError in the traceback.
        coefficients.shape = (int(n), int(rsize))

meawoppl commented 11 months ago

We are running it in a scheduled way, and it seems to break on the first call.

It looks like this might be a threading issue. We use a concurrent.futures.ThreadPoolExecutor to batch a bunch of satellite predictions. The first step is figuring out local sunrise/sunset timing, and thus the imaging window we can schedule during. I see two errors where two threads hit similar reshape issues, and I suspect something down the stack isn't happy with multi-threaded access:

[screenshot: two near-simultaneous error logs]

Both threads start up and call the method I described earlier, and they logged errors within 1 ms of each other...

Looking at jplephem.descriptorlib I see a form of memoization, but I don't fully follow what it is doing. My suspicion is that the seek()/read()/write() calls in jplephem.daf are the thread-unsafe bit, and that they can return data of the wrong size?
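
As far as I can tell it is a caching descriptor along these lines (my rough paraphrase, not the exact source):

import functools

class reify(object):
    def __init__(self, method):
        self.method = method
        functools.update_wrapper(self, method)

    def __get__(self, instance, objtype=None):
        if instance is None:
            return self
        # Compute once, then plant the value in the instance dict so the
        # descriptor is never consulted again for this attribute.
        value = self.method(instance)
        instance.__dict__[self.method.__name__] = value
        return value

If that is right, nothing in the pattern itself is thread-safe: two threads can both find the attribute missing and both run the method, so their underlying seek()/read() pairs can overlap.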

brandon-rhodes commented 11 months ago

Ah, it's breaking on the first call! Yes, the problem must be interleaving between one thread's seek()+read() pair and another thread's attempt to seek and read, so that at least one of the read()s winds up reading where it didn't intend.
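
Schematically, with a shared file object f (hypothetical names, not jplephem's actual code):

def read_array(f, offset, length):
    # Nothing stops another thread from calling f.seek() between these
    # two lines:
    #   thread A: f.seek(a)
    #   thread B: f.seek(b)    # moves the shared file position
    #   thread A: f.read(n)    # reads from offset b instead of a
    f.seek(offset)
    return f.read(length)

A read that lands at the wrong offset returns the wrong values, and wrong values for the record size and count are exactly what would produce an impossible reshape.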

I have added a thread lock around seek+read pairs. Could you try out what happens with this new code before I try releasing it? To install the development version:

pip install -U https://github.com/brandon-rhodes/python-jplephem/archive/master.zip
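
For the curious, the change is essentially a lock held across each pair; a minimal sketch of the idea with hypothetical names (the real change is in the jplephem repository):

import threading

class LockedFile:
    def __init__(self, file):
        self.file = file
        self.lock = threading.Lock()

    def read_at(self, offset, length):
        # Holding the lock makes the seek+read pair atomic with respect
        # to other threads sharing this file object.
        with self.lock:
            self.file.seek(offset)
            return self.file.read(length)
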
meawoppl commented 11 months ago

A few notes. The diff looks like it would fix the issue. Many thanks!

If you wanted to make add_array() thread-safe as well, a threading.RLock() would do it pretty easily.
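
To illustrate why an RLock rather than a plain Lock: an RLock can be re-acquired by the thread that already holds it, so a locked write path can call locked read helpers without deadlocking. A sketch with hypothetical names:

import threading

lock = threading.RLock()

def add_array(data):
    with lock:
        read_array()  # fine: an RLock permits re-entry by the same thread

def read_array():
    with lock:  # with a plain Lock, this nested acquire would deadlock
        pass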

This is the first time I have seen this particular failure. That particular wedge of code has run every 15 minutes for... the last year or so, so I would estimate roughly a 1-in-30,000 chance of reproduction on any given run.

I might suggest a test case that is something like:

from concurrent.futures import ThreadPoolExecutor

def test_notion():
    for _ in range(10):
        cache = ...  # a freshly created object whose cached method does the seek+read
        with ThreadPoolExecutor(max_workers=2) as tpe:
            # Two threads race to make the first, cache-filling call.
            fut1 = tpe.submit(cache.get_method, args)
            fut2 = tpe.submit(cache.get_method, args)

            fut1.result()
            fut2.result()

brandon-rhodes commented 11 months ago

A few notes. The diff looks like it would fix the issue. Many thanks! … That particular wedge of code has run every 15 minutes for... the last year or so, so I would estimate roughly a 1-in-30,000 chance of reproduction on any given run.

Then I won't wait for you to test the new version extensively. I'll just plan a release soon and hope the locks went in the right place!

brandon-rhodes commented 10 months ago

I have just released a new jplephem as a quick fix for this. Hopefully it won't cause your threads any further problems—enjoy!