skyfielders / python-skyfield

Elegant astronomy for Python
MIT License
1.44k stars 214 forks source link

KeyError when loading comet #772

Closed Bernmeister closed 2 years ago

Bernmeister commented 2 years ago

Unsure if this is related to #707 but I've not run my comet code since then and noticed today I'm getting a KeyError when referring to a comet name. Here is sample code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from skyfield.api import load
from skyfield.constants import GM_SUN_Pitjeva_2005_km3_s2 as GM_SUN
from skyfield.data import mpc

url = "file://./CometEls.txt" 
# url = mpc.COMET_URL # Same result

with load.open( url ) as f:
    comets = mpc.load_comets_dataframe( f) 
    # comets = mpc.load_comets_dataframe_slow( f) # Same result

print( len( comets ), "comets loaded" )

ts = load.timescale()
ephemeris = load( "de421.bsp" )
sun, earth = ephemeris[ "sun" ], ephemeris[ "earth" ]

row = comets.loc[ "1P/Halley" ] # KeyError: '1P/Halley'
comet = sun + mpc.comet_orbit( row, ts, GM_SUN )
t = ts.utc( 2022, 5, 31 )
ra, dec, distance = earth.at( t ).observe( comet ).radec()
print( ra )
print( dec )

I get KeyError: '1P/Halley' at the line when extracting the row.

I have run pip3 install --upgrade jplephem numpy pandas skyfield and those packages are at version:

Package                Version
---------------------- --------------------
jplephem               2.17
numpy                  1.23.1
pandas                 1.4.3
skyfield               1.43.1

Running on Ubuntu 20.04.

brandon-rhodes commented 2 years ago

If I print(comets) your dataframe, it looks like it's indexed by integer:

     perihelion_year  ...      reference
0               1997  ...      MPC106342
1               2019  ...      NK   1615
2               2018  ...      MPC 75703
3               2019  ...      NK    731
4               2025  ...      MPC 75704
..               ...  ...            ...
946             2025  ...  MPEC 2022-HC2
947             2022  ...  MPEC 2022-GJ3
948             2022  ...  MPEC 2022-L66
949             2017  ...      MPC107687
950             2019  ...  MPEC 2020-K19

The index of a Pandas dataframe is the un-labeled column over to the left of all other columns. So this can be indexed like (to choose an arbitrary index shown above):

>>> print(comets.loc[4])

perihelion_year                                           2025
perihelion_month                                            12
perihelion_day                                         23.6383
perihelion_distance_au                                 3.30022
eccentricity                                          0.210298
argument_of_perihelion_degrees                         161.715
longitude_of_ascending_node_degrees                    285.296
inclination_degrees                                     5.0297
magnitude_g                                               13.5
magnitude_k                                                  2
designation                            P/1999 XN120 (Catalina)
reference                                            MPC 75704
Name: 4, dtype: object

The comets documentation includes a formula for re-indexing the dataframe by comet name:

comets = (comets.sort_values('reference')
          .groupby('designation', as_index=False).last()
          .set_index('designation', drop=False))

print(comets.loc['1P/Halley'])

See whether that maneuver also, on your machine, results in a dataframe that when printed shows the comet name over at the left as the index value.

Bernmeister commented 2 years ago

I completely missed the section in the examples:

comets = (comets.sort_values('reference')
          .groupby('designation', as_index=False).last()
          .set_index('designation', drop=False))

All good and thank you; humble apologies and hanging my head in shame for this blunder!

brandon-rhodes commented 2 years ago

I suspect that the documentation buried the point of that (fairly tangled) line of code a bit too deeply. The next time I'm in that part of the docs, I'll see if I can reformat the comments around it to draw the eye a little more emphatically!

Bernmeister commented 2 years ago

One thing which has bothered me is the difference in which the dataframes between comets and minor planets are treated (prior to being calculated). This is entirely due to my inexperience of pandas/numpy which I'd not met until I started my journey from PyEpehem to Skyfield. Through sheer brute force, I am overly familiar with the file formats for each of comets and minor planets from the Minor Planet Center and have found issues with data for example which you've highlighted in the Skyfield documentation (for example #449).

In the documentation I'm comparing, rightly or wrongly, the comets and minor planets in terms of writing Python. I guess coming from PyEphem, I'm used to one function which takes the comet or minor planet data file and spits out the answer. I have screened (XEphem formatted) data to remove lines containing '****' (#503) and missing the absolute magnitude component.

If at some stage you will put the Kepler orbit documentation page back on on the blocks, some questions which you might consider are:

brandon-rhodes commented 2 years ago

While your guess about the ‘most recent orbit’ logic is a good one, I think the actual trigger for that was a September 2020 episode in which comet NEOWISE had, temporarily, two entries in the comets data file, as mentioned in this comment:

https://github.com/skyfielders/python-skyfield/issues/449

Since we have never yet observed the MPC accidentally adding two entries for the same minor planet to the minor-planets file, the Skyfield example code doesn't have a workaround for it. Maybe someday we could move to having a one-sized-fits-all approach where all data files get the same treatment. But in the interests of simplest-possible (and thus fastest-possible) code, the approach so far has been to apply fixes to only the files that have proved historically to have needed those fixes at least once in their history.