KeyError when loading comet

Bernmeister commented 2 years ago

Unsure if this is related to #707 but I've not run my comet code since then and noticed today I'm getting a KeyError when referring to a comet name. Here is sample code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from skyfield.api import load
from skyfield.constants import GM_SUN_Pitjeva_2005_km3_s2 as GM_SUN
from skyfield.data import mpc

url = "file://./CometEls.txt" 
# url = mpc.COMET_URL # Same result

with load.open( url ) as f:
    comets = mpc.load_comets_dataframe( f) 
    # comets = mpc.load_comets_dataframe_slow( f) # Same result

print( len( comets ), "comets loaded" )

ts = load.timescale()
ephemeris = load( "de421.bsp" )
sun, earth = ephemeris[ "sun" ], ephemeris[ "earth" ]

row = comets.loc[ "1P/Halley" ] # KeyError: '1P/Halley'
comet = sun + mpc.comet_orbit( row, ts, GM_SUN )
t = ts.utc( 2022, 5, 31 )
ra, dec, distance = earth.at( t ).observe( comet ).radec()
print( ra )
print( dec )

I get KeyError: '1P/Halley' at the line when extracting the row.

I have run pip3 install --upgrade jplephem numpy pandas skyfield and those packages are at version:

Package                Version
---------------------- --------------------
jplephem               2.17
numpy                  1.23.1
pandas                 1.4.3
skyfield               1.43.1

Running on Ubuntu 20.04.

brandon-rhodes commented 2 years ago

If I print(comets) your dataframe, it looks like it's indexed by integer:

     perihelion_year  ...      reference
0               1997  ...      MPC106342
1               2019  ...      NK   1615
2               2018  ...      MPC 75703
3               2019  ...      NK    731
4               2025  ...      MPC 75704
..               ...  ...            ...
946             2025  ...  MPEC 2022-HC2
947             2022  ...  MPEC 2022-GJ3
948             2022  ...  MPEC 2022-L66
949             2017  ...      MPC107687
950             2019  ...  MPEC 2020-K19

The index of a Pandas dataframe is the un-labeled column over to the left of all other columns. So this can be indexed like (to choose an arbitrary index shown above):

>>> print(comets.loc[4])

perihelion_year                                           2025
perihelion_month                                            12
perihelion_day                                         23.6383
perihelion_distance_au                                 3.30022
eccentricity                                          0.210298
argument_of_perihelion_degrees                         161.715
longitude_of_ascending_node_degrees                    285.296
inclination_degrees                                     5.0297
magnitude_g                                               13.5
magnitude_k                                                  2
designation                            P/1999 XN120 (Catalina)
reference                                            MPC 75704
Name: 4, dtype: object

The comets documentation includes a formula for re-indexing the dataframe by comet name:

comets = (comets.sort_values('reference')
          .groupby('designation', as_index=False).last()
          .set_index('designation', drop=False))

print(comets.loc['1P/Halley'])

See whether that maneuver also, on your machine, results in a dataframe that when printed shows the comet name over at the left as the index value.

Bernmeister commented 2 years ago

I completely missed the section in the examples:

comets = (comets.sort_values('reference')
          .groupby('designation', as_index=False).last()
          .set_index('designation', drop=False))

All good and thank you; humble apologies and hanging my head in shame for this blunder!

brandon-rhodes commented 2 years ago

I suspect that the documentation buried the point of that (fairly tangled) line of code a bit too deeply. The next time I'm in that part of the docs, I'll see if I can reformat the comments around it to draw the eye a little more emphatically!

Bernmeister commented 2 years ago

One thing which has bothered me is the difference in which the dataframes between comets and minor planets are treated (prior to being calculated). This is entirely due to my inexperience of pandas/numpy which I'd not met until I started my journey from PyEpehem to Skyfield. Through sheer brute force, I am overly familiar with the file formats for each of comets and minor planets from the Minor Planet Center and have found issues with data for example which you've highlighted in the Skyfield documentation (for example #449).

In the documentation I'm comparing, rightly or wrongly, the comets and minor planets in terms of writing Python. I guess coming from PyEphem, I'm used to one function which takes the comet or minor planet data file and spits out the answer. I have screened (XEphem formatted) data to remove lines containing '****' (#503) and missing the absolute magnitude component.

If at some stage you will put the Kepler orbit documentation page back on on the blocks, some questions which you might consider are:

Why is printing comets, print( len( comets ), 'comets loaded' ) different to printing minor planets, print( minor_planets.shape[ 0 ], 'minor planets loaded' )?
It took me some time to register that designation in minor_planets = minor_planets.set_index( 'designation', drop=False ) referred to mpc._MPCORB_COLUMNS. When I see these things now it is obvious; perhaps a one liner (either a comment in the example or an explicit line of text) mentioning the reference to the column names and where to find them?
In the comets section, you mention keeping only the most recent orbit. I presumed that one line (in CometEls.txt) corresponded to one orbit and if that is the case, I've not seen more than one line per comet. Does (should) this also apply then for minor planets? Further to this point, in the code comets = ( comets.sort_values( 'reference' ).groupby( 'designation', as_index = False ).last().set_index( 'designation', drop = False ) ), why sort by reference rather than designation? EDIT: I've since understood the logic behind "most recent orbit". Take for example comet 332P (Ikeya-Murakami). This comet comprises several fragments, each with its own entry in the data file. I can now see why you sought to eliminate all but the most recent "orbit", although, I'm unsure why you used the 'reference' field to sort rather than say 'designation' and take the first (as it might/should be the brightest).

brandon-rhodes commented 2 years ago

While your guess about the ‘most recent orbit’ logic is a good one, I think the actual trigger for that was a September 2020 episode in which comet NEOWISE had, temporarily, two entries in the comets data file, as mentioned in this comment:

https://github.com/skyfielders/python-skyfield/issues/449

Since we have never yet observed the MPC accidentally adding two entries for the same minor planet to the minor-planets file, the Skyfield example code doesn't have a workaround for it. Maybe someday we could move to having a one-sized-fits-all approach where all data files get the same treatment. But in the interests of simplest-possible (and thus fastest-possible) code, the approach so far has been to apply fixes to only the files that have proved historically to have needed those fixes at least once in their history.

skyfielders / python-skyfield

KeyError when loading comet #772