skyfielders / python-skyfield

Elegant astronomy for Python
MIT License
1.41k stars 212 forks source link

Comet packed designation wrong for fragmented periodic comets #965

Open jurezakrajsek opened 4 months ago

jurezakrajsek commented 4 months ago

I noticed that for some comets the packed designation is parsed wrong This is the MPC 1-line orbital elements beeing parsed 0323P b 2025 12 16.3240 0.040025 0.986146 353.9756 323.4582 5.4670 20240331 26.0 4.0 323P-B/SOHO MPEC 2024-F21

And the output of the mpc.load_comets_dataframe_slow(io.BufferedReader(orbit))

designation                            323P-B/SOHO
number                                         323
orbit_type                                       P
designation_packed                               b
perihelion_year                               2025
perihelion_month                                12
perihelion_day                             16.3223
perihelion_distance_au                    0.040033
eccentricity                              0.986143
argument_of_perihelion_degrees            353.9749
longitude_of_ascending_node_degrees       323.4587
inclination_degrees                         5.4663
perturbed_epoch_year                          2024
perturbed_epoch_month                            4
perturbed_epoch_day                             13
magnitude_g                                   26.0
magnitude_k                                    4.0
reference                                MPEC 2024
Name: 323P-B/SOHO, dtype: object

I would guess that for P type orbit the packed designation should be nan but in this case it reads the fragment part.

Here is a short script to reproduce:

import io
from skyfield.data import mpc

comet_orbit = '0323P      b  2025 12 16.3240  0.040025  0.986146  353.9756  323.4582    5.4670  20240331  26.0  4.0  323P-B/SOHO                                              MPEC 2024-F21'
orbit = io.BytesIO(bytes(comet_orbit, 'utf-8'))
c = mpc.load_comets_dataframe_slow(io.BufferedReader(orbit))
c = (c.groupby('designation', as_index=False).last()
      .set_index('designation', drop=False))
row = c.iloc[0]
print(row)
brandon-rhodes commented 4 months ago

I would guess that for P type orbit the packed designation should be nan but in this case it reads the fragment part.

Let's read down through the whole file to see how things look. This little command line:

cat CometEls.txt | awk '{print substr($0,1,24)}'

—shows that the file is cleanly split between lines with designations in one format, and a second section that uses another. Omitting long stretches of lines with the same format, here's what we see:

    CJ95O010  1997 03 29
    PJ96R020  2019 02 12
    PJ98V24S  2018 01 19
...
    CK22D020  2022 03 27
    PK22E010  2022 10 10
    CK22E020  2024 09 13
    CK22E030  2023 01 12
    CK22F010  2022 07 28
    CK22F020  2022 03 23
    CK22H010  2024 01 18
    CK22J010  2022 02 16
0001P         1986 02 17
0002P         2023 10 20
0004P         2021 09  8
...
0050P         2024 05 12
0051P      a  2022 10  3
0051P      d  2022 10  1
0052P         2021 10  5
...
0441P         2025 09  7
0001I         2017 09  9
0002I         2019 12  8

The first thing we can note is that we can't, alas, use your suggestion of using the orbit type P to decide between the two formats, because there are P-type comets in both sections.

The second thing we note is that, as you correctly note, the same field is used sometimes for the (mostly blank) fragment number (I included comet 0051P specifically so we'd get to see an example of a case where the field wasn't blank), and sometimes it's used for a modern packed designation like the ones used for minor planets.

For reference, here's how the format is described on their website:

https://www.minorplanetcenter.net/iau/info/CometOrbitFormat.html

Columns   F77    Use

    1 -   4  i4     Periodic comet number
    5        a1     Orbit type (generally `C', `P' or `D')
    6 -  12  a7     Provisional designation (in [packed form](https://www.minorplanetcenter.net/iau/info/PackedDes.html))

Which makes it sound so simple! But over in their text description of the format, they describe two different ways the fields can be used, with the "packed form" designation instead being a fragment number:

https://www.minorplanetcenter.net/iau/info/PackedDes.html

So the problem here is that Pandas doesn't allow columns to have different names depending on their content. There's no way (at least that I know of?) to tell Pandas, "the third column of this file should be called 'fragment number' when there's a non-empty comet number in columns 1-4, but should be called 'packed designation' when columns 1-4 are blank."

So in my design for the column numbers, I tried to just stick with the "official" names of the columns given in the first link above.

Before wading into further details, let's step back a moment. When I wrote up the code to handle comments, I didn't pay attention to those first columns, because the lines of the file are now much longer than in the old days, and include the full text of the comet name way over to the right. In your example, we can just read the full name 323P-B/SOHO from the big field way over to the right, without needing to wade into the fields on the left. I guess the issue you've encountered is that the full name does not include the fragment number, the way the MPC have designed that new full-name field?

So what you want is a way to find the fragment number. Hmm.

Maybe the Skyfield documentation on comets could include a little snippet showing how to get the fragment number? It might work something like this:

import io
from skyfield.data import mpc

for comet_orbit in [
        # Three example lines from CometEls.txt of May 10 2022:

        '    CJ95O010  1997 03 29.6937  0.892238  0.994969  130.4267  282.8976   89.2092  20220510  -2.0  4.0  C/1995 O1 (Hale-Bopp)                                    MPC106342',
        '    PJ98V24S  2018 01 19.1688  3.427137  0.242855  244.6040  159.0401    5.0266  20220510  13.0  2.0  P/1998 VS24 (LINEAR)                                     MPC 75703',
        '0343P         2017 01 29.8032  2.266196  0.584845  137.4257  257.1514    5.5878  20220510  14.0  4.0  343P/NEAT-LONEOS                                         MPC109148',

        # Your example line:

        '0323P      b  2025 12 16.3240  0.040025  0.986146  353.9756  323.4582    5.4670  20240331  26.0  4.0  323P-B/SOHO                                              MPEC 2024-F21',
        ]:

    orbit = io.BytesIO(bytes(comet_orbit, 'utf-8'))
    c = mpc.load_comets_dataframe_slow(io.BufferedReader(orbit))
    c = (c.groupby('designation', as_index=False).last()
         .set_index('designation', drop=False))
    row = c.iloc[0]

    number = row.number
    if number > 0:
        fragment_number = row.designation_packed
    else:
        fragment_number = row.designation_packed[6].rstrip('0')

    print('Orbit type {!r}  Designation {!r}  Fragment number {!r}'
          .format(row.orbit_type, row.designation, fragment_number))

Hmm. There's a complication here. Because we're only giving one line at a time, Pandas doesn't have the full CometEls.txt to examine, and has to guess about column data types based on only one example line. Thus the output of my little example:

Orbit type 'C'  Designation 'C/1995 O1 (Hale-Bopp)'  Fragment number ''
Orbit type 'P'  Designation 'P/1998 VS24 (LINEAR)'  Fragment number 'S'
Orbit type 'P'  Designation '343P/NEAT-LONEOS'  Fragment number nan
Orbit type 'P'  Designation '323P-B/SOHO'  Fragment number 'b'

— gets a fragment number of nan, a floating point value, where we really want '' the empty string, because Pandas is guessing that the blank column is a float rather than a string. I guess we could spruce up the import a bit to know more about types.

But, first, I'll stop here to hear your comments, to learn whether getting the fragment value like this would answer your purpose — or whether there's some other underlying goal?

jurezakrajsek commented 4 months ago

Hi Brandon

Yes the format is really anoying and complicated :( The packed designation is very useful to use with astroquery.mpc to get ephemeris from the MPC website.

My idea was that for the numbered comets packed designation is constructed as f'{number}{orbit_tyxpe}' In case that a fragment is found in the designation_packed f'{number}{orbit_tyxpe}-{fragment_letter}'

For other comets the fragment is as I see always the last letter in the designation_packed if it is lower case

I have constructed a function that can be used and applied on the mpc.load_comets_dataframe_slow pandas dataframe. It overwrites the designation_packed for numbered comets and also adds a new column for fragment_letter might be useful in some cases. This way the designation_packed can always be used with astroquery.mpc

import io
import pandas as pd
import numpy as np
from skyfield.data import mpc

def parse_fragment_letter(df):
    if df.number > 0:
        if pd.isna(df.designation_packed):
            df['fragment_letter'] = np.nan
            df['designation_packed'] = f'{df.number:.0f}{df.orbit_type}'
        else:
            fragment_letter = df.designation_packed
            df['fragment_letter'] = fragment_letter
            df['designation_packed'] = f'{df.number:.0f}{df.orbit_type}-{fragment_letter.upper()}'
    else:
        if df.designation_packed[6].islower():
            df['fragment_letter'] = df.designation_packed[6]
        else:
            df['fragment_letter'] = np.nan
    return df

comet_df = pd.DataFrame()
for comet_orbit in [
        '    CJ95O010  1997 03 29.6937  0.892238  0.994969  130.4267  282.8976   89.2092  20220510  -2.0  4.0  C/1995 O1 (Hale-Bopp)                                    MPC106342',
        '    PJ98V24S  2018 01 19.1688  3.427137  0.242855  244.6040  159.0401    5.0266  20220510  13.0  2.0  P/1998 VS24 (LINEAR)                                     MPC 75703',
        '0343P         2017 01 29.8032  2.266196  0.584845  137.4257  257.1514    5.5878  20220510  14.0  4.0  343P/NEAT-LONEOS                                         MPC109148',
        '0323P      b  2025 12 16.3240  0.040025  0.986146  353.9756  323.4582    5.4670  20240331  26.0  4.0  323P-B/SOHO                                              MPEC 2024-F21',
        '0332P         2021 08 18.2949  1.576824  0.489303  152.3122    3.7703    9.3804  20230225  18.0  4.0  332P/Ikeya-Murakami                                      MPEC 2016-K18',
        '0332P      a  2021 08 18.4161  1.577135  0.489483  152.3403    3.7861    9.3800  20210814   9.0  4.0  332P-A/Ikeya-Murakami                                    MPEC 2016-K18',
        '0332P      b  2016 03 17.1650  1.572891  0.489707  152.3711    3.8028    9.3805            19.0  4.0  332P-B/Ikeya-Murakami                                    MPEC',
        '    CK20P04a  2020 08  8.4682  0.090986  1.002140  177.5194  175.1827   26.3726  20230225   5.3  6.0  C/2020 P4-A (SOHO)                                       MPEC 2020-QN1',
        '    CK20P04b  2020 08  8.3493  0.098737  1.000534  173.4621  171.4740   27.3278  20230225   0.0  6.0  C/2020 P4-B (SOHO)                                       MPEC 2020-QN1'
        ]:

    orbit = io.BytesIO(bytes(comet_orbit, 'utf-8'))
    c = mpc.load_comets_dataframe_slow(io.BufferedReader(orbit))
    c = (c.groupby('designation', as_index=False).last()
         .set_index('designation', drop=False))
    comet_df = pd.concat([comet_df, c])

comet_df = comet_df.apply(parse_fragment_letter, axis=1)
print(comet_df)
                                 designation  number orbit_type designation_packed  perihelion_year  ...  perturbed_epoch_day  magnitude_g  magnitude_k  reference  fragment_letter
designation                                                                                          ...
C/1995 O1 (Hale-Bopp)  C/1995 O1 (Hale-Bopp)     NaN          C            J95O010             1997  ...                 10.0         -2.0          4.0  MPC106342              NaN        
P/1998 VS24 (LINEAR)    P/1998 VS24 (LINEAR)     NaN          P            J98V24S             2018  ...                 10.0         13.0          2.0  MPC 75703              NaN        
343P/NEAT-LONEOS            343P/NEAT-LONEOS   343.0          P               343P             2017  ...                 10.0         14.0          4.0  MPC109148              NaN        
323P-B/SOHO                      323P-B/SOHO   323.0          P             323P-B             2025  ...                 31.0         26.0          4.0  MPEC 2024                b        
332P/Ikeya-Murakami      332P/Ikeya-Murakami   332.0          P               332P             2021  ...                 25.0         18.0          4.0  MPEC 2016              NaN        
332P-A/Ikeya-Murakami  332P-A/Ikeya-Murakami   332.0          P             332P-A             2021  ...                 14.0          9.0          4.0  MPEC 2016                a        
332P-B/Ikeya-Murakami  332P-B/Ikeya-Murakami   332.0          P             332P-B             2016  ...                  NaN         19.0          4.0       MPEC                b        
C/2020 P4-A (SOHO)        C/2020 P4-A (SOHO)     NaN          C            K20P04a             2020  ...                 25.0          5.3          6.0  MPEC 2020                a        
C/2020 P4-B (SOHO)        C/2020 P4-B (SOHO)     NaN          C            K20P04b             2020  ...                 25.0          0.0          6.0  MPEC 2020                b        

[9 rows x 19 columns]
brandon-rhodes commented 4 months ago

Thanks for providing more example code!

I agree that it will be nice, either in the documentation or a provided function, for Skyfield to show folks how to handle the inconsistent way that MPC represents comet fragments. But I suspect I should finish out a few items that I'm already working on for Skyfield first, and then once a new release is out, return to this topic. I'll keep it open, though, and hopefully other folks who need your code can cut-and-paste it from here in the meantime. Thanks!

jurezakrajsek commented 4 months ago

@brandon-rhodes Is there a way that I can help you with this function? We need to zero-padd the number before making the MPC packed designation for numbered comets. The comet number is stored zero-padded right-justified in columns 1-4. Column 5 usually contains "P"

Something like that: df['designation_packed'] = f'{int(df.number):04d}{df.orbit_type}'

So packed designation for numbered comets would be something like: '0343P' or '0332P-A'

For astroquery a packed designation for numbered comets must contain the object type (example: '0343P'). If you send a query without object type (example: '0343') as object ID you get the ephemeris for asteroid 343 instead of comet 343P. For other comets both options work (example: 'J98V24S' or 'PJ98V24S').