Commas in the MPC CometEls.txt file causes errors.

ursomniac commented 2 years ago

On the most recent versions of the CometEls.txt file, the code mpc.load_comets_dataframe() barfs, with an error of: ParserError: Error tokenizing data. C error: Expected 12 fields in line 522, saw 13

The problem seems to be that (for whatever reason) the CometsEls.txt file sometimes has a comma in the last column, e.g.: 0001P 1986 02 13.7202 0.602070 0.966313 111.8933 58.9646 162.2253 20220129 4.0 6.0 1P/Halley 98, 1083

If you edit the file and replace the commas with spaces, the problem goes away. So, the code in load_comets_dataframe() ought to either pre-sanitize the data or somehow treat commas in the content as text and not delimiters.

brandon-rhodes commented 2 years ago

@ursomniac — Alas, I can't get that error to appear here locally on my laptop. I've downloaded a recent CometEls.txt that, exactly as you say, has commas in it, like this:

0001P         1986 02 17.8475  0.600820  0.966382  112.0058   59.0800  162.2115  20220510   4.0  6.0  1P/Halley                                                 98, 1083

But it's not bothering mpc.load_comets_dataframe() at all. This script runs without error:

    from skyfield.api import load
    from skyfield.data import mpc

    with load.open(mpc.COMET_URL) as f:
        comets = mpc.load_comets_dataframe(f)

    print(len(comets), 'comets loaded')

and prints:

947 comets loaded

Do you have any ideas about how I could get the error to trigger? Is the error specific to your version of Pandas, maybe?

skybber commented 2 years ago

I'm fighting with the same error, I've tried example from docs:

>>> from skyfield.api import load
>>> from skyfield.data import mpc
>>> with load.open(mpc.COMET_URL) as f:
...     comets = mpc.load_comets_dataframe(f)
... 
[#################################] 100% CometEls.txt
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/lada/.local/lib/python3.10/site-packages/skyfield/data/mpc.py", line 189, in load_comets_dataframe
    df = pd.read_csv(io.BytesIO(text), header=None, names=_COMET_FAST_COLUMNS)
  File "/home/lada/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/lada/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/lada/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/home/lada/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/lada/.local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields in line 506, saw 13

jurezakrajsek commented 2 years ago

Not sure if this is related, but there was some errors on the MPC side, the file was not generated correctly and it contained some strange lines.

             **** 01  1       0.00      0.00        0         0         0                  9.0  4.0
    CJ95O010  1997 03 29.6811  0.891645  0.994970  130.4217  282.8643   89.2250  20220607  -2.0  4.0  C/1995 O1 (Hale-Bopp)                                    MPC106342

I have reported this to the MPC and they have resolved it.

brandon-rhodes commented 2 years ago

Thank you for letting them know, @jurezakrajsek — I wasn't yet awake here in my timezone, and it’s always wonderful to wake up and find an issue has already been resolved!

I wonder whether this issue by @ursomniac was related to a transient error in the file on the MPC’s side of things? @ursomniac, could you check whether you still get the parsing error with a fresh version of the file? Thanks!

skybber commented 2 years ago

It seems that the problem with first line is fixed now!

Bernmeister commented 2 years ago

Still hitting the same issue, irrespective of using mpc.COMET_URL or downloading CometEls.txt and manually removing the ',' (from four locations). I also did a check for spurious **** as I've been bitten by that before, but found none.

If it helps @brandon-rhodes to compare against your environment, below is a trimmed list of installed packages on my Ubuntu 20.04

python3 -m pip list
Package                Version
---------------------- --------------------
ephem                  4.1.3
jplephem               2.17
numpy                  1.22.4
pandas                 1.4.2
pytz                   2022.1
sgp4                   2.20
skyfield               1.42

Bernmeister commented 2 years ago

...and if instead I use the slow method:

from skyfield.api import load
from skyfield.data import mpc

url = mpc.COMET_URL
with load.open( url ) as f:
    comets = mpc.load_comets_dataframe_slow( f )
print( len( comets ), 'comets loaded' )

It works: 950 comets loaded.

I guess something is off in the regex/parsing part of load_comets_dataframe() (and I should stress that I see no ',' in CometEls.txt which may suggest this is a separate issue).

brandon-rhodes commented 2 years ago

@ursomniac, @Bernmeister — It looks like the error happens only with very recent versions of Pandas, so I hadn't caught it yet locally. I have just landed a fix that, until the day that the MPC uses the ASCII character Record Separator somewhere inside of a field, should prevent characters in MPC fields from confusing the fast C parser inside of Pandas.

To install the development version of Skyfield to try out the fix, run:

pip install -U https://github.com/skyfielders/python-skyfield/archive/master.zip

The improvement should appear with the next release of Skyfield.

brandon-rhodes commented 2 years ago

(And, per your comments, @Bernmeister, I see that this might not fix things for you? If not then let me know and I'll return to your list of package versions—thanks for providing it!—and see if I can find another important difference between our environments.)

Bernmeister commented 2 years ago

@brandon-rhodes Thanks, I'll give the development version a try; how do I then revert back to the released version (or shouldn't I do that or does pip figure it all out automagically)?

brandon-rhodes commented 2 years ago

Good question. I usually pip uninstall skyfield and then pip install skyfield if I want to know I'm working from a clean slate, but Stack Overflow might offer more interesting variations on the maneuver "return to the most recent official release"!

Bernmeister commented 2 years ago

Unfortunately, the fix did not fix...assuming the update worked as when I run python3 -m pip list I get skyfield 1.42 (and downloaded a fresh CometsEls.txt), I still get the error

Traceback (most recent call last):
  File "testComets.py", line 219, in <module>
  File "testComets.py", line 175, in computeCometsSkyfield
    comets = mpc.load_comets_dataframe( f )
  File "/usr/local/lib/python3.8/dist-packages/skyfield/data/mpc.py", line 189, in load_comets_dataframe
    df = pd.read_csv(io.BytesIO(text), header=None, names=_COMET_FAST_COLUMNS)
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields in line 507, saw 13

and mpc.load_comets_dataframe_slow() works as expected.

brandon-rhodes commented 2 years ago

Alas, you are still using the old code—the read_csv() is now broken across two lines:

    df = pd.read_csv(io.BytesIO(text), sep=_COMET_SEP, header=None,
                     names=_COMET_FAST_COLUMNS)

—but in your traceback I can see that it's only one line, with no sep= specified. (And: yay for tracebacks that make it clear what lines of code look like!) So the pip line you ran to upgrade to the Zip file from GitHub must have failed to give you the new code.

Hmm.

Maybe pip uninstall skyfield and then try again with the following line of code?

pip install -U https://github.com/skyfielders/python-skyfield/archive/master.zip

Bernmeister commented 2 years ago

The fix is in! Many thanks again @brandon-rhodes

FWIW, I sudo pip3 uninstall skyfield, then sudo pip3 cache remove skyfield and installed (as per above) and then verified the code was actually present.

skyfielders / python-skyfield

Commas in the MPC CometEls.txt file causes errors. #707