skyfielders / python-skyfield

Elegant astronomy for Python
MIT License
1.41k stars 211 forks source link

ValueError when reading hip_main.dat.gz #130

Closed bj0 closed 6 years ago

bj0 commented 7 years ago

I just tried to import the stars using skyfield.data.hipparcos.load. It downloads the file and then loads the first 420 stars, but throws ValueError on star 421. Looking at the data file some of the fields are blank for that star, which skyfield doesn't handle.

brandon-rhodes commented 7 years ago

Thanks for the report! I will try to get to this before PyCon, but it might instead be after the conference that I have time to tackle it (though, of course, anyone else is free to try to create a fix in the meantime). I assume that I can reproduce this error using the default hip_main.dat.gz file whose URL is at the top of the hipparcos.py file?

bj0 commented 7 years ago

Probably, I didn't change anything so it's just downloading it. I deleted it and let it download it again but it didn't change anything.

brandon-rhodes commented 7 years ago

I have downloaded the file and there was no error when Skyfield parsed it. Is it possible, since the link to the file isn’t an https link but instead just a plain-text non-error-corrected protocol, that there is a bad byte in your copy of the file that happened as it was downloaded? If you run the following code, you can find out if our two copies of the file are exactly the same:

from hashlib import sha256
content = open('hip_main.dat.gz', 'rb').read()
print(sha256(content).hexdigest())

My version prints:

96fe641a7112f41893f6632d91762179c7f3cf072a16d2beff62e37d5b3db65c
bj0 commented 7 years ago

I lost the files I had been using before, but I re-created the script and it re-downloaded the hip_main file. It's hash matches yours and it still produces the error on star 421.

Here's the script that causes it:

from skyfield.data.hipparcos import load

stars = []
for star in load(lambda x: True):
#    print(star)
    stars.append(star)

print(stars)
bj0 commented 7 years ago

oops, hit wrong button

brandon-rhodes commented 7 years ago

I see — you are trying to load every star in the database, even though there are stars without valid fields defined for decimal RA and Dec. I had only ever tried to load specific stars, never all of them at once.

If an attempt is made to load a start without decimal RA and Dec, I wonder what should happen?

bj0 commented 7 years ago

I'm not sure, an error saying what the actual problem is might be useful, but if the RA/Dec are there in HMS/DMS, then I don't know why they aren't there in DD. The conversion is pretty simple, and why would they be in the catalog if they are invalid?

ghost commented 7 years ago

In a fit of whimsy, I did perl -nle '$n++; print "$n ",substr($_,87,8)' hip_main.dat (after uncompressing hip_main.dat.gz) and found what I expected: star 421 is the first star for which that field, ra_mas_per_year, is blank. This code casts the field to float: https://github.com/skyfielders/python-skyfield/blob/master/skyfield/data/hipparcos.py#L16

Probable fix: Check all fields for empty string and set to NULL as needed.

brandon-rhodes commented 6 years ago

It won't come out in a Skyfield release for a few more days, but for those of you working from master, check out the load_dataframe() method in hipparcos.py. It needs a bit of cleanup (it still prints the number of seconds required to load the catalog, for example!), but it loads the entire Hipparcos catalog for me in about 1 second.

As you'll see from my heavily in-progress work in charting.py, the combination of Pandas and NumPy means that you can compute the coordinates of the entire sky full of stars with:

    c = catalog
    c = c[c['magnitude'] <= limiting_magnitude]
    c = c.sort_values('magnitude', ascending=False)
    s = Star(ra_hours=c.ra_hours, dec_degrees=c.dec_degrees)
    spos = o.observe(s)

While I hesitate to ever ask folks to install additional dependencies, I think that Pandas is such a huge improvement in how to handle thousands of stars over earlier approaches involving separate Star objects that I might think of making this standard. Also, it's so much more fun to let Pandas do the parsing of the text file instead of doing it by hand! All I had to type was:

    names, colspecs = zip(
        ('hip', (2, 14)),
        ('magnitude', (41, 46)),
        ('ra_degrees', (51, 63)),
        ('dec_degrees', (64, 76)),
    )

—and Pandas did everything else!

brandon-rhodes commented 6 years ago

I'm going to close this for now, since it looks like the Pandas approach will work well — feel free to open further issues as you run into problems or new ideas, though!

brandon-rhodes commented 6 years ago

In a few minutes I'll be releasing Skyfield 1.7 with the documentation on Stars rewritten to highlight the ability to load the Hipparcos library from a Pandas dataframe. Hopefully we'll get feedback on how the feature works for real users!

rtphokie commented 5 years ago

The default value for hipparcos.URL (ftp://cdsarc.u-strasbg.fr/cats/I/239/hip_main.dat.gz) appears to be offline. I've had good luck switching that to http://cdsarc.u-strasbg.fr/ftp/cats/I/239/hip_main.dat.gz