sczesla / PyAstronomy

A collection of astronomy-related routines in Python
152 stars 35 forks source link

Using pandas instead of np.recarray as a data container for exoplanets(EU/ORG) #10

Closed DanielAndreasen closed 8 years ago

DanielAndreasen commented 9 years ago

Hi Stefan,

In issue #8 you mention that it might be a good idea to to pandas. I took the liberty to do so, and I think it is working.
My editor removes trailing whitespaces, so there are many lines edited, which is not important. Sorry for that.

Lastly, I did not change in the examples in the doc, so maybe this pull request should only be accepted, when the docs are ready. But that is up to you.

Cheers, Daniel

sczesla commented 9 years ago

Hi Daniel,

you were faster than I had anticipated (: Thanks for providing the code. As said, pandas was made for the job, and I consider it superior to my prior solution based on dictionaries, which has always been a little on the cumbersome side.

That said, we must keep in mind that some people may actually be using the implementations already, and most people absolutely hate updates breaking their code. I think that should be avoided whenever possible.

At least in the short term, this implies that the old structure should somehow continue to coexist with a newer pandas-based line. I can think of a number of possibilities to achieve that. First, the most straightforward possibility is to create a totally distinct class and mark the other one deprecated (yet still functional). Second, both lines could coexist in the same class, and third, there may be a switch (e.g., pandas=False/True) in the constructor, which would, however, to point to False by default.

While 2 and 3 keep the interface somewhat cleaner by not introducing two classes/options to handle the same thing, they will make it harder to drop the old line at some point.

In any case, this may be a great opportunity to fully support exoplanet.eu, which provides more columns now than it did, when I set up the class. This may also a good time for thinking about some "unified access" to all the DBs. Maybe the documentation of the columns could somehow be created dynamically to handle changes in the DBs, but that is another thing...

Would be great if you could give this a thought, Cheers, Stefan

DanielAndreasen commented 9 years ago

Hi Stefan,

I'm afraid I will not always be as fast :) I think my attempt is still lacking some stuff (e.g. a more thorough testing, which I did not do). Also, I fully agree with being compatible with older versions. I didn't even think of that. But we can maybe use this as a starting point and combine it.

To have a switch of pandas=True/False seems like the best solution. Maybe do as numpy, scipy, astropy, etc. and give a deprecated warning if sat to False (even if that is the default), so people will move towards the pandas solution.

Did you ever consider to split up PyAstronomy? It is starting to cover a lot of different things. E.g. if the DB functionality is expanded, it may be better to have that seperately. Kind of what astropy does with affiliated packages. This is probably another discussion al together.
Speaking of DB, there is actually a nice package that does a lot of stuff, astroquery. I played around with it, but the problem I have with astropy and the affiliated packages is, that it miss basic functionality, like RV-shift a spectrum. That is why I use PyAstronomy.

sczesla commented 9 years ago

Hi Daniel,

at least this time you were impressively fast again... Right, the code is a good starting point. I will set up a topic branch to move on.

Well, yes, I played with the idea of splitting PyA several times. So far, however, I have always refrained from it and I have no plans to do a split soon. As you say, that is another discussion. BTW, the relation between the astropy affiliates and the core is similar to what I envisaged for pyaC(ore) and the other subpackages. Admittedly, this is currently only quite true for error handling and the access to locally stored data.

Although I do not currently favor a split, I do see your point and I agree that the DB functionality, in particular when it grows, should should be "bundled", keeping it within the PyA framework. I was never fully happy with the "resource based helper" caption within pyasl under which these classes are currently listed. Maybe a more specific "subsection" in pyasl could already help a lot. This would, of course, not be a code-based separation.

Thanks for pointing out astroquery. It is definitely a great tool. Even had it installed...What I was thinking about when I spoke about a "unification" of access here was to somehow unify the various column names the data bases assign to their quantities. I always found this a little cumbersome to deal with, but not an epic problem, too. That would need some handcraft, but not much more.

Cheers, Stefan

DanielAndreasen commented 9 years ago

Hi Stefan,

I think you should take the first step, then I will help you out. Feel free to ask for help, even though you can do it yourself, we can do it faster together. I already pulled the new branch to my system, so we should be ready. The unification you speak about sounds interesting. I guess this will allow to easily compare different DBs with one another.

For all your comments, I agree, but in the end, it is up to you how you want to organize the code.

As you may know by now, I like to work on projects like this.

DanielAndreasen commented 9 years ago

Btw, this is pure laziness, and I should really create a whole new issue or even a pull request, but the task is so small that it seems too much.

In my fork, I renamed README to README.md and made very few changes in the syntax so it simply looks better (markdown highlighting). It is something PyA has missed in my opinion :) Take a look: https://github.com/DanielAndreasen/PyAstronomy

sczesla commented 9 years ago

Cool. Thanks for the markup. I picked your commits.

You are right, I will make a first step on the pandas issue -- will perhaps take a few days... Then we are two who like to work on such stuff...

DanielAndreasen commented 8 years ago

I think we are not working on this anymore.

sczesla commented 8 years ago

Hi Daniel,

sorry for the immense delay. I was simply very distracted by all kinds of other obligations. However, I never lost this completely out of sight, and I do plan to implement this. I am just slow...

Cheers, Stefan

DanielAndreasen commented 8 years ago

Hi Stefan,

I actually forgot what it was about. Let me know if you need some help.

Cheers, Daniel

sczesla commented 8 years ago

Hi Daniel,

we had been talking about using pandas as a DB a while ago. Finally, I had some time to go back and have a look at least at exoplanet.eu. I realized that the VO table they provide is by far the most complete one in terms of meta data, which is why I consider that one the one to go for.

I have implemented the ExoplanetEU2 class, which uses the VO table (please mind to checkout the exoplanetVO branch to take a look). What do you think?

Note that I first thought about ExoplanetVO as a name. However, as most people would not care whatever table is used to fetch the data, I simply went for the '2', which seems more like a natural successor...

Cheers, Stefan

DanielAndreasen commented 8 years ago

Hi Stefan,

I took a quick look at the source code, but it's too late to play around with it now. That will have to wait for tomorrow (: However, it seems very nice, and with the ability to convert to pandas dataframes, this will cover what everybody wants. I've never worked with VO tables before, but it will be interesting to see more about it. Will give you my feedback later.

Cheers, Daniel

sczesla commented 8 years ago

Hi Daniel,

definitely too late. The actual DB is now an astropy table, however, export to pandas (and actually also other formats) is trivial. Just for convenience, I quote the code example below.

Cheers, Stefan

from PyAstronomy import pyasl
import matplotlib.pylab as plt

# Instantiate exoplanetEU2 object
v = pyasl.ExoplanetEU2()

# Show the available data
v.showAvailableData()
print

# Get a list of all available column names
acs = v.getColnames()
print "Available column names: " + ", ".join(acs)
print

# Select data by planet name (returns a dictionary)
print v.selectByPlanetName("CoRoT-2 b")
print

# Get all data as an astropy table
at = v.getAllDataAPT()

# Export all data as a pandas DataFrame
pd = v.getAllDataPandas()

# Plot mass vs. SMA
plt.title("Mass vs. SMA")
plt.xlabel("[" + v.getUnitOf("mass") + "]")
plt.ylabel("[" + v.getUnitOf("semi_major_axis") + "]")
plt.loglog(at["mass"], at["semi_major_axis"], 'b.')
plt.show()
DanielAndreasen commented 8 years ago

Hi again Stefan,

I really like this version. It's easy to work with, and easy to debug if needed in the future.

One thing I thought about is regarding the selectByPlanetName method. It's almost impossible to read. Maybe we could implement a __str__ and a __repr__ method like mentioned in the accepted SO answer here. Then it can be printed in a nice and readable way. E.g. by putting the error bars directly on the parameters instead of listing them separately. What do you think?

sczesla commented 8 years ago

Hi Daniel,

yes, definitely a good idea. selectPlanetByName returns only a dictionary, which is of course arbitrarily complex to disentangle on screen. I am not sure whether a __str__ or __repr__ are the ways to go here as long as ExoplanetEU2 is concerned. Maybe if selectPlanetByName would return a class object with such a method. Nonetheless, I set up the readableInfo method, which produces output along the lines you indicated.

from PyAstronomy import pyasl
v = pyasl.ExoplanetEU2()
v.readableInfo("CoRoT-2 b")

Not sure whether the name makes a lot of sense, but anyway, what do you think?

Stefan

DanielAndreasen commented 8 years ago

Only comment is, that I would probably not have the function return something. When I run it in a python/IPython shell, it prints the output very nicely (great work by the way), and then gives me a list which is also printed. Otherwise, I like the function. You could maybe merge readableInfo and selectByPlanetName, so the latter returns the dictionary as it already does, but print it in the way the new function does. Then you don't have two functions which does almost the same.

sczesla commented 8 years ago

Thank you Daniel. This is the idea I was searching for. I merged the output into selectByPlanetName, which seems much more natural.

Although some similar surgery may be desirable for the other data bases, I suggest to close this for the moment.

Cheers, Stefan

DanielAndreasen commented 8 years ago

It's perfect. And indeed, I thought of something like a verbose flag, which you also implemented! :+1: