spacetelescope / exovetter

Exoplanet vetting
https://exovetter.readthedocs.io
BSD 3-Clause "New" or "Revised" License
6 stars 5 forks source link

A sketch of an architecture for the exovetter #11

Closed fergalm closed 4 years ago

fergalm commented 4 years ago

Susan asked me to put together an rough idea for how the code should be laid out. Consider this a rival to PR #9

Comments welcome

fergalm commented 4 years ago

I don't think I agree that we're locked into using dataframes. The rows of a dataframe can be accessed with the same syntax as a dictionary (e.g row = df.iloc[0]; print(row['column1']), so the use of dataframes in this main function doesn't drive design of any other part of the code. For example, if the source of the TCEs becomes a FITS binary table, we would have to re-write this main function, but no other part of the code.

In terms of making this part of the code more object orientated, I'd look to you to make a suggestion here. As I see it this particular part of the code is fundamentally procedural (loop through each tce, then loop through each vetter). I could wrap it in a class, but I'd still have a method or set of methods doing these iterations.

I'm not sure what you mean by "not sure about the loops." Are you saying they're unnecssary, or too nested? I could simplify main by saying

def main():
     tceTable = ...

    for i, row in tceTable.iterrows():
         metrics = vetSingleTce(row, vetterList)

def vetSingleTce(row, vetterList)
     lightcurve = loadLightcurve(...)

    for v in vetterList:
         metric.update( v(row, lightcurve)

Which yields smaller functions (always a good thing), but the same basic design.

jbcurtin commented 4 years ago

My experience as a Data Engineer, the first thing I'll remove to help this code move faster is pandas Dataframes. If we use dask Dataframes, we'll all be learning the API together and I can more easily help speed up the code computation without having to rewrite a majority of the codebase

pandas Dataframes and Dask Dataframes are very different. pandas DFs are created to be human-centric while dask DFs are meant to make it really convenient to run computations concurrently. I wrote a module some months back to streamline the utilization of dask in the my notebooks. https://jbcurtin.io/notebooks/data-scripts-and-Dask.html. ( I'll probably move a lot of the data-scripts stuff into bert-etl )

pllim commented 4 years ago

Re: https://github.com/spacetelescope/exovetter/pull/11#issuecomment-660109990

We might want to avoid tceTable.iterrows and think about vectorization (either with Dask or multiprocessing). But then again, if your focus is small scale usage, maybe the overhead of those thingies are not worth it. So it really depends on how this package intends to be used.

By vectorization, I mean something like vetter_instance.do_some_vetting(tce_table). The method will know how to vectorize.

I think what's missing in this sketch is how you want the user to interface with this package. Perhaps a "test file" under tests to illustrate this like #9 ? I feel like main can be moved there. I don't think there should be a main in the actual package code.

pllim commented 4 years ago

The tce and lightcurve are separate arguments because there isn't a one-to-one mapping between TCEs and lightcurves -- two TCEs can belong to the same star.

That's good to know. Not sure how to best represent this yet but ideally, the user shouldn't have to pass the same lightcurve in over and over again for different TCEs. 💭

fergalm commented 4 years ago

OK, in response to the comments I've made a couple of changes Vetter.apply() is now Vetter.run(). If someone feels strongly about a better name we can change it again.

I've created a function vetTce(). This does all the work needed for a single TCE, and returns a dictionary of vetting parameters.

I've removed the main function. I wasn't intending it to be an argument in favour of dataframes over, say, dask, and we can choose the technology that suits us best once we have some running code to work on.

pllim commented 4 years ago

Re: https://github.com/spacetelescope/exovetter/pull/11#issuecomment-660548536

@fergalm , I don't see the change from apply to run. If that commit is lost in a recent push, you might need to re-push (with force if needed) from your local checkout.

pllim commented 4 years ago

p.s. For future reference, might be safer to open a PR from your own fork.

mustaric commented 4 years ago

I added a file to tests called lpp-test.py to show how I might work with Fergal's way of structuring the code. In the end I'm convinced, but it did bring up some issues about how to pass around light curves. I think we can punt on those issues and write in functions depending on how we want to input and output the data later on.

Likely we want to use lightkurve objects (from the lightkurve software package) to pass around the light curve to our vetters. This gives us the option of storing more than one type of light curve (detrended or not) in the same object. So this means we can then run modshift on two different flux time sieries stored in the same object. But we would initialize the modshift object by specifying the light curve it runs on first (lc = "SAP_FLUX" for example).

Other questions that occurred to me..

fergalm commented 4 years ago

@pllim , indeed, I had commited but not pushed. Code is updated now.