Clean API Core needed - Githubissues

mfouesneau commented 8 years ago

There is more and more spikes added to big pile of scripts. Examples are not clear or completely absent. It makes it very hard to even attempt to integrate existing code to anything.

A solid API core is needed to properly build tools upon it.

philrosenfield commented 8 years ago

There is more and more spikes added to big pile of scripts.

Looking at the git log, I'm guessing this is directed at my recent commits. The big spike you're seeing is likely the example data I added yesterday. (In the future, it would be better to host example data off site so it doesn't add to the repo size.)

By big pile of scripts, are you referring to the entire repo or the directory called "scripts?" The directory is simply my repo and Dan and I haven't yet decided on a way to merge them without breaking our workflows.

Examples are not clear or completely absent.

Examples and template files are in the works, what exactly were you looking for, perhaps we can make that a priority.

It makes it very hard to even attempt to integrate existing code to anything. A solid API core is needed to properly build tools upon it.

Unless I'm deeply confused, a solid API core and the ability for others to build tools upon this code base has not yet been on a to-do list let alone listed as a priority.

So far, this repo is student-focused, not general researcher-focused. When Dan approached me with the idea of creating a repo to aid future students learning to use MATCH, I imagined command line utilities that could seem like an extension to how they interacted with MATCH.

It would be great if we could build an API and refactor the code base such that it continues to be a set of command line utilities picking up where MATCH has left off as well as a place for other to build tools off of. If I'm going to be the one doing that development, I could use an example or two of a python API that you think would be useful to imitate.

mfouesneau commented 8 years ago

I looked into how to integrate existing code into my example to make it useful for others, but it's impossible to figure at this stage on my own. Now it makes more sense if this is not the point of it, so that's fine.

My strategy, if I had time, would be to merge reading and writing MATCH files into one package, one to plot things. Then you can build around to get command line and other tools. It's fine to have many submodules as long as they are documented or used in a clear example.

philrosenfield commented 8 years ago

I think we're actually on the same page on this. It'd be useful to have a concrete example or two of what thinking of so I know I'm on at least a parallel track.

Here's a guess: https://github.com/dweisz/match/blob/master/scripts/examples/Make%20a%20calcsfh%20input%20parameter%20file.ipynb

dweisz commented 8 years ago

Part of the reason for the lack of a clean API is that we haven't decided how to handle the main use cases. For example, a regular calcsfh run vs. an ssp require slightly different input/output. This isn't an insurmountable hurdle, but it's not clear if we should have a single package that handles all MATCH reading/writing, or separate routines for ssp vs. non-SSP.

There are other minor issues like how to deal with arbitrary time bin resolution. The answer is probably to pass two arrays, one with time 'break points' and the other with the desired time resolution per specified internal (e.g., I want 0.1 dex resolution from log(t) = 6.6 - 9.0 and 0.05 dex resolution from log(t) = 9.0 - 10.15)

Phil, that notebook provides several great examples.

I just have been swamped and haven't had much time to contribute lately.

philrosenfield commented 8 years ago

Thanks, Dan. I think that break points wouldn't be too difficult to implement.

What about adding tbreakas a list of break points and allowing tbin to be an arraylen(tbreak) + 1 so in your example, tbreak = [9.0], tbin=[0.1, 0.05].

dweisz commented 8 years ago

I think you’d want to specify the upper and lower bounds too. These vary from model to model, and it’s not always that case that you’ll want to go from youngest to oldest. e.g., might want

tbreak = [7.0, 8.0, 9.0, 10.0], tbin=[0.1, 0.05, 0.02]

where the first and last values of tbreak are then tmin and tmax. What do you think?

On Jun 7, 2016, at 9:16 AM, Phil Rosenfield notifications@github.com wrote:

Thanks, Dan. I think that break points wouldn't be too difficult to implement.

What about adding tbreakas a list of break points and allowing tbin to be an arraylen(tbreak) + 1 so in your example, tbreak = [9.0], tbin=[0.1, 0.05].

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dweisz/match/issues/11#issuecomment-224332693, or mute the thread https://github.com/notifications/unsubscribe/ABKnY0E8I9KNq3kyMM3bsTcZy_5CHlyiks5qJZlZgaJpZM4Ivq41.

mfouesneau commented 8 years ago

If you have the parameter file, I guess you have them all?

On Tue, Jun 7, 2016 at 6:24 PM Dan Weisz notifications@github.com wrote:

I think you’d want to specify the upper and lower bounds too. These vary from model to model, and it’s not always that case that you’ll want to go from youngest to oldest. e.g., might want

tbreak = [7.0, 8.0, 9.0, 10.0], tbin=[0.1, 0.05, 0.02]

where the first and last values of tbreak are then tmin and tmax. What do you think?

On Jun 7, 2016, at 9:16 AM, Phil Rosenfield notifications@github.com wrote:

Thanks, Dan. I think that break points wouldn't be too difficult to implement.

What about adding tbreakas a list of break points and allowing tbin to be an arraylen(tbreak) + 1 so in your example, tbreak = [9.0], tbin=[0.1, 0.05].

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/dweisz/match/issues/11#issuecomment-224332693>, or mute the thread < https://github.com/notifications/unsubscribe/ABKnY0E8I9KNq3kyMM3bsTcZy_5CHlyiks5qJZlZgaJpZM4Ivq41 .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dweisz/match/issues/11#issuecomment-224334938, or mute the thread https://github.com/notifications/unsubscribe/ABoANK_Lyt2juOS2eyDnjNOf_m-3U0g9ks5qJZsWgaJpZM4Ivq41 .

philrosenfield commented 8 years ago

I think having tmin and tmax within tbreak isn't a good idea since it would mean doubly-defined arguments, since tbreak is optional. I added an implementation and an example in 030c381

dweisz commented 8 years ago

yes, sorry I forgot that tmin and tmax were already defined. Definitely don’t want to doubly define it.

Getting back to the main issue of a cleaner API and interface — at minimum, I think we will eventually want to provide a few end-to-end examples (from determining completeness to plotting for a few common use cases). I have been promising this for a while, but just haven’t had time...

On Jun 7, 2016, at 10:19 AM, Phil Rosenfield notifications@github.com wrote:

I think having tmin and tmax within tbreak isn't a good idea since it would mean doubly-defined arguments, since tbreak is optional. I added an implementation and an example in de12a80 https://github.com/dweisz/match/commit/de12a80354c20d7e7caed24b4ef12057d831b6c4 — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dweisz/match/issues/11#issuecomment-224350894, or mute the thread https://github.com/notifications/unsubscribe/ABKnY_0j44WQG1fLR1QrB3xipp8tR271ks5qJagYgaJpZM4Ivq41.

philrosenfield commented 7 years ago

Here is my rough idea for a plan. I am open to suggestions.

Split core utilities from classes and functions for easier access (already started with scripts.utils.marg and scripts.utils.marg2d and their use in scripts.ssp.SSP)
Functions use either pd.DataFrame or np.recarray and not, for example classes defined here. -
Streamline scripts directory and main directory
Refactor:
- easy access to reading and modifying match input and output files
- plotting input and output files (as diagnostics)
- doing statistical analysis
Re-format doc strings for sphinx or some other doc parser.

mfouesneau commented 7 years ago

My 2cents comment: if you do not need pandas specific attributes or calculations, make everything works with dictionaries. Recarray or DataFrame or xarray or similar will work without pain and you have no external dependency.

philrosenfield commented 7 years ago

I agree with limiting required external dependencies.

I started using pandas to add columns to an array. It's trivial and fast. That's the only reason, but I can imagine using more of their file writing functions.

For example, in match/scripts/fileio.py line 78, Say I find _bf0.35_ in the filename and want bf as a column, and 0.35 for each row added to the calcsfh output array:

    new_stuff = filename_data(fname)
    # print('adding columns: {}'.format(new_stuff.keys()))
    for name, val in new_stuff.items():
        df[name] = val

philrosenfield commented 7 years ago

Wait -- did I misunderstand? DataFrame is accessible outside of pandas?

mfouesneau commented 7 years ago

You can look at my simpletable.py file in the membership example. it wraps ndarray to offer dict access and reading/writing. (not advocating we should use/impose it.)

DataFrame or DataFrame like objects are also in Dask and Xarray. Mostly because of the many operations it offers (some faster than numpy). But the common ground is using named columns. So that a dictionary is the simplest, built-in common interface. if your function works with dict it works with any.

philrosenfield / match

Clean API Core needed #11