#8 Code review of master + finishing unit tests

msimet commented 10 years ago

Hello all,

Here's a PR for issue #8, which was intended to finish up the unfinished unit test suite, as well as act as a place for people to review the code on master that hasn't already been reviewed. As part of getting the unit test suite working, in addition to some other concerns driven by branches #7 and #14, there's actually a lot of updated code in here, so going through this PR may take some time.

Also, since this is intended to be a code review of master, you might not want to use the "files changed" interface as usual, but instead look at the whole code as of the latest commit--right now that should be this link. The main code is in the stile/ directory, tests in tests/, and example code in examples/. Or just pull branch #8. That'll be a little annoying for line-by-line comments, I'm afraid, though I suspect most of the code is going to show up in the "files changed" pane anyway since there were a lot of small tweaks.

A tour of the code in stile/ as-is, roughly in order of importance/amount of review needed:

sys_tests.py - the main systematics test code file. Right now this has some correlation function code that operates via corr2 (CorrelationFunctionSysTest and children), as well as a Stats sys test (StatSysTest) that was part of branch #6/PR #15--so that's already been reviewed.
corr2_utils.py - all the things that CorrelationFunctionSysTest needs in order to talk to corr2. This includes some validation of the parameters to write to a config file (CheckArguments); some functions to write config files for corr2 (WriteCorr2ConfigurationFile) and read the results of a run of corr2 (ReadCorr2OutputFile); to harvest the corr2-related arguments from a dict containing other parameters as well (AddCorr2Dict--see caveats below); and figure out which data sets need to be written to files on disk and make sure they all have the same columns corresponding to ra/dec/g1/g2/etc (MakeCorr2FileKwargs), plus a bunch of helper functions and variables to make those things easier.
file_io.py - wrapper functions for reading/writing ASCII and FITS files in various ways.
stile_utils.py - other utility functions. The Stats class was already checked in #6/#15, so the other main block of code to look at is the FormatArray function, which turns arrays that are not NumPy formatted arrays into formatted arrays (i.e., things that let you ask for arr['ra'] to get all the RA values as well as arr[0] to get all the data for object 0). These are not formally numpy.recarrays, but they're similar.
binning.py - defines some Bin* objects that set up a binning scheme and, when called, return a list of SingleBin objects. Each SingleBin holds the bounds of one bin and can be called on a data array to return only the data within that bin.
data_handler.py - this is mostly abstract, but defines the kind of interface we'd want for non-specialized (i.e., non-LSST/HSC) users of Stile to get their data into the program.
__init__.py - loads stuff into the stile namespace.
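
For readers who haven't worked with formatted (structured) arrays, here is a minimal sketch of the kind of conversion described for FormatArray. The function name `format_array`, its signature, and the `fields` mapping are assumptions made for illustration; the real FormatArray in stile_utils.py may differ.

```python
import numpy

def format_array(data, fields):
    """Hypothetical stand-in for FormatArray (not the real Stile signature).

    data:   2-D array with one row per object
    fields: dict mapping field name -> column index
    """
    data = numpy.asarray(data)
    # Order the field names by column index so field order matches column order.
    names = sorted(fields, key=fields.get)
    dtype = [(name, data.dtype) for name in names]
    out = numpy.empty(data.shape[0], dtype=dtype)
    for name in names:
        out[name] = data[:, fields[name]]
    return out

raw = numpy.array([[10.0, -5.0, 0.01],
                   [20.0, 5.0, -0.02]])
arr = format_array(raw, {'ra': 0, 'dec': 1, 'g1': 2})

print(arr['ra'])  # all RA values
print(arr[0])     # all the data for object 0
```

The sketch keeps field order equal to column order, which matches the 1:1 assumption some of the code is said to rely on.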

In addition to those changes, per @HironaoMiyatake's suggestion on #2, everything in tests/ (including the new stuff) has been ported over to the unittest framework. We still need to use the numpy.testing functions for arrays--numpy.array_equal seems to not work in quite the same way as numpy.testing.assert_array_equal--which means that failing tests will produce a combination of Errors (from numpy.testing) and Failures (from unittest). Still, it seems to have some nice summary functionality that doesn't require external packages. For those who haven't used unittest before, you can run all the tests in the directory at once via python -m unittest discover, or run the individual test files like normal Python scripts.
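
As a rough illustration of mixing unittest with numpy.testing, the sketch below shows one concrete way numpy.array_equal and numpy.testing.assert_array_equal differ: the former returns a bool and, by default, does not treat NaNs as equal, while the latter raises AssertionError on a mismatch and accepts NaNs in matching positions. The test class and test names are made up for this example and are not taken from tests/.

```python
import unittest
import numpy
import numpy.testing

class TestArrayComparison(unittest.TestCase):
    def test_array_equal_returns_bool(self):
        a = numpy.array([1.0, 2.0, 3.0])
        # numpy.array_equal only returns True/False; unittest must do the asserting.
        self.assertTrue(numpy.array_equal(a, a.copy()))

    def test_assert_array_equal_raises(self):
        a = numpy.array([1.0, 2.0, 3.0])
        b = numpy.array([1.0, 2.0, 4.0])
        numpy.testing.assert_array_equal(a, a.copy())  # passes silently
        with self.assertRaises(AssertionError):
            numpy.testing.assert_array_equal(a, b)

    def test_nan_handling_differs(self):
        a = numpy.array([1.0, numpy.nan])
        # By default array_equal does not treat NaN as equal to NaN...
        self.assertFalse(numpy.array_equal(a, a.copy()))
        # ...while assert_array_equal accepts NaNs in the same positions.
        numpy.testing.assert_array_equal(a, a.copy())

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestArrayComparison)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Saved as a file whose name starts with test, the class would also be picked up by python -m unittest discover.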

This also means all the testing files are going to show up in near-entirety in the changed files pane, for those who want line-by-line commenting.

Things I'm explicitly still unsure about:

Right now corr2_utils.AddCorr2Dict() takes a dict of parameters, pulls out the ones that can be used by corr2, and creates a new copy of the input dict with an added corr2_kwargs key that contains a dict with only the parameters useful for corr2. Better to have it just return the thing that's in new_dict['corr2_kwargs'] instead? I can't decide.
The SingleBin-type objects in binning.py used to return Boolean masks and now just return the already-indexed original array. (That is, you used to have to data = data[bin(data)] and now you just data = bin(data).) I can see use cases for both, but I think the second thing is simpler enough (and the gains you might get from and-ing the masks rather than masking sequentially small enough) to leave it this way.
I was pretty much coming up with names for things on the fly, so if it seems obscure/convoluted/just wrong, say something!
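
To make the AddCorr2Dict question above concrete, the two options can be sketched roughly as follows. The function names, the `CORR2_KEYS` set, and the example parameters are hypothetical; only `ra_units` is a corr2 parameter name that appears in this thread, and the real list of accepted options lives in corr2_utils.py.

```python
import copy

# Hypothetical subset of corr2 option names, for illustration only.
CORR2_KEYS = {'ra_units', 'dec_units', 'min_sep', 'max_sep', 'nbins'}

def add_corr2_dict(input_dict):
    """Behavior as described: a copy of the input plus a 'corr2_kwargs' entry."""
    new_dict = copy.deepcopy(input_dict)
    new_dict['corr2_kwargs'] = {k: v for k, v in input_dict.items()
                                if k in CORR2_KEYS}
    return new_dict

def get_corr2_kwargs(input_dict):
    """Alternative: return only the corr2-related parameters."""
    return {k: v for k, v in input_dict.items() if k in CORR2_KEYS}

params = {'ra_units': 'degrees', 'nbins': 20, 'output_dir': 'results'}

print(add_corr2_dict(params))    # full copy with a nested corr2_kwargs dict
print(get_corr2_kwargs(params))  # just the corr2 parameters
```

The first form keeps everything in one object to pass around; the second leaves the caller's dict alone and makes the return value easier to reason about.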

Other than that, happy for any thoughts anyone has as well.

HironaoMiyatake commented 10 years ago

The SingleBin-type objects in binning.py used to return Boolean masks and now just return the already-indexed original array. (That is, you used to have to data = data[bin(data)] and now you just data = bin(data).) I can see use cases for both, but I think the second thing is simpler enough (and the gains you might get from and-ing the masks rather than masking sequentially small enough) to leave it this way.

I agree that the second thing is simpler enough. I found that SingleFunctionBin keeps both features. If we decide to use only the second one, do we want to make it the same as SingleBin?

msimet commented 10 years ago

I found that SingleFunctionBin keeps both features. If we decide to use only the second one, do we want to make it the same as SingleBin?

SingleFunctionBin doesn't keep both features: it's always binned_data = SingleFunctionBin(data). The difference is the function that's passed as an argument to SingleFunctionBin: it can return either the mask of bools or an array of bin numbers, but the SingleFunctionBin itself always just returns the binned data.
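
A minimal sketch of that behavior, assuming a simplified stand-in class: the user-supplied function may return either a Boolean mask or an array of bin numbers, but calling the bin object always returns the selected data. The class name, constructor arguments, and the dtype check used to tell the two cases apart are assumptions for illustration, not the actual code in binning.py.

```python
import numpy

class SingleFunctionBinSketch:
    """Hypothetical stand-in for SingleFunctionBin."""
    def __init__(self, function, n):
        self.function = function  # returns a bool mask or an array of bin numbers
        self.n = n                # which bin this object represents

    def __call__(self, data):
        result = numpy.asarray(self.function(data))
        if result.dtype == bool:
            mask = result              # the function already produced a mask
        else:
            mask = (result == self.n)  # the function produced bin numbers
        return data[mask]              # always return the binned data itself

data = numpy.array([(0.5, 1.0), (1.5, 2.0), (2.5, 3.0)],
                   dtype=[('mag', float), ('g1', float)])

def mask_function(d):
    return d['mag'] > 1.0

def bin_number_function(d):
    return numpy.floor(d['mag']).astype(int)

bright = SingleFunctionBinSketch(mask_function, n=0)(data)
bin_one = SingleFunctionBinSketch(bin_number_function, n=1)(data)
print(bright['mag'])
print(bin_one['mag'])
```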

HironaoMiyatake commented 10 years ago

I see. I misunderstood!

msimet commented 10 years ago

And from a conversation @HironaoMiyatake and I had earlier today: at the moment, Stile uses formatted arrays that are not formally numpy.recarrays. I don't have a particular principled reason for this--it's just that I can make formatted numpy arrays, but always run into trouble trying to make recarrays. If anyone proposes a switch, I'm happy to either let you do it or figure out why myself.

For those who haven't used one or both of those things before, as far as I can tell, the main difference is that numpy.recarrays let you have multiple aliases for the same column, while formatted numpy arrays only allow one. There are some pieces of the code right now that assume there's a 1:1 correspondence between field name and field order, so we'd have to make sure that didn't break if we allow recarrays in.
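
For comparison, here is a small sketch of the two array types. In NumPy as documented today, the most visible difference is that a recarray additionally exposes fields as attributes (rec.ra), and a structured dtype can itself carry a "title" that acts as a second key for a field. Whether the NumPy version in use for this PR behaved identically is not established in this thread, so treat this as an illustration and not a statement about Stile.

```python
import numpy

# A formatted (structured) array: fields are accessed by name, rows by index.
structured = numpy.array([(10.0, -5.0), (20.0, 5.0)],
                         dtype=[('ra', float), ('dec', float)])
print(structured['ra'])
print(structured[0])

# The same data viewed as a recarray: attribute access works as well.
rec = structured.view(numpy.recarray)
print(rec.ra)
print(rec['ra'])

# A structured dtype with a title: 'right_ascension' is an alias for 'ra'.
aliased = numpy.zeros(2, dtype=[(('right_ascension', 'ra'), float),
                                ('dec', float)])
aliased['ra'] = [10.0, 20.0]
print(aliased['right_ascension'])
print(aliased.dtype.names)  # only the field names, not the titles
```

Code that relies on a 1:1 match between field name and field order can keep using dtype.names, which lists each field once, in order.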

msimet commented 10 years ago

That should read, "figure out how"

HironaoMiyatake commented 10 years ago

I do not have a strong opinion here. I thought it might confuse those who are familiar with numpy.recarray, but FormatArray is close enough, so there would be no problem.

rmandelb commented 10 years ago

Hi Melanie - All tests pass for me on coma (no surprise; I would guess you've checked there). I am having some issues on my Mac:

In test_corr2_utils.py, I get these errors (not failures):

ERROR: test_MakeCorr2FileKwargs (__main__.TestCorr2Utils)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_corr2_utils.py", line 266, in test_MakeCorr2FileKwargs
    result = stile.MakeCorr2FileKwargs(data)
  File "../stile/corr2_utils.py", line 813, in MakeCorr2FileKwargs
    new_data_list.append(OSFile(data_list,fields=fields))
  File "../stile/corr2_utils.py", line 586, in __init__
    file_io.WriteTable(self.file_name,self.data,fields=self.fields)
  File "../stile/file_io.py", line 228, in WriteTable
    WriteFITSTable(file_name,data_array,fields)
  File "../stile/file_io.py", line 194, in WriteFITSTable
    table.data = numpy.array(table.data).view(table._data_type)
AttributeError: 'BinTableHDU' object has no attribute '_data_type'

======================================================================
ERROR: test_OSFile (__main__.TestCorr2Utils)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_corr2_utils.py", line 227, in test_OSFile
    OSFile2 = stile.corr2_utils.OSFile(arr2)
  File "../stile/corr2_utils.py", line 586, in __init__
    file_io.WriteTable(self.file_name,self.data,fields=self.fields)
  File "../stile/file_io.py", line 228, in WriteTable
    WriteFITSTable(file_name,data_array,fields)
  File "../stile/file_io.py", line 194, in WriteFITSTable
    table.data = numpy.array(table.data).view(table._data_type)
AttributeError: 'BinTableHDU' object has no attribute '_data_type'

Similar problems show up in test_corr2_utils.py.
A smaller issue comes up in test_correlation_functions.py:

test_getCorrelationFunctionSysTest (test_correlation_functions.TestCorrelationFunctions) ... Error: Required parameter ra_units not found

But then the test passes, so I'm not sure what gives.

msimet commented 10 years ago

The second one's not an error--or rather it's an expected error: we send some wrong stuff to corr2 and make sure it fails (to check it's not caching earlier results basically); the error message you see is what corr2 prints before it exits, and the exit is caught by the testing script which expects it. I can try to redirect the output so you don't see that. (There's a note if you run test_corr2_utils.py directly, but not if you do the unittest discover thing.)
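
One way the output redirection could look, as a sketch only: run the external command through subprocess, capture its stdout and stderr so nothing reaches the terminal, and assert on the nonzero exit status. The command below is a stand-in child Python process that mimics the message and exit, not a real corr2 invocation, and the test name is invented.

```python
import subprocess
import sys
import unittest

# Stand-in for a corr2 call with a bad config: prints an error and exits nonzero.
FAILING_COMMAND = [
    sys.executable, '-c',
    "import sys; "
    "sys.stderr.write('Error: Required parameter ra_units not found\\n'); "
    "sys.exit(1)",
]

class TestExpectedFailure(unittest.TestCase):
    def test_bad_config_fails_quietly(self):
        # Capturing both streams keeps the expected error message off the terminal.
        completed = subprocess.run(FAILING_COMMAND,
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE)
        self.assertNotEqual(completed.returncode, 0)
        self.assertIn(b'ra_units', completed.stderr)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExpectedFailure)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Checking the captured text as well as the return code confirms the run failed for the intended reason.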

As for the _data_type error, that's related to pyfits. I'll have to think about that a little more--that line of code is a workaround required for table writing to work on the version of pyfits I was using on Coma. Can I ask which version of pyfits you've got on your Mac?

rmandelb commented 10 years ago

Pyfits v3.0.3.

If you want to just come over to my office sometime and we can try to sort it out directly instead of you sending suggestions that I have to try out and reply to, that's fine. :)

rmandelb commented 10 years ago

BTW, I noticed that the README is lacking info like required dependencies. I think it would be good to include those as part of this PR if you don't mind.

msimet commented 10 years ago

No, that sounds good--the list is pretty short, after all. :) I can make that edit too in a bit--it's just numpy, right, with the recommended dependencies of corr2 greater than whatever, matplotlib, pyfits/astropy?

rmandelb commented 10 years ago

Re: the dependency list - we should be clear just how much of the functionality depends on those optional dependencies. Actually is pyfits/astropy recommended or required? I don’t see how we can do very much without it.

msimet commented 10 years ago

Okay, I think I've addressed everything we talked about so far except the README, which I'm still putting together...

msimet commented 10 years ago

And I just pushed a quick first draft of a README too.

rmandelb commented 10 years ago

Looks good!

rmandelb commented 10 years ago

Hi Melanie - I just put some comments on bits of the code that you had mentioned would be good to check out. Do we need another person to go over the whole thing, or do you think we're okay with Hironao's and my comments?

msimet commented 10 years ago

Hmm. I guess I'd feel more comfortable if someone else went over it, but I'm not sure it's strictly necessary at this point.

I'll leave this open for now and plan to merge Thursday evening, unless somebody lets me know they're in the middle of a code review.

msimet commented 10 years ago

Oh--and I addressed all of your comments, I think.

rmandelb commented 10 years ago

The latest set of commits looks good to me.

Re: another person to do code review, I think that if you would like one then your best bet is to choose a victim and e-mail them directly to ask if they are able to do so on a reasonable timescale.
