A few thoughts on package structure and usage

1313e commented 4 years ago

In order to get SAGE analyzed using PRISM, I have started writing a SAGELink wrapper class that PRISM can use for that. As I therefore require to be able to process the output from SAGE, in this case calculating the SMF, I would like to use sage-analysis to do exactly that. However, I find myself getting incredibly frustrated with using it, mostly due to the different classes in the package being very disconnected from each other. I would therefore like to share my thoughts on how this can be improved (note that they apply to both the master and the dev branch).

Before I do so, I would like to share what exactly the situation is:

I have evaluated SAGE once using a single core and a single output snapshot (63);
The results from SAGE were written to HDF5;
I would like to obtain the SMF at snapshot 63.

According to the documentation, I require two things: An instance of the Model class (which stores all model information); and an instance of a subclass of the DataClass class (which handles the model data), in my case the SageHdf5Data class. So, I first attempt to initialize the Model class. Surprisingly enough, on the master branch, the Model class takes no mandatory arguments, even though it cannot be used if it is not connected to output of SAGE. After all, it cannot run SAGE itself, so it must be connected to output. On the dev branch on the other hand, the Model class takes many arguments that are not required in any way, as they are saved in the SAGE output already. Therefore, the Model class should only take 1 argument: The path to the master HDF5-file with the SAGE output. Everything else is already known at that point once that file is read.

This brings me to the second thing I noticed: No data is read. If I attempt to initialize the SageHdf5Data class, using an instance of Model (this should actually be done automatically upon me providing the path to the output file), nothing actually happens. The SageHdf5Data does not read in any data or attributes stored in the HDF5-file at all. This is strange, as the Model class requires these attributes. The only thing that happens is that the open HDF5-file is stored in Model.hdf5_file, but nothing is read from it.

Additionally, it is strange that the Model instance that I provide to SageHdf5Data does not get stored as an attribute in the instance (and vice-versa). This means that any time that the latter instance required the former, I have to provide it. Even though it knows what instance should be used, as it was provided during initialization.

This becomes a problem when I try to use calc_SMF (https://github.com/sage-home/sage-analysis/blob/master/sage_analysis/example_calcs.py#L24-L60). First of all, this function should be a method of Model. Not only because this function requires an instance of it, but also because it modifies attributes in the Model instance. The latter is only allowed by methods of that instance. Never by a stand-alone function. In case of a stand-alone function, the results should simply be returned instead of saved in the instance.

Secondly, the function requires the gals argument, which can be obtained from DataClass.read_gals, which again requires the Model instance. Why does the calc_SMF function not simply ask for the snapshot the SMF should be calculated for? If Model contains a pointer to the instance of DataClass (and vice-versa), then all required information is there.

However, even if I provide all this information, calc_SMF still cannot be executed. After all, the Model instance does not have the hubble_h property set, as no attributes of the SAGE model were read in automatically anywhere. This is also weird, as instance properties should ALWAYS be set. Either to a default value or to a value that was determined somewhere. Getting an instance property should never raise an error, in any situation.

Unless I have made major errors in my process described above, this is not acceptable. A user would require too much knowledge and effort to perform simple tasks, that the programmer can do for them already. In case of sage-analysis, the biggest problem is that the Model and DataClass classes are not connected to each other in any way, even though they require each other in order to function. Besides that, a user would expect that when a class is initialized that serves to handle the data stored in a specific file, that it then also reads that data whenever necessary. It should not be necessary to read that data separately, and also assign it separately, as that completely negates the purpose of said class.

What I recommend doing is going over all the definitions in the entire package, and check everywhere whether it makes sense that user input is required for something. A user should ideally never have to provide any information more than once, directly or indirectly. Making sure that it works that way will significantly improve the user experience.

Let me know if there are any questions that I need to answer, or if more information is required.

jacobseiler commented 4 years ago

Hey @1313e thanks for your thoughts and insights.

I think there's 2 things that should be discussed here.

Resolving how to do what you want (your "situation" dot points),
Comments/thoughts on the package structure

For number 2, I am still processing your comments and thinking which aspects I agree/disagree on. I will make a more in-depth reply or code modifications in a future update. It should be noted that majority of the package was written in 2019 during my thesis write-up, with major rewrites occurring in December last year and again this year in April-ish, with me having pretty much not using the package at all in the meantime. As a result, its not surprising that there is a lot of disconnect and conflicting ideologies as my coding style/priorities changed.

For number 1, I have recently just merged into Master a very big update to the API format. In particular, there is now a more simplified access flow. Your scenario, where you want to compute the stellar mass function, is covered in the documentation. Please let me know if it works/does not work for what you want.

manodeep commented 4 years ago

Fantastic @jacobseiler. And thanks to you @1313e for being the first "external" user to try out sage_analysis. The updates that are feasible and in-scope will make the package much more user-friendly.

@jacobseiler Separately, now that we can run sage from python - what do you think about making the sage_analysis package a sub-package within a sage-model python package?

1313e commented 4 years ago

Someone's got to be the first one, right...?

1313e commented 4 years ago

Alright, I can finally get back to this. Below are some comments while I am working through this part of the documentation.

The typehinting makes it incredibly hard to read the documentation properly (for example, the docs of GalaxyAnalysis);
Something that I mentioned through Slack as well: It is confusing that I have to provide the SAGE parameter file instead of the HDF5-file where the data is stored in. After all, the HDF5-file contains all information of the parameter file, making the latter pointless. One way to solve this is to designate a directory that belongs to SAGE; no one is allowed to touch anything inside that directory. For HDF5, the parameters are stored in the file. For binary, SAGE copies the parameter file to that directory as well using a standardized name. When analyzing the output, the user then provides the path to the directory. This would also make it easier to store plots, as you can store them conveniently in the directory as well.
After initializing GalaxyAnalysis, why is analyze_galaxies not executed automatically? Can the GalaxyAnalysis object be useful in any way before that method is executed?
Why is Model.properties a dict containing the keys snapshot_xx with xx being a snapshot number? If it solely contains snapshots, it is much more logical if I can index this dict using the snapshot number.
Speaking of this dict, why does it contain huge empty dicts/lists/arrays/etc. for every snapshot that SAGE has, even if nothing will ever be read into them? If analyze_galaxies is executed automatically upon initializing the class, the program knows what snapshots are available and thus making all these empty containers is not necessary.
The plot_toggles argument of GalaxyAnalysis is misleading. If I toggle something off, then that should only toggle the plotting of that property, not the analysis. Currently, there is no way to toggle off the plotting of a property, but still read it in. This is made even more confusing by the presence of the galaxy_properties_to_analyze argument, which suggests that it toggles the properties to analyze.
Why is the plot_toggles argument a dict of bool? Just a list of str would be much easier.
The online documentation does not link to anywhere how the galaxy_properties_to_analyze argument works, as that one definitely requires some explanations. I guess this kind of does the job, but no link to it is given.
If you initialize an empty list with values, don't use values that hold a meaning for that property (like zeros). Use something like a None instead, as that holds no meaning at all (e.g., a zero for the SMF is perfectly possible, a None is not. The latter would inform the user immediately that the data is not available.)
The SMF property returns the actual number of galaxies per mass bin in the simulation, instead of the simulation-independent number density. How can one obtain the actual proper number density?

manodeep commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

1313e commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

No, there is not.

manodeep commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

No, there is not.

Does that mean you could successfully run sage with PRISM?

1313e commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

No, there is not.

Does that mean you could successfully run sage with PRISM?

No, as the number density comment has not been dealt with, right?

manodeep commented 4 years ago

@1313e Ahh - yes, of course!

@jacobseiler What's the recommended way to compute the co-moving number-density within sage_analysis?

jacobseiler commented 4 years ago

To compute the number density of galaxies divide the number of galaxies by the simulation volume * the bin widths.

In particular, given a specific model output,


bin_widths = model.bins["stellar_mass_bins"][1::] - model.bins["stellar_mass_bins"][0:-1]
normalization_factor = model._volume / pow(model.hubble_h, 3) * bin_widths  # Assumes Simulation was specified in Mpc/h

for snapshot in snapshots:
            norm_SMF = model.properties[f"snapshot_{snapshot}"]["SMF"] / normalization_factor

1313e commented 3 years ago

Now that OzSTAR is usable again, I can keep going with this. However, most of the comments above are usability comments, and if they would be taken into account, I will have to rewrite my code again as well.

1313e commented 3 years ago

Oh, btw, can I turn off the printing of all the information to screen somewhere? I don't think a user wants to see that a few 1000 times.

1313e commented 3 years ago

Is there any comparison SMF data that I can use to constrain SAGE with? Of course, the data must have matching logM bins as the output of SAGE.

jacobseiler commented 3 years ago

SAGE compares against the Baldry 2008 paper.

    Baldry = np.array([
        [7.05, 1.3531e-01, 6.0741e-02],
        [7.15, 1.3474e-01, 6.0109e-02],
        [7.25, 2.0971e-01, 7.7965e-02],
        [7.35, 1.7161e-01, 3.1841e-02],
        [7.45, 2.1648e-01, 5.7832e-02],
        [7.55, 2.1645e-01, 3.9988e-02],
        [7.65, 2.0837e-01, 4.8713e-02],
        [7.75, 2.0402e-01, 7.0061e-02],
        [7.85, 1.5536e-01, 3.9182e-02],
        [7.95, 1.5232e-01, 2.6824e-02],
        [8.05, 1.5067e-01, 4.8824e-02],
        [8.15, 1.3032e-01, 2.1892e-02],
        [8.25, 1.2545e-01, 3.5526e-02],
        [8.35, 9.8472e-02, 2.7181e-02],
        [8.45, 8.7194e-02, 2.8345e-02],
        [8.55, 7.0758e-02, 2.0808e-02],
        [8.65, 5.8190e-02, 1.3359e-02],
        [8.75, 5.6057e-02, 1.3512e-02],
        [8.85, 5.1380e-02, 1.2815e-02],
        [8.95, 4.4206e-02, 9.6866e-03],
        [9.05, 4.1149e-02, 1.0169e-02],
        [9.15, 3.4959e-02, 6.7898e-03],
        [9.25, 3.3111e-02, 8.3704e-03],
        [9.35, 3.0138e-02, 4.7741e-03],
        [9.45, 2.6692e-02, 5.5029e-03],
        [9.55, 2.4656e-02, 4.4359e-03],
        [9.65, 2.2885e-02, 3.7915e-03],
        [9.75, 2.1849e-02, 3.9812e-03],
        [9.85, 2.0383e-02, 3.2930e-03],
        [9.95, 1.9929e-02, 2.9370e-03],
        [10.05, 1.8865e-02, 2.4624e-03],
        [10.15, 1.8136e-02, 2.5208e-03],
        [10.25, 1.7657e-02, 2.4217e-03],
        [10.35, 1.6616e-02, 2.2784e-03],
        [10.45, 1.6114e-02, 2.1783e-03],
        [10.55, 1.4366e-02, 1.8819e-03],
        [10.65, 1.2588e-02, 1.8249e-03],
        [10.75, 1.1372e-02, 1.4436e-03],
        [10.85, 9.1213e-03, 1.5816e-03],
        [10.95, 6.1125e-03, 9.6735e-04],
        [11.05, 4.3923e-03, 9.6254e-04],
        [11.15, 2.5463e-03, 5.0038e-04],
        [11.25, 1.4298e-03, 4.2816e-04],
        [11.35, 6.4867e-04, 1.6439e-04],
        [11.45, 2.8294e-04, 9.9799e-05],
        [11.55, 1.0617e-04, 4.9085e-05],
        [11.65, 3.2702e-05, 2.4546e-05],
        [11.75, 1.2571e-05, 1.2571e-05],
        [11.85, 8.4589e-06, 8.4589e-06],
        [11.95, 7.4764e-06, 7.4764e-06],
        ], dtype=np.float32)

    Baldry_xval = np.log10(10 ** Baldry[:, 0] / hubble_h / hubble_h)
    if imf == "Chabrier":
        # Convert the Baldry data to Chabrier.
        Baldry_xval = Baldry_xval - 0.26

    Baldry_yvalU = (Baldry[:, 1]+Baldry[:, 2]) * pow(hubble_h, 3)
    Baldry_yvalL = (Baldry[:, 1]-Baldry[:, 2]) * pow(hubble_h, 3)

Baldry_xval is the log10 Msun value and yvalU/yvalL is the upper/lower bounds in units of Mpc^-3 dex^-1.

1313e commented 3 years ago

Those values are incompatible with SAGE, as SAGE does not calculate the SMF at these logM values.

darrencroton commented 3 years ago

Just to clarify ... The SMF needs to be calculated from the data that SAGE generates. This is true for all models (and astronomy data in general). The SMF is not a property of the galaxies.

1313e commented 3 years ago

Just to clarify ... The SMF needs to be calculated from the data that SAGE generates. This is true for all models (and astronomy data in general). The SMF is not a property of the galaxies.

Yeah, I know. But SAGE does not calculate the SMF values at these logM values, therefore making it incompatible for comparison with the given data.

darrencroton commented 3 years ago

I don't understand what you mean. Stellar mass is a galaxy property in SAGE with units 10^10 h-1Msun. That can be converted to the same logM as the Baldry data (i.e. multiply by 10^10, divide by little h, take the log10). Baldry is given that way because it's how the SMF is usually plotted. Does that answer your question?

jacobseiler commented 3 years ago

I think the confusion may be that we're using two definitions of SAGE interchangeably,

SAGE: the semi-analytic model that does the galaxy evolution
SAGE-analysis: the code that reads the output and does the computation of the SMF.

Currently, SAGE-analysis is computing the SMF in log10 Msun bins from [8.0, 12.0] in widths of 0.1. Wheras Baldry Is from [7.05, 11.95] in widths of 0.1. I belive Ellert is saying that because these bins are different, it is not valid to compare the SMF produced by default SAGE-analysis and that of Baldry 2008.

1313e commented 3 years ago

Uhm, yes, that is what I am saying. I could not find anywhere where I can set what bins to calculate the SMF for.

jacobseiler commented 3 years ago

Try this. Change the bin_low, bin_high, bin_width values and point to the correct parameter file. It should compute only the stellar mass function in the galaxy_analysis.models[0].properties["SMF"] attribute.

(Yes I am very aware that this is hacky, ill-thought out, badly constructed etc etc)


from sage_analysis.galaxy_analysis import GalaxyAnalysis
from sage_analysis.default_analysis_arguments import default_galaxy_properties_to_analyze

par_fnames = ["/home/Desktop/sage-model/input/millennium.ini"]

properties_to_analyze = default_galaxy_properties_to_analyze.copy()
properties_to_analyze["stellar_mass_bins"] =  {
    "type": "binned",
    "bin_low": 8.0,
    "bin_high": 12.0,
    "bin_width": 0.1,
    "property_names": ["SMF"],
},

galaxy_analysis = GalaxyAnalysis(par_fnames, properties_to_analyze=properties_to_analyze, plot_toggles={"SMF": True})
galaxy_analysis.analyze_galaxies()

1313e commented 3 years ago

properties_to_analyze is not a valid keyword argument of GalaxyAnalysis. I think you mean galaxy_properties_to_analyze. Which could be shortened to just analyze_props.

1313e commented 3 years ago

Is there a way to disable all printing output from both SAGE and sage_analysis?

manodeep commented 3 years ago

@1313e Are you using the cffi branch of sage?

1313e commented 3 years ago

@1313e Are you using the cffi branch of sage?

Yes

manodeep commented 3 years ago

I have an easy but hacky way. After you pull in the latest commit on the cffi branch, add the following line

freopen("/dev/null, "w", stdout);

at the top of the function init_sage and then add this line

freopen("/dev/tty", "w", stdout);

at the end of finalize_sage. That should do the trick by redirecting all standard output into the blackhole of /dev/null and then restoring after sage is done.

(Correctly solving what you are requesting is a bit more involved)

sage-home / sage-analysis

A few thoughts on package structure and usage #11