sage-home / sage-analysis

Package to ingest, analyse, and plot all the SAGE data products
MIT License
2 stars 5 forks source link

A few thoughts on package structure and usage #11

Open 1313e opened 4 years ago

1313e commented 4 years ago

In order to get SAGE analyzed using PRISM, I have started writing a SAGELink wrapper class that PRISM can use for that. As I therefore require to be able to process the output from SAGE, in this case calculating the SMF, I would like to use sage-analysis to do exactly that. However, I find myself getting incredibly frustrated with using it, mostly due to the different classes in the package being very disconnected from each other. I would therefore like to share my thoughts on how this can be improved (note that they apply to both the master and the dev branch).

Before I do so, I would like to share what exactly the situation is:

According to the documentation, I require two things: An instance of the Model class (which stores all model information); and an instance of a subclass of the DataClass class (which handles the model data), in my case the SageHdf5Data class. So, I first attempt to initialize the Model class. Surprisingly enough, on the master branch, the Model class takes no mandatory arguments, even though it cannot be used if it is not connected to output of SAGE. After all, it cannot run SAGE itself, so it must be connected to output. On the dev branch on the other hand, the Model class takes many arguments that are not required in any way, as they are saved in the SAGE output already. Therefore, the Model class should only take 1 argument: The path to the master HDF5-file with the SAGE output. Everything else is already known at that point once that file is read.

This brings me to the second thing I noticed: No data is read. If I attempt to initialize the SageHdf5Data class, using an instance of Model (this should actually be done automatically upon me providing the path to the output file), nothing actually happens. The SageHdf5Data does not read in any data or attributes stored in the HDF5-file at all. This is strange, as the Model class requires these attributes. The only thing that happens is that the open HDF5-file is stored in Model.hdf5_file, but nothing is read from it.

Additionally, it is strange that the Model instance that I provide to SageHdf5Data does not get stored as an attribute in the instance (and vice-versa). This means that any time that the latter instance required the former, I have to provide it. Even though it knows what instance should be used, as it was provided during initialization.

This becomes a problem when I try to use calc_SMF (https://github.com/sage-home/sage-analysis/blob/master/sage_analysis/example_calcs.py#L24-L60). First of all, this function should be a method of Model. Not only because this function requires an instance of it, but also because it modifies attributes in the Model instance. The latter is only allowed by methods of that instance. Never by a stand-alone function. In case of a stand-alone function, the results should simply be returned instead of saved in the instance.

Secondly, the function requires the gals argument, which can be obtained from DataClass.read_gals, which again requires the Model instance. Why does the calc_SMF function not simply ask for the snapshot the SMF should be calculated for? If Model contains a pointer to the instance of DataClass (and vice-versa), then all required information is there.

However, even if I provide all this information, calc_SMF still cannot be executed. After all, the Model instance does not have the hubble_h property set, as no attributes of the SAGE model were read in automatically anywhere. This is also weird, as instance properties should ALWAYS be set. Either to a default value or to a value that was determined somewhere. Getting an instance property should never raise an error, in any situation.


Unless I have made major errors in my process described above, this is not acceptable. A user would require too much knowledge and effort to perform simple tasks, that the programmer can do for them already. In case of sage-analysis, the biggest problem is that the Model and DataClass classes are not connected to each other in any way, even though they require each other in order to function. Besides that, a user would expect that when a class is initialized that serves to handle the data stored in a specific file, that it then also reads that data whenever necessary. It should not be necessary to read that data separately, and also assign it separately, as that completely negates the purpose of said class.

What I recommend doing is going over all the definitions in the entire package, and check everywhere whether it makes sense that user input is required for something. A user should ideally never have to provide any information more than once, directly or indirectly. Making sure that it works that way will significantly improve the user experience.

Let me know if there are any questions that I need to answer, or if more information is required.

jacobseiler commented 4 years ago

Hey @1313e thanks for your thoughts and insights.

I think there's 2 things that should be discussed here.

  1. Resolving how to do what you want (your "situation" dot points),
  2. Comments/thoughts on the package structure

For number 2, I am still processing your comments and thinking which aspects I agree/disagree on. I will make a more in-depth reply or code modifications in a future update. It should be noted that majority of the package was written in 2019 during my thesis write-up, with major rewrites occurring in December last year and again this year in April-ish, with me having pretty much not using the package at all in the meantime. As a result, its not surprising that there is a lot of disconnect and conflicting ideologies as my coding style/priorities changed.


For number 1, I have recently just merged into Master a very big update to the API format. In particular, there is now a more simplified access flow. Your scenario, where you want to compute the stellar mass function, is covered in the documentation. Please let me know if it works/does not work for what you want.

manodeep commented 4 years ago

Fantastic @jacobseiler. And thanks to you @1313e for being the first "external" user to try out sage_analysis. The updates that are feasible and in-scope will make the package much more user-friendly.

@jacobseiler Separately, now that we can run sage from python - what do you think about making the sage_analysis package a sub-package within a sage-model python package?

1313e commented 4 years ago

Someone's got to be the first one, right...?

1313e commented 4 years ago

Alright, I can finally get back to this. Below are some comments while I am working through this part of the documentation.

manodeep commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

1313e commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

No, there is not.

manodeep commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

No, there is not.

Does that mean you could successfully run sage with PRISM?

1313e commented 4 years ago

@1313e Looking through your comments, I see that most of them pertain to software design and usability. Other than the number-density comment, is there anything that is preventing you from using thesage_analysis package to read in sage output files (for use with PRISM)?

No, there is not.

Does that mean you could successfully run sage with PRISM?

No, as the number density comment has not been dealt with, right?

manodeep commented 4 years ago

@1313e Ahh - yes, of course!

@jacobseiler What's the recommended way to compute the co-moving number-density within sage_analysis?

jacobseiler commented 4 years ago

To compute the number density of galaxies divide the number of galaxies by the simulation volume * the bin widths.

In particular, given a specific model output,


bin_widths = model.bins["stellar_mass_bins"][1::] - model.bins["stellar_mass_bins"][0:-1]
normalization_factor = model._volume / pow(model.hubble_h, 3) * bin_widths  # Assumes Simulation was specified in Mpc/h

for snapshot in snapshots:
            norm_SMF = model.properties[f"snapshot_{snapshot}"]["SMF"] / normalization_factor
1313e commented 3 years ago

Now that OzSTAR is usable again, I can keep going with this. However, most of the comments above are usability comments, and if they would be taken into account, I will have to rewrite my code again as well.

1313e commented 3 years ago

Oh, btw, can I turn off the printing of all the information to screen somewhere? I don't think a user wants to see that a few 1000 times.

1313e commented 3 years ago

Is there any comparison SMF data that I can use to constrain SAGE with? Of course, the data must have matching logM bins as the output of SAGE.

jacobseiler commented 3 years ago

SAGE compares against the Baldry 2008 paper.

    Baldry = np.array([
        [7.05, 1.3531e-01, 6.0741e-02],
        [7.15, 1.3474e-01, 6.0109e-02],
        [7.25, 2.0971e-01, 7.7965e-02],
        [7.35, 1.7161e-01, 3.1841e-02],
        [7.45, 2.1648e-01, 5.7832e-02],
        [7.55, 2.1645e-01, 3.9988e-02],
        [7.65, 2.0837e-01, 4.8713e-02],
        [7.75, 2.0402e-01, 7.0061e-02],
        [7.85, 1.5536e-01, 3.9182e-02],
        [7.95, 1.5232e-01, 2.6824e-02],
        [8.05, 1.5067e-01, 4.8824e-02],
        [8.15, 1.3032e-01, 2.1892e-02],
        [8.25, 1.2545e-01, 3.5526e-02],
        [8.35, 9.8472e-02, 2.7181e-02],
        [8.45, 8.7194e-02, 2.8345e-02],
        [8.55, 7.0758e-02, 2.0808e-02],
        [8.65, 5.8190e-02, 1.3359e-02],
        [8.75, 5.6057e-02, 1.3512e-02],
        [8.85, 5.1380e-02, 1.2815e-02],
        [8.95, 4.4206e-02, 9.6866e-03],
        [9.05, 4.1149e-02, 1.0169e-02],
        [9.15, 3.4959e-02, 6.7898e-03],
        [9.25, 3.3111e-02, 8.3704e-03],
        [9.35, 3.0138e-02, 4.7741e-03],
        [9.45, 2.6692e-02, 5.5029e-03],
        [9.55, 2.4656e-02, 4.4359e-03],
        [9.65, 2.2885e-02, 3.7915e-03],
        [9.75, 2.1849e-02, 3.9812e-03],
        [9.85, 2.0383e-02, 3.2930e-03],
        [9.95, 1.9929e-02, 2.9370e-03],
        [10.05, 1.8865e-02, 2.4624e-03],
        [10.15, 1.8136e-02, 2.5208e-03],
        [10.25, 1.7657e-02, 2.4217e-03],
        [10.35, 1.6616e-02, 2.2784e-03],
        [10.45, 1.6114e-02, 2.1783e-03],
        [10.55, 1.4366e-02, 1.8819e-03],
        [10.65, 1.2588e-02, 1.8249e-03],
        [10.75, 1.1372e-02, 1.4436e-03],
        [10.85, 9.1213e-03, 1.5816e-03],
        [10.95, 6.1125e-03, 9.6735e-04],
        [11.05, 4.3923e-03, 9.6254e-04],
        [11.15, 2.5463e-03, 5.0038e-04],
        [11.25, 1.4298e-03, 4.2816e-04],
        [11.35, 6.4867e-04, 1.6439e-04],
        [11.45, 2.8294e-04, 9.9799e-05],
        [11.55, 1.0617e-04, 4.9085e-05],
        [11.65, 3.2702e-05, 2.4546e-05],
        [11.75, 1.2571e-05, 1.2571e-05],
        [11.85, 8.4589e-06, 8.4589e-06],
        [11.95, 7.4764e-06, 7.4764e-06],
        ], dtype=np.float32)

    Baldry_xval = np.log10(10 ** Baldry[:, 0] / hubble_h / hubble_h)
    if imf == "Chabrier":
        # Convert the Baldry data to Chabrier.
        Baldry_xval = Baldry_xval - 0.26

    Baldry_yvalU = (Baldry[:, 1]+Baldry[:, 2]) * pow(hubble_h, 3)
    Baldry_yvalL = (Baldry[:, 1]-Baldry[:, 2]) * pow(hubble_h, 3)

Baldry_xval is the log10 Msun value and yvalU/yvalL is the upper/lower bounds in units of Mpc^-3 dex^-1.

1313e commented 3 years ago

Those values are incompatible with SAGE, as SAGE does not calculate the SMF at these logM values.

darrencroton commented 3 years ago

Just to clarify ... The SMF needs to be calculated from the data that SAGE generates. This is true for all models (and astronomy data in general). The SMF is not a property of the galaxies.

1313e commented 3 years ago

Just to clarify ... The SMF needs to be calculated from the data that SAGE generates. This is true for all models (and astronomy data in general). The SMF is not a property of the galaxies.

Yeah, I know. But SAGE does not calculate the SMF values at these logM values, therefore making it incompatible for comparison with the given data.

darrencroton commented 3 years ago

I don't understand what you mean. Stellar mass is a galaxy property in SAGE with units 10^10 h-1Msun. That can be converted to the same logM as the Baldry data (i.e. multiply by 10^10, divide by little h, take the log10). Baldry is given that way because it's how the SMF is usually plotted. Does that answer your question?

jacobseiler commented 3 years ago

I think the confusion may be that we're using two definitions of SAGE interchangeably,

Currently, SAGE-analysis is computing the SMF in log10 Msun bins from [8.0, 12.0] in widths of 0.1. Wheras Baldry Is from [7.05, 11.95] in widths of 0.1. I belive Ellert is saying that because these bins are different, it is not valid to compare the SMF produced by default SAGE-analysis and that of Baldry 2008.

1313e commented 3 years ago

Uhm, yes, that is what I am saying. I could not find anywhere where I can set what bins to calculate the SMF for.

jacobseiler commented 3 years ago

Try this. Change the bin_low, bin_high, bin_width values and point to the correct parameter file. It should compute only the stellar mass function in the galaxy_analysis.models[0].properties["SMF"] attribute.

(Yes I am very aware that this is hacky, ill-thought out, badly constructed etc etc)


from sage_analysis.galaxy_analysis import GalaxyAnalysis
from sage_analysis.default_analysis_arguments import default_galaxy_properties_to_analyze

par_fnames = ["/home/Desktop/sage-model/input/millennium.ini"]

properties_to_analyze = default_galaxy_properties_to_analyze.copy()
properties_to_analyze["stellar_mass_bins"] =  {
    "type": "binned",
    "bin_low": 8.0,
    "bin_high": 12.0,
    "bin_width": 0.1,
    "property_names": ["SMF"],
},

galaxy_analysis = GalaxyAnalysis(par_fnames, properties_to_analyze=properties_to_analyze, plot_toggles={"SMF": True})
galaxy_analysis.analyze_galaxies()
1313e commented 3 years ago

properties_to_analyze is not a valid keyword argument of GalaxyAnalysis. I think you mean galaxy_properties_to_analyze. Which could be shortened to just analyze_props.

1313e commented 3 years ago

Is there a way to disable all printing output from both SAGE and sage_analysis?

manodeep commented 3 years ago

@1313e Are you using the cffi branch of sage?

1313e commented 3 years ago

@1313e Are you using the cffi branch of sage?

Yes

manodeep commented 3 years ago

I have an easy but hacky way. After you pull in the latest commit on the cffi branch, add the following line

freopen("/dev/null, "w", stdout);

at the top of the function init_sage and then add this line

freopen("/dev/tty", "w", stdout);

at the end of finalize_sage. That should do the trick by redirecting all standard output into the blackhole of /dev/null and then restoring after sage is done.

(Correctly solving what you are requesting is a bit more involved)