openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
703 stars 36 forks source link

[REVIEW]: CosmicProfiles: A Python package for radial profiling of finitely sampled dark matter halos and galaxies #5008

Closed editorialbot closed 1 year ago

editorialbot commented 1 year ago

Submitting author: !--author-handle-->@tibordome<!--end-author-handle-- (Tibor Dome) Repository: https://github.com/tibordome/cosmic_profiles Branch with paper.md (empty if default branch): dev Version: v1.4.0 Editor: !--editor-->@adonath<!--end-editor-- Reviewers: @benediktdiemer, @phil-mansfield Archive: 10.5281/zenodo.7934897

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/cc4b94fdf39b15f91d497c0b50b459e0"><img src="https://joss.theoj.org/papers/cc4b94fdf39b15f91d497c0b50b459e0/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/cc4b94fdf39b15f91d497c0b50b459e0/status.svg)](https://joss.theoj.org/papers/cc4b94fdf39b15f91d497c0b50b459e0)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@benediktdiemer & @phil-mansfield, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @adonath know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Checklists

📝 Checklist for @benediktdiemer

📝 Checklist for @phil-mansfield

editorialbot commented 1 year ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.56 s (157.7 files/s, 333937.5 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                                7          11672          54722         108893
Cython                          20            586           3791           2670
Python                          26            521           1471           2551
reStructuredText                21            311            301            470
YAML                             6             18             25            168
Bourne Shell                     3              3              0             92
Markdown                         2             13              0             77
TeX                              1              4              0             75
DOS Batch                        1              8              1             26
make                             1              4              7              9
TOML                             1              0              0              3
-------------------------------------------------------------------------------
SUM:                            89          13140          60318         115034
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 513

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1111/j.1365-2966.2009.15715.x is OK
- 10.3847/1538-3881/aadae0 is OK
- 10.3847/1538-4365/aaee8c is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

adonath commented 1 year ago

👋🏼 @tibordome @benediktdiemer @phil-mansfield this is the review thread for the paper. All of our communications will happen here from now on.

As a reviewer, the first step is to create a checklist for your review by entering

@editorialbot generate my checklist

as the top of a new comment in this thread.

These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. When doing so, please mention openjournals/joss-reviews#REVIEW_NUMBER so that a link is created to this thread (and I can keep an eye on what is happening). Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.

Please feel free to ping me (@adonath) if you have any questions/concerns.

benediktdiemer commented 1 year ago

Review checklist for @benediktdiemer

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

benediktdiemer commented 1 year ago

Just added issue https://github.com/tibordome/cosmic_profiles/issues/1 (cannot build cosmic-profiles on my computer).

adonath commented 1 year ago

Happy new year everyone! This is just a quick reminder for @benediktdiemer and @phil-mansfield to continue their review. Usually JOSS aims for a short review phase between 2-4 weeks, so let's try to aim for concluding the review soon.

@tibordome Can you please help resolving the installation issues?

Please also let me know whether there are any open questions or anything else I can help with. Thanks!

tibordome commented 1 year ago

Happy new year!

@benediktdiemer I have commented on the issue you have raised.

phil-mansfield commented 1 year ago

Hi all, sorry for dropping off the map for a bit. There were health issues outside my control which prevented me from working for several weeks. Now that this has been cleared up, I look forward to starting up again.

adonath commented 1 year ago

Thanks @phil-mansfield, good to hear you doing well again!

adonath commented 1 year ago

@phil-mansfield Please start your review using: @editorialbot generate my checklist

phil-mansfield commented 1 year ago

Review checklist for @phil-mansfield

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

adonath commented 1 year ago

Just for the record: there is an unresolved installation issue, which prevents @benediktdiemer to install and run and review the software. @tibordome recommended a solution, which remains to be tested.

@benediktdiemer Please let me know in case you won't be able to proceed with the review any time soon (let's say if you need more > 2 weeks). I think we should make sure to have regular progress in the review process. In order to save time I could also try to look into the installation issue, or maybe find an expert from the JOSS side to help. Thanks!

benediktdiemer commented 1 year ago

That's right. It's on my to-do list of course, but I've been unusually busy. The proposed solution (sym-linking my system compiler to something else) obviously has the potential to destroy my system if I don't do it right, so I need to take a little time deal with it. I do hope to get to it within the next two weeks.

benediktdiemer commented 1 year ago

Alright, I've finally had time to take another stab at this -- sorry for the delay. I still cannot install cosmic-profiles on my macbook, even after disabling SIP and changing the /usr/bin/cc link manually. Even though pip uses gcc then (for some reason), it fails (I've checked that gcc can compile OpenMP). I've given up on getting cosmic-profiles to work on a Mac, and instead installed it on a different (unix) server.

I successfully ran some cosmic-profiles routines on the z = 0 snapshot of IllustrisTNG100-3-DM -- it worked more or less right out of the box! In particular, I managed to measure the shapes of the 10 largest halos and plot the shape parameters as well as visualizations, following the instructions in the docs. Unfortunately, I did not manage to run the examples for density profiles. Here, the interface is fairly complex. I got lost/stuck where the documentation says

"The first argument dens_profs_fit is an array of shape containing the density profiles defined at radii r_over_r200_fit, possibly obtained via getDensProfsDirectBinning() or getDensProfsDirectBinning(), with some non-reliable values removed."

The two suggested functions are the same, but they are not documented, and it is not clear how "non-reliable" values should be removed (or what they are). It would be much better to provide a fully worked set of commands that the user can test, and/or complete the documentation. By looking at the example script, I realized that the r_over_r200 array can just be generated with np.logspace or whatever.

Finally, I tried executing the "Gadget HDF5 Example" script. I could mostly run it, but it still fails on the density profiles part (meaning I still couldn't quite verify the density profiles functionality). I list some of the issues below. I have started confirming most items in the checklist above. Here is a list of issues that I'd like to see addressed before confirming the rest of the checks:

Issues (code):

638 s: Starting fitDensProfs() with snap 099
best_fits = cprofiles.fitDensProfs(dens_profs[:,25:], ROverR200[25:], method = 'nfw', select = halos_select)
OSError: [Errno 12] Cannot allocate memory
(abbreviated)

NB this machine has 192 GB memory or so... not sure why this crashes. My suggestion is to try and design a minimal example script that can be run with reasonable resources (as long as the halos aren't too large).

Issues (paper/docs):

Suggestions for the future:

adonath commented 1 year ago

Thanks a lot @benediktdiemer for the thorough review!

I wasn't sure whether there are any automated tests, as required by the checklist above?

I think tests are located here: https://github.com/tibordome/cosmic_profiles/tree/master/cosmic_profiles/tests

benediktdiemer commented 1 year ago

Ah I missed that, just clicked the corresponding checkmark

tibordome commented 1 year ago

@benediktdiemer many thanks for your very helpful comments and suggestions!

I'll implement / address them as quickly as I can.

tibordome commented 1 year ago

I should have addressed all comments / suggestions in version 1.3.1.

@adonath A built wheel (.whl file) is provided for each version of the package on the pypi index, in addition to the source code (tar.gz file)

benediktdiemer commented 1 year ago

Thanks for making these changes @tibordome ! I updated my version and it all looks good now. I'm satisfied as far as this review goes.

adonath commented 1 year ago

Thanks a lot @tibordome for the update! This seems like multiple considerable improvements to the package. Proving wheel should also resolve the installation issues!

Thanks a lot @benediktdiemer for you persistency and finishing the review!

@phil-mansfield please take some time in the coming week and finish your review as well.

Once both reviews are finished I might submit some final comments from my side on the paper text. But I'm positive we can finally finish the review soon!

tibordome commented 1 year ago

That will be great. Perhaps there is some way to contact @phil-mansfield?

phil-mansfield commented 1 year ago

I am working on this now: my apologies.

phil-mansfield commented 1 year ago

Since Benedikt has already tested out running this on the pre-packaged particle formats, I tried my hand at getting this to run on some zoom-in data via DensShapeProfs. I haven't managed to get this to work yet, but here are my notes so far. Of the things I mention below, only the last is something that's stopping me from continuing.

Installing

I was able to install the software without much difficulty on one of Stanford's clusters after some version updates. It seems that there's a version requirement where a sub-package called scipy._lib needs to exist which isn't enforced by pip. I don't know what version of scipy this corresponds to, but the most recent one is fine. I haven't been successful in getting it to work on my mac laptop yet, but haven't tried too hard to debug that yet.

When importing cosmic profiles on our cluster, if I do it within a scheduled SLURM job, I get the error message

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[rome0182:88971] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

I am able to successfully import the code from a login node (although it takes some time, close to a minute the first time I import it during a session and about 6 seconds after that).

Supported Data Structures page

The "SNAP" variable is defined twice in the second code block

The comment on L_BOX in the code blocks says it's in cMpc/h. When I read this I wasn't sure if it was in units of length_in_cm, of if all box sizes needed to be in cMpc/h, although this is specified later.

Not all simulations are run in periodic boxes. Also, in zoom-in simulations it's common to recenter the simulation on the main halo (i.e. so the halo is at (0, 0, 0) at z=0). I was testing this on some zoom-in data and needed to manually add in some arbitrarily large offset. This is not a big deal, though.

For updateInUnitSystem and updateOutUnitSystem:

The class initializers have a very large number of arguments. I think it would be best to decrease this: it would make it less overwhelming to new users and less likely for people to put in errors. Might be best to make D_LOGSTART, D_LOGEND, D_BINS, IT_TOL, IT_WALL, IT_MIN, CENTER, and MIN_NUMBER_PTCS named arguments. Additionally, it seems like the SNAP, GROUP_DEST, and VIZ_DEST arguments might only be needed if the user calls certain functions to dump the binary profiles/plots. I'd recommend attaching them to the I/O functions that need them.

The shape of xyz isn't specified. (i.e. is it (3,N) or (N,3))

When I try to initialize DensShapeProfs with the example code in the documentation, I get the error

TypeError: __init__() takes exactly 16 positional arguments (14 given)

I think it's missing VIZ_DEST and CAT_DEST based on the docstring for the initializer.

My data was originally stored as doubles and this led to an error when passed to DensShapeProfs.

ValueError: Buffer dtype mismatch, expected 'float' but got 'double'

The required type isn't specified in the documentation, and it might be best to call, e.g. np.asarray() to do the type conversion internally.

The docstring for the class's intializer says that D_LOGSTART and D_LOGEND must be integers, but I don't think this is mentioned in the online documentation. (Additionally, I think a user could pretty reasonably want the logarithmic bins to start/end at something other than a power of ten)

With these issues fixed, I get the following error when I try to call the initializer.

    cprofiles = DensShapeProfs(xyz, mass_array, idx_cat, r_vir, SNAP, L_BOX, MIN_NUMBER_PTCS, D_LOGSTART, D_LOGEND, D_BINS, IT_TOL, IT_WALL, IT_MIN, CENTER, VIZ_DEST, CAT_DEST)
  File "cosmic_profiles/shape_profs/shape_profs_classes.pyx", line 459, in cosmic_profiles.shape_profs.shape_profs_classes.DensShapeProfs.__init__
  File "stringsource", line 660, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 350, in View.MemoryView.memoryview.__cinit__

I've checked the .flag field for the three numpy arryas I'm passing to DensShapeProfs (xyz, mass_array, and idx_cat) and each of them has WRITEABLE = True. Do you know what's happening here?

tibordome commented 1 year ago

Many thanks for the suggestions and comments, I'll start addressing them.

I think the

Local abort before MPI_INIT completed completed successfully

error message might be an MPI issue on the cluster itself, based on this thread. I usually do conda install -c conda-forge mpi4py before the actual install (either pip or conda) on any machine, perhaps I should add that to the documentation.

In a completely empty environment on a Linux machine, I can do pip install --user [cosmic-profiles](url) with no issues regarding scipy. On which platform does this occur?

I assume the actual error message is

ValueError: buffer source array is read-only

I've read somewhere that this might occur if the arrays are too large, but I increased the arrays to 1 billion \times 3 (for dm_xyz), and still no such error. I've tried on a cluster here in Cambridge and I couldn't reproduce it there either. Could you try this example script that you will find attached? ex_script.zip

phil-mansfield commented 1 year ago

I will try these out later today. Also, I forgot to mention this earlier, but all the comments about function style and that type of thing are pure suggestions. I trust your judgement on whether or not you want to follow them and it's okay if you end up deciding to ignore them. I'll demarcate those more clearly going forward.

phil-mansfield commented 1 year ago

Another quick suggestion. This would probably be a lot of work, so it's definitely more something to keep in mind for the future rather than something that makes sense to change now. It seems like the library is directly exposing Cython classes and functions to normal Python users. There's nothing wrong with this, but it means that it's pretty easy to accidentally stumble into a pretty difficult to understand/debug type errors, and also means that there's input which the library could sanitize and use that instead leads to a crash. What I personally like to do is wrap all my Cython functions in short Python wrappers than handle all the type conversions and checking.

If you wanted to see an example of this, here's a really simple Cython library I wrote a couple years back which packs integers with an arbitrary number of bits into a contiguous byte array. The Cython code is here and here's the wrapper Python.

RE: MPI

I have mpi4py 3.1.4 installed. I haven't noticed issues with it on the cluster in the past, but maybe it's related to the specific functions different programs are calling. Do you know if there's an MPI version requirement?

There's definitely something wacky doing on: it's odd for something to work on the login node but not on SLURM, but I've also never seen this happen with other MPI programs. I'll ping the cluster staff and see if they know. I'll get back to you on this if it's related to anything on your end.

RE: scipy

It makes sense that it works on an empty environment. In that case pip will just grab the latest version of scipy. The issue would be installing cosmic_profiles on a machine that already has an old version of scipy. I think you just want to figure out a version that doesn't have the problem anymore and add that as a scipy version requirement.

I foolishly didn't check what version of scipy I was running before updating it (it was about a year old), but the current version certainly works. There are some old stack overflow discussions about a similar message https://stackoverflow.com/questions/50418047/import-error-no-module-named-scipy-lib, but these predate when I installed scipy, so maybe this is a periodic problem.

RE: read-only

The issue was that I was accidentally passing a scalar for r_vir instead of a length-1 array.

I'll continue working on this can get more notes to you soon.

tibordome commented 1 year ago

I have implemented some suggestions on the dev branch already:

Hiding the cython functionalities is a good idea, many thanks for drawing my attention to that and your packages you mentioned. My cython classes that are exposed to the user are only nominally cython classes, the real cython functionalities are hidden. I can imagine that having a Python wrapper for each cython function might cause some overhead in terms of lines of code / readability etc, but I do now think exposing cdef classes is maybe not the smartest option. I'll try to avoid this in the future.

I'll have to check the MPI version requirement. I'll also see about scipy requirements.

I had VIZ_DEST and CAT_DEST tied to class methods in earlier versions but I found it slightly annoying to carry it around all the time, not just in the code internally (instead of self. variables of the class), but also when calling the methods as a user. Many methods would need SNAP, VIZ_DEST and CAT_DEST. I think the use case of allowing for multiple VIZ_DEST and CAT_DEST folders would be rather limited.

Speaking of use cases, I can imagine that someone would like to run two instances with different unit systems at the same time, but I think working with one simulation per time should be sufficient for most users. Not all the functions are class functions, and I am using the global unit variables fairly often (which should be avoided in general), so restructuring would take some time. Regarding different scale factors, the unit system in the code does not care about physical vs comoving units. Since it works on individual objects such as halos, the density and shape profiles vs the normalized radius remain unchanged.

In zoom-in simulations, L_BOX can be set to zero. The functions that ensure that PBCs are respected are in that case not doing anything. I am not sure I understand the offsets you mentioned.

tibordome commented 1 year ago

On Linux, pip seems to be less sensitive to the scipy version that is installed, and sometimes upgrades scipy automatically,

can occur sometimes. I have added to the docs that in general scipy==1.10.1 is a requirement though pip works (no install nor runtime issues) down to scipy==1.6.1.

phil-mansfield commented 1 year ago

Ah, oops, missed the comment in the documentation about L_BOX=0. That's my bad. I agree with you about the effort required to implement some of those suggestions: I think it's fine to keep things as-is.

phil-mansfield commented 1 year ago

I'll premark comments with an (S) to denote things that are pure suggestions (usually stylistic comments)

Comments related to getShapeCatLocal:

1: I encountered some problems related to translation invariance. I have a halo that I read in from my zoom-in simulations that's centred on ~[0, 0, 0] kpc with a virial radius of ~250 kpc. When I run getShapeCatLocal on it, I get a centre of [ 0.07100291 0.01147768 0.20005868] kpc (good!), but if I add 20,000 kpc to the x, y and z positions of the halo, I get [20490.355 20480.424 20485.807] kpc instead. So the new centre I'm getting is off by about 2.5%, which ends up being well outside the virial radius of the original halo. This was with a 3 million particle halo. I tried the same trick with a 200 particle halo and its centre was translation invariant. If I set L_BOX in your example code to largeish numbers, I can get the same effect (e.g. L_BOX = 100). Can you see if you can replicate this?

If I had to guess, I'd say that the issue was related to the resolution of a 32-bit flaoting point number. The floating point mantissa has 23 bits, so you'll have a dynamic range of 1/2^24, meaning about 1-in-16-million. So positions can still be represented accurately, but sums of a large numbers of positions can't. So, for example, say you were taking the center of mass by adding up x for all 3 million particles and dividing by 3 million, after the first couple hundred thousand particles are added, every subsequent x can't be fully represented with the dynamic range of the sub. Averaged across a large population, this leads to a systematic bias. (I'm using CENTER="mode", so it probably isn't this exact issue).

I think the way you want to handle this is to make sure that you never have any variables which are raw sums of particle properties. Let me know if you'd like to talk about specific strategies that avoid this type problem. Approaches vary depending on what you want to do.

2 (S): I'd recommend revisiting how skipped (np < MIN_NUMBER_PTCS) haloes work. Right now, the user supplies an idx_cat array with all their haloes, but the r_vir array needs to correspond to the number of objects with np >= MIN_NUMBER_PTCS, so the user needs to manually filter on the minimum number of particles anyway. Additionally, this means that the returned array is a different length than idx_cat, meaning that if the user wants to cross-reference shape profiles with the the particles or some pre-existing halo catalogue later, the indexing becomes complicated.

I'd recommend doing one of two things instead: (1) just letting the user decide which objects they want to pass and not handling the min particle cuts yourself. (2) using rvir and idx_cat of the same length and not completely removing haloes below the particle limit but filling all their properties with NaNs. This would make sure the indexing stays easy between input and output arrays.

3 (S): This function returns a large number of values, many of which have similar forms to one another. I think it makes the return signature a bit overwhelming and easy to make mistakes on. (I think the user would probably need to look up the documentation every time they call it rather than being able to remember the return signature). There are a couple potential tricks you can use to clean things up a bit. The first is returning an object where the fields are the different return quantities. This is, e.g., what the scipy.optimize routines do (e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize_scalar.html#scipy.optimize.minimize_scalar).

Another alternative that I like is structured numpy arrays. These are numpy arrays that tie various fields together and use dictionary-like field accessing semantics. So, e.g., if you had a structured array called shapes, you could get the minor axis vector in the 12th shell for the 20th halo with shapes["minor"][20,12] and can get all the minor vectors in the 12th bin with shapes["minor"][:,12], etc. One benefit of these is that you can perform filtering a single time for all the fields simultaneously, so, e.g. if you wanted to do a bunch of analysis on the 20th halo but wanted to remove unconverged bins, instead of needing to manually filter every bin, you could do something like

shape = shapes[20]
is_conv = np.~np.isnan(shape["d"])
shape = shape[is_conv]

In your case, I think you'd probably want one array for the profiles and one for the halo bulk properties. So something like this: (I haven't tested this code)

SHAPE_PROF_DTYPE = [("d", "f4"), ("s", "f4"), ("q", "f4"), ("minor", "f4", (3,)), ("inter", "f4", (3,), ("major", "f4", (3,)]
OBJECT_PROPERTIES_DTYPE = [("centre", "f4", (3,)), ("mass", "f4")]
shapes = np.zeros((n_pass, D_BINS+1), dtype=SHAPE_PROF_DTYPE)
objs = np.zeros((n_pass, D_BINS+1), dtype=OBJECT_PROPERTIES_DTYPE)
# etc.

Then the function would return

shapes, obj = getShapeCatLocal()

4 (S): If you do end up making this into structured arrays or an object, I'd recommend adding a field called something like "ok", which is true if the shape array is non-NaN at that index and false otherwise. It'll make people's code cleaner and a lot of people don't know how to test whether a quantity is NaN (e.g., they might write is_converged = d != np.nan).

tibordome commented 1 year ago

Regarding translation invariance: I do not think this is due to floating point arithmetic but due to

def respectPBCNoRef(xyz, L_BOX = None):
"""
Return modified positions xyz_out of an object that respect the box periodicity

If point distro xyz has particles separated in any Cartesian direction
by more than L_BOX/2, translate those particles accordingly.

:param xyz: coordinates of particles
:type xyz: (N^3x3) floats
:param L_BOX: periodicity of box (0.0 if non-periodic)
:type L_BOX: float
:return: updated coordinates of particles
:rtype: (N^3x3) floats"""
if L_BOX != 0.0:
    xyz_out = xyz.copy() # Otherwise changes would be reflected in outer scope (np.array is mutable).
    ref = 0 # Reference particle does not matter
    dist_x = xyz_out[:,0]-xyz_out[ref, 0]
    dist_y = xyz_out[:,1]-xyz_out[ref, 1]
    dist_z = xyz_out[:,2]-xyz_out[ref, 2]
    xyz_out[:,0][dist_x > L_BOX/2] = xyz_out[:,0][dist_x > L_BOX/2]-L_BOX
    xyz_out[:,0][dist_x < -L_BOX/2] = xyz_out[:,0][dist_x < -L_BOX/2]+L_BOX
    xyz_out[:,1][dist_y > L_BOX/2] = xyz_out[:,1][dist_y > L_BOX/2]-L_BOX
    xyz_out[:,1][dist_y < -L_BOX/2] = xyz_out[:,1][dist_y < -L_BOX/2]+L_BOX
    xyz_out[:,2][dist_z > L_BOX/2] = xyz_out[:,2][dist_z > L_BOX/2]-L_BOX
    xyz_out[:,2][dist_z < -L_BOX/2] = xyz_out[:,2][dist_z < -L_BOX/2]+L_BOX
    return xyz_out
else:
    return xyz

which chooses a particle randomly (ref=0) and translates other particles so that halos at the edges of a box appear as whole objects again. If the new center reported by CosmicProfiles is outside the box, e.g. [20490.355 20480.424 20485.807], then I think I should manually ensure within the code that this is translated back into the box. Is that the suggestion?

Regarding point 2, skipping halos: Actually r_vir supplied by the user should have the same length as idx_cat (i.e. provided for non-skipped and skipped objects). All the returned quantities, on the other hand, will work with N_pass (those that have np larger or equal to MIN_NUMBER_PTCS). I could provide len(nb_halos) shaped return values with lots of NaNs. However, I think determining which objects pass and which do not is a one-liner for the user. Also, the meaning of NaN would be overloaded since in addition to the non-convergence of the algorithm it could mean a not-sufficiently-resolved object. Alternatively, I could provide a functionality getIdxsPass() to provide this array upon request.

Points 3 & 4: I have adopted these suggestions on the dev branch. Very helpful indeed and it makes the return signatures much more readable. Many thanks! I was not aware of structured numpy arrays before. I think I settled on is_conv as the field that tells the user whether it converged, and shapes = getShapeCatLocal() only (rather than shapes, obj = getShapeCatLocal()) , but with obj = getMassesCenters() all the necessary information can be retrieved.

phil-mansfield commented 1 year ago

Glad to hear suggestions 3 & 4 were useful. Structured arrays are such a cool feature: they're basically very light-weight pandas dataframes: very addicting once you get used to them. (Or maybe pandas dataframes are more heavy-duty structured arrays? Not sure which came first.)

For point 1, I don't think this can be the cause of my issue. My apologies if I under-specified what I was doing. When I run this on my zoom-in halo's particles, I set L_BOX = 0.0, so this particular block of code never executed. I also don't think it can be a strict geometric issue: if I subsample my zoom-in halo to have a few hundred particles, I no longer have this issue.

I think the thing that I said which was confusing was that I can get the same problem with the example code you sent me a few days ago by changing L_BOX and the number of particles in the randomly-generated halo. The code shifts the halo into the middle of the box, so changing L_BOX is effectively just shifting the halo to larger offsets. Here are the inputs I used and what I got back:

L_BOX = np.float32(0.0) and halo_res = int(3e6): Accurate centres L_BOX = np.float32(100.0) and halo_res = int(3e6): centres which are off by ~0.8, which leads to NaN shape profiles L_BOX = np.float32(100.0) and halo_res = int(3e5): Accurate centres L_BOX = np.float32(1000.0) and halo_res = int(3e5): centres which are off by about 0.3, leaning to NaN shape profiles

To me this seems consistent with floating point errors. Can you check if you can replicate this?

I think one hacky way to deal with this would be do a trick similar to what you do in respectPBCNoRef, select a random reference point within the halo (actually, I'd select a bunch of truly random points and take the median to make sure you get something close to the centre: it's an old trick people use when implementing quicksort), producing delta_xyz = xyz - ref_xyz and doing all your centre-finding and shape analysis on that. Then when you report centres back to the user, you add ref_xyz back.

I think that would fix the issue for any halo with randomly-ordered particles (like your synthetic halo and like the formatting used for my zoom-in haloes). I think you could still get yourself in trouble if someone had stored particles along a hilbert curve rather than randomly and had a ton of particles, but I'm not sure it's worth worrying about working on such large objects unless you want to. The way you'd want to handle this is that instead of, e.g., taking averages by adding up a bunch of numbers and dividing at the end, you have a running average which is updated with each element. You could also gain some resolution by using float64 internally.

(Also, do you have some later code which makes sure that halo centers get put back within L_BOX? E.g., suppose that the L_BOX = 100, ref is at x = 99.9 and the halo centre is at x = 0.1. Does the centre get reported at x = 0.1 or as x = 100.1?)

tibordome commented 1 year ago

I managed to reproduce the error. I checked and removed the few apparent raw particle sums such as mass_total += masses[0]. I followed your advice and improved the calculation of CoM centers by recentering with respect to some random reference points, delta_xyz = xyz - ref_xyz

def calcCoM(xyz, masses):
    """ Calculate center of mass of point distribution

    :param xyz: coordinates of particles of type 1 or type 4
    :type xyz: (N,3) floats
    :param masses: masses of the particles
    :type masses: (N,3) floats
    :return: com, center of mass
    :rtype: (3,) floats"""
    # Average over some random particles and recentre with respect to that to avoid large numbers
    rng = default_rng(seed=0)
    choose = rng.choice(np.arange(len(xyz)), (min(50,len(xyz)),), replace = False)
    ref_xyz = np.average(xyz[choose], axis = 0)
    delta_xyz = xyz.copy()-ref_xyz
    mass_total = np.sum(masses)
    com = np.sum(np.reshape(masses, (len(masses),1))*delta_xyz/mass_total, axis = 0)
    com = com+ref_xyz
    return com

This alone could not resolve the floating point issues that accumulated with the conversion of units in the code to internal units and subsequent analysis, then back-conversion etc. What did resolve all such issues was to promote all float32 arrays to float64, float to double etc. as you suggested. I haven't looked into particles along a hilbert curve and other complicated configurations of points. I have pushed the improvements to the dev branch.

I am re-centering all objects now into the [0,L_box]^3 box when referring them to the user, many thanks for pointing that out.

Let me know whether you have more suggestions, I can make a release then combining all the new additions.

phil-mansfield commented 1 year ago

Wonderful. I'll continue working through the rest and will get back to you soon.

tibordome commented 1 year ago

Amazing. Thank you.

adonath commented 1 year ago

Thanks @phil-mansfield and @tibordome for continuing the discussion, I think we are on a good track to finish review soon!

phil-mansfield commented 1 year ago

I am still waiting to hear back from my computing cluster IT people about the MPI issue. The first set of people I spoke to didn't know and it's been bounced around to others.

Additional thoughts:

  1. (S) Personally, I would have split up the arguments to the shape profile constructors and the getShapeCat* methods differently. An alternative structure for the library would be to just have single functions that handle each of the profile tasks rather than an object+method split, but the way you've structured it allows you to split up the I/O and particle management from actually specifying what type of profile you want and how you want to calculate it. This is a good idea! It means someone could, e.g., try out different ways of calculating shapes without needing to reload the particles or resupply halo information. However, some of the arguments that specify how the profiles should be calculated are in the methods (e.g. reduced weighting) and some are in the constructor (radial bins, iteration counts). This means that if you want to tinker with the specifics of how you're calculating shape profiles (say, if I wanted to overplot a couple different binning choices to make sure I didn't have too much noise at small radii) sometimes you can just call the method again and sometimes you need to cal the constructor again (and potentially repeat all the I/O). I'd recommend putting everything that specifes the details of the profile in the method.

  2. (S) Regardless of how you want to handle this, I'd recommend making the call signatures of the density profile mthods/constructors as close to the shape profiles/constructors as you can. The biggest difference I see is that radial bins are handled differently. In the density profiles, radial bins are passed in to the method and the user can specify whatever bins they'd like, while in the shape classes, the user is forced to use a particular binning scheme and has to specify it in the constructor. I'd recommend using the same technique in the shape classes.

  3. (S) Is there any technical reason that the shape and density profiles can't use the same constructors? I think it would simplify things. The main thing that I see is that the density constructor doesn't need the iteration arguments, but those can be passed to the shape calculation methods as optional keyword arguments instead.

On the other hand, with direct_binning = False, we perform a kernel-based density profile estimation, cf. Reed et al. 2005. Kernel-based approaches allow estimation of profiles without excessive particle noise.

Can you please say some more about what is being done with this KDE method? Reading Reed's appendix, it seems like it's a Gaussian kernel whose size is given by some power law and it's unclear to me how they decided that this was a good power law to use. KDEs are the type of thing you often need to fiddle with, so it might be good to give the user some control over the details. (Also, this is a bit pedantic on my side, but I don't think KDEs reduce the noise in density profiles. They smooth out the profile and remove the visual features that allow us to tell that the profile is noisy, but they won't change the statistical uncertainty associated with Poisson noise. This doesn't mean that they're bad to use.)

  1. When citing the source papers for the different profiles you can fit, there's no year for the NFW citation and I think Zhao et al. 1996 is the right citation for the generalized NFW profile https://academic.oup.com/mnras/article/278/2/488/951933

  2. (S) When fitting profiles, it might be nice to allow users to fix a parameter, e.g., people often fix alpha to some value they like when fitting Einasto profiles, or manually set one of the generalized NFW profile indices to something.

  3. Can you please specify how the fits are being performed (e.g. non-linear Levenberg-Marquardt, MCMC, manual likelihood maximization, etc.) and what error model is being used (e.g. are you using uniform log-error bars, are you estimating error bars from Poisson statistics, etc.)

  4. (S) For choosing an inner converged radius of a profile, I think Ludlow et al. 2019 is probably the most cutting edge reference https://ui.adsabs.harvard.edu/abs/2019MNRAS.488.3663L/abstract

  5. Can you please explain the specifics of how your mock halo generation algorithm works? Based on the description, I would have guessed that it would be sampling the density profile, i.e., in the spherically symmetric case, the probability of a point being generated at r is proportional to r^2 rho(r), (and in the ellipsoidal case, it'd proportional to r_ell^2 rho(r_ell)) then some random uniform angles would be generated at that radius. However, the point distribution is very structured in the plots included on this page, which makes me think that something else is being done (just eyeballing it, it looks like you analytically figure out how many particles to generate at discrete r_ell's, then have some deterministic scheme for spreading them out over the ellipsoidal shell).

  6. (S) The reason I'm asking about this is that personally I think most users who are interested in generating mock point distributions would want randomly generated points. One of the main things I do with mock haloes is test out the impact of Poisson noise and do sanity checks on algorithms that need to be tested on realistic point distributions. With a scheme like the one I'm guessing is being used here, neither of those things will be possible. Additionally, if particles are being put at discrete, fixed r_ell, it means that looking at the halo properties with bins that aren't exactly aligned with the generating bins will lead to weird spikes and artefacts.

(Also, I've never seen this published before but I'm pretty sure that the iterative shape routine that everyone likes to use biases to low c/a ratios for real point distributions, but not for ones generated at a fixed r_ell shells. This isn't a critique of writing a library that uses this method, since it's so wide-spread already, but it means that people trying to understand the limitations of their measurements might be misled.)

I'd recommend adding a method that generates a random halo. The way to do this would be to use inverse transfer sampling as a function of r_ell, randomly generating theta & phi within the r_ell-based coordinate system, then transforming all the points to normal cartesian coordinates. I'm guessing you're already doing the last of those three steps. You can do the first by by generating a mass uniformly at random between (0, M(r_max)), and inverting M(r_ell) to get r_ell(M) so this turns into a randomly generated r_ell that follows the correct density profile. (It's expensive to manually integrate and invert M(r_ell), so you could just do it once and interpolate). Then you can get random angles with this method: https://mathworld.wolfram.com/SpherePointPicking.html.

If you're interested in generating random point distributions, but are confused by that description, let me know. I think I have some code that does this sitting on a old laptop somewhere: I can track it down and send it to you as an example if you want.

  1. (S) I think the typical physicist would be confused by what the caching code is doing. I'd recommend backing up a bit and describing it at a level that someone who hasn't written code in a memory-managed language/taken a systems class would understand.

Also, I was a bit confused by what exactly the use_memory_up_to argument was doing. On first read I thought it was the size of the cache, but it's actually the size of the things /not/ in the cache. Maybe I'm unfamiliar with psutil, but reading through the docs it seems like the cache size is scaled by the amount of unallocated but available memory. Is that correct? The amount of unallocated memory will change s the program runs (and as the cache expands). Does this get recalculated dynamically? Also, sometimes when running on a cluster, you can specify how much memory you want to use (which is smaller than the actual amount of RAM on the node you're running the job on) and if your program ever goes over that limit, an automated task sees that and manually kills your job. It's possible that this scheme would mean a user running code on such a setup would find it hard to limit the size of their cache correctly.

I'd recommend just having the user specify the cache size directly.

tibordome commented 1 year ago

1) That is a fair point. I have moved the ROverR200, IT_TOL, IT_WALL, IT_MIN, REDUCED, SHELL_BASED parameters away from the constructor of the shape profiling class, bundled them into one dictionary, katz_config, and transferred them into the respective methods that need them.

2) I have replaced D_LOGSTART, D_LOGEND and D_BINS by ROverR200, to match the density profiling class.

3) Since internally the code is relying on multiple levels of inheritance, the shape profiling class had all functionalities of the density profiling class, + shape profiling. Indeed, it was thus redundant to provide two interfaces to the user. Any profiling now uses one constructor.

4) On page 85 of Reed et al 2005, they motivate the choice of the power law: "To calculate density profiles without excessive particle noise, we developed a novel kernel-based algorithm1 (Merritt & Tremblay 1994); see Appendix for a full description. Both the width and the shape of the kernel are varied with radius; the variation in shape is significant near the origin, where a symmetric kernel would ‘over- flow’ the r = 0 boundary. The window width must be carefully chosen to reduce Poisson noise (‘variance’) without oversmooth- ing (‘biasing’) the profile. In general, it may be shown (e.g. Scott 1992, p. 130) that when the window width is chosen to minimize the mean square error of the estimate, most of the error will come from the variance. Window widths large enough to eliminate the ‘wig- gles’ will generally bias the slope. In addition, the window width should vary with local particle density (Abramson 1982), roughly as ρ−1/2 . We used a kernel window width that varied as r 0.5 set at h0.1 = 0.005r vir at 0.1r vir as it yields profiles and profile slopes in good agreement with binned profiles created with TIPSY 2 ), and it preserves the central cusp and major substructure."

Hence I also choose $h(r) = 0.005 \left(\frac{r}{0.1r{\text{vir}}}\right)^{0.5}r{\text{vir}}$. I wouldn't know over which parameter to give control to the user here.

5) Added

6) Added. E.g. if user provides method = {'profile': 'einasto', 'alpha': 0.18}, then alpha will be kept fixed.

7) The fits are performed using a truncated Newton (TNC) algorithm, which was working best I thought. Errors (shaded regions) are simply estimated (if there is more than 1 profile) based on 25 and 75% quantiles in each shell.

8) Added.

9) I was inspired by Zemp et al 2011, https://arxiv.org/abs/1107.5582, who on page 3 mention "For the generation of a uniform distribution of points on a surface of an arbitrary-shaped ellipsoid, which is needed for setting up an ellipsoid with a given density profile, a method as outlined in Section 2.5.5 of Rubinstein & Kroese (2007) is used. "

So in general, I create a mock halo of mass tot_mass consisting of approximately res particles. The model_pars array contains the parameters for the profile model given in profile. I first determine the number of particles in each shell such that the total number of all particles is close to res. Then I draw points uniformly from an ellipsoidal shell volume, centered at the origin, using drawUniformFromShell(). The scheme of spreading the particles out over the shell is not deterministic.

10) I am not too sure what a real random halo would be like compared to 9). "Additionally, if particles are being put at discrete, fixed r_ell, it means that looking at the halo properties with bins that aren't exactly aligned with the generating bins will lead to weird spikes and artefacts." I do not think there would be any spikes. The mock halo is such that the constraints tot_mass, res, model_pars, profile, a, b, c provided by the user are followed as closely as possible, but there are no empty shells etc. Zemp et al 2011 has quantified (again in https://arxiv.org/abs/1107.5582) the biases incurred in the respective iterative algorithms.

I have looked at the suggested random halo generation implementation, but I cannot quite figure out what would be provided by the user (e.g. resolution of halo?, total mass of halo?, a, b, c, model_pars, profile, r_max?), and what is calculated. If I sample mass uniformly at random between (0, M(r_max)), then isn't it possible that one single particle already has all the mass of my halo of target mass M(r_max)? Also I think https://mathworld.wolfram.com/SpherePointPicking.html does not explain how to sample angles uniformly in an ellipsoidal coordinate frame. If my halo is very elongated sampling from a sphere will deposit most particles at the 'ends of the cigar' so to say.

11) The caching functionality is a hodgepodge of various things, and I'm not claiming the use_memory_up_to works in all cases, but usually it did behave as expected on my machine at least. I have added some more explanations, including of maxsize. I have added that if a user wants to minimize the RAM used by CosmicProfiles, use_memory_up_to should be False and maxsize should be zero. The amount of free RAM available should get recalculated dynamically I think. I cannot change use_memory_up_to and replace it with the RAM used by the cache, since the functionality relies on this crucial line:

``full = (psutil.virtual_memory().available < use_memory_up_to)``

The cache size (in terms of number of objects) is specified by maxsize.

phil-mansfield commented 1 year ago

Great!

RE: 4 - Right, I read through Reed et al 2005. I think the key part is in their appendix is this:

The optimal way to vary hi is according to Abramson’s (1982) rule: hi ∝ ν−α(ri), α = 1/2. (A7) Since we don’t know ν(r) a priori, we instead varied hi ac- cording to hi ∝ rβ . (A8) We found that β = 1/2 gave good results, which is reason- able since density profiles are close to ν ∼ r−1. One could improve on this by first constructing a pilot estimate of ν then using Abramson’s rule.

Which I read as meaning that they just checked a couple values of beta and found that it worked pretty good for their simulations and for the part of the profile that they were interested in. I think it's fine to leave this as-is, but if you wanted to give people some control you could allow them to pass a function as an argument which implements their own r -> h mapping.

RE: 9 - I misunderstood what you were doing. I thought the red/green points in this image https://cosmic-profiles.readthedocs.io/en/latest/_images/Oblate.png were the mock halo points, which would have implied that they were being generated in some sort of lattice. I still don't think this method is 100% correct, but the issues that I thought would be big problems aren't there, so I'm satisfied.

Still haven't worked out the MPI issue with our cluster staff, but I don't think we need to wait for that to be resolved: I can follow up with you if we figure out anything interesting. @adonath I'm happy to give my blessing to move forward.

adonath commented 1 year ago

Thanks a lot @phil-mansfield for the in-depth review and thanks @tibordome integrating the comments! I will to a final proof-read of the paper and then tag the editor-in-chief to prepare publication.

tibordome commented 1 year ago

@phil-mansfield, many thanks for your detailed feedback

I might allow the user to provide their own r -> h mapping in some later version, I'll keep that in mind.

@adonath: The updated paper is on the dev branch, https://github.com/tibordome/cosmic_profiles/blob/dev/paper/paper.md.

adonath commented 1 year ago

@editorialbot set dev as branch

editorialbot commented 1 year ago

Done! branch is now dev

adonath commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

adonath commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1111/j.1365-2966.2009.15715.x is OK
- 10.1093/mnras/stac3766 is OK
- 10.3847/1538-3881/aadae0 is OK
- 10.3847/1538-4365/aaee8c is OK
- 10.1093/mnras/stac878 is OK

MISSING DOIs

- None

INVALID DOIs

- None