pelahi / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License

Using positions in files rather than particle IDs #26

Closed: JBorrow closed this issue 5 years ago

JBorrow commented 5 years ago

Problem: At the moment, VELOCIraptor outputs .catalog_particles files containing the IDs of all particles that belong to a group. This means that when we want to find the particles that belong to a given halo/galaxy, we must read in all of the particle IDs in the snapshot and search through them.
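To make the cost concrete, here is a minimal sketch of the lookup described above, using hypothetical toy data rather than real VELOCIraptor or snapshot files. The names `snapshot_ids`, `halo_ids` and `positions_in_file` are made up for illustration. The point is that the lookup table has to be built from every particle ID in the snapshot, even when only one small halo is wanted.

```python
# Hypothetical toy data: particle IDs in the order they appear in a snapshot,
# and the IDs a halo finder reports for one halo.
snapshot_ids = [107, 42, 13, 99, 58, 21]
halo_ids = [99, 13, 21]


def positions_in_file(snapshot_ids, halo_ids):
    """Return the index in the snapshot of each halo particle ID."""
    # Building this lookup touches every particle in the snapshot,
    # which is the cost the issue is trying to avoid.
    index_of = {pid: i for i, pid in enumerate(snapshot_ids)}
    return [index_of[pid] for pid in halo_ids]


indices = positions_in_file(snapshot_ids, halo_ids)
print(indices)
```

If the halo finder stored these indices directly, the snapshot could be read with the indices alone and no search would be needed.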

Proposed solution: It would be great to have each particle's position in the file (i.e. its index) output here instead of the particle ID. Even better would be to have a hierarchical structure as follows:

File: halo.catalog_particles

Halos/
    PartType0/
        Halo0/
            <List of particle positions>
        Halo1/
            <List of particle positions>
        ...
    PartType1/
        Halo0/
            <List of particle positions>
        Halo1/
            <List of particle positions>
        ...
    ...

pelahi commented 5 years ago

Hi Josh, there is functionality to output extra information such as what file and where in the file a particle is located. I'll update this to improve its functionality and add it to the branches

MatthieuSchaller commented 5 years ago

Just to make sure, Josh means in the stand-alone version of VR, not the library bit.

JBorrow commented 5 years ago

I'm guessing by this you mean the FoF group output? At the moment that's pretty huge as it's stored as text. In terms of file size, dumping this as HDF5 (e.g. with the structure above) would result in a ~20x disk space saving.

P.S. Sorry for my slow reply

jchelly commented 5 years ago

This is probably a bit late but I have a couple of comments:

First, having the IDs of particles in groups stored in the files is very useful for building merger trees because you need to know which particles are in which group at each snapshot but you don't really need the positions or velocities. This means you can build trees from just the halo finder output - you don't need to keep all of the snapshots on disk.

Second, HDF5 isn't really optimized for having huge numbers of tiny datasets, so having one HDF5 group per halo could make the files larger and slower to read (although I think recent versions of HDF5 might have alleviated this a bit). It's also not great for Python, where the fastest way to deal with this type of data is to have it in one big array and use NumPy vector functions to operate on all the particles at once.
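To illustrate the first point above, matching halos between snapshots needs only the particle IDs in each group. The sketch below applies a deliberately simple "most shared particles" rule to hypothetical toy data. It is not the algorithm used by VELOCIraptor or by any particular tree builder, and the names `best_descendant` and `halos_later` are made up for illustration.

```python
def best_descendant(progenitor_ids, later_halos):
    """Return the key of the later halo sharing the most particle IDs.

    Returns None if no later halo shares any particle with the progenitor.
    """
    progenitor = set(progenitor_ids)
    best_key, best_shared = None, 0
    for key, ids in later_halos.items():
        shared = len(progenitor & set(ids))
        if shared > best_shared:
            best_key, best_shared = key, shared
    return best_key


# Hypothetical halos at the later snapshot, each given only by its particle IDs.
halos_later = {"halo_a": [1, 2, 3, 10], "halo_b": [4, 5, 6]}

# A halo at the earlier snapshot, again described by IDs alone.
match = best_descendant([1, 2, 3, 4], halos_later)
print(match)
```

No positions, velocities or snapshot files are read at any point, which is why keeping the IDs in the halo finder output is enough for this kind of matching.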

pelahi commented 5 years ago

I will actually add a VR output that consists of all particles in groups, the index of the file they are in (for SWIFT, this would not be necessary), their index within that file, and their group ID. This adds another output but allows for easy extraction in post-processing. I will also add something similar for particles that are in a spherical overdensity but not in a FOF group (or subhalo). The format would be single large arrays with particles ordered by group ID. Combined with the information in catalog_groups, which contains the number of particles in each group and offsets, this makes the information easy to parse. Thoughts?
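A minimal sketch of how such a layout could be parsed, assuming flat arrays ordered by group plus a per-group particle count. The array names (`file_index`, `index_in_file`, `group_size`) and the toy values are hypothetical and do not reflect the actual dataset names in any VELOCIraptor output.

```python
from itertools import accumulate

# Hypothetical flat arrays, ordered by group: one entry per particle in a group.
file_index = [0, 0, 1, 0, 1, 1]     # which snapshot file each particle is in
index_in_file = [5, 9, 2, 7, 0, 4]  # where in that file the particle sits

# Hypothetical per-group particle counts (the kind of information the
# comment above says catalog_groups provides).
group_size = [2, 3, 1]

# Offset of each group's first particle in the flat arrays: [0, 2, 5].
offsets = [0] + list(accumulate(group_size))[:-1]


def particles_of_group(group):
    """Return (file index, index in file) pairs for one group."""
    start = offsets[group]
    stop = start + group_size[group]
    return list(zip(file_index[start:stop], index_in_file[start:stop]))


print(particles_of_group(1))
```

Each group is a contiguous slice of the large arrays, so reading one halo costs a single slice rather than a search, and the same arrays can still be processed all at once.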