skyportal / healpix-alchemy

SQLAlchemy extensions for HEALPix spatially indexed astronomy data
BSD 3-Clause "New" or "Revised" License
28 stars 12 forks source link
astronomy database python

PyPI Status GitHub Workflow Status codecov DOI

To acknowledge HEALPix Alchemy in a scholarly article, please cite the following two references.

HEALPix Alchemy paper

Singer, L. P., Parazin, B., Coughlin, M. W., et al. (2022). "HEALPix Alchemy: Fast All-Sky Geometry and Image Arithmetic in a Relational Database for Multimessenger Astronomy Brokers." Astronomical Journal 163 209. https://doi.org/10.3847/1538-3881/ac5ab8

Code archive on Zenodo

Singer, L. P., Parazin, B., Coughlin, M. W., et al. (2022). "skyportal/healpix-alchemy: Version 1.0.1." https://doi.org/10.5281/zenodo.5768564

HEALPix Alchemy

The healpix_alchemy Python package is an extension for the SQLAlchemy object relational mapper. It adds region and image arithmetic to PostgreSQL (version 14 and newer) databases. It accelerates queries between point clouds, regions, and images (sometimes known in the geospatial community as rasters) by storing multi-order indices in PostgreSQL's range types.

The healpix_alchemy project is designed for astronomy applications, particularly for cross-matching galaxy catalogs, observation footprints, and all-sky images like gravitational-wave probability sky maps or even dust maps. However, it could be used in any context in which geometry is embedded on a unit sphere.

healpix_alchemy is lean and minimalist because it leverages several existing projects: it consists of little more than a few lines of glue code to bind together MOCPy, SQLAlchemy, and PostgreSQL's range types.

healpix_alchemy serves a purpose similar to full-featured astronomy-focused spatial extensions like Q3C, H3C, and pg_healpix, and geospatial extensions like PgSphere and PostGIS. What sets healpix_alchemy apart from these is that it is written in pure Python and requires no server-side database extensions. Consequently, healpix_alchemy can be used with managed PostgreSQL databases in the cloud like Amazon RDS and Google Cloud SQL.

Theory of operation

HEALPix basics

HEALPix is a scheme for subdividing and indexing the unit sphere, originally described by Górski et al. (2005). Although it was originally designed for cosmic microwave background analysis, it has found many uses in astronomy, particularly through multi-order coverage (MOC) maps and hierarchical progressive surveys (HiPS) used extensively in the Aladin astronomical information system. It is also used by LIGO and Virgo to store and communicate gravitational-wave probability sky maps.

HEALPix can be thought of as a tree. At the lowest resolution, level 0, HEALPix subdivides the sphere into 12 equal-area base tiles, assigned integer indices 0 through 11. At level 1, each of the 12 base tiles is subdivided into 4 tiles. Every subsequent level divides each of the preceding level's tiles into 4 new tiles. At a given level, each of the base pixels has been divided into 4level pixels (nside = 2level pixels on each side). Thus there are npix = 12×4level pixels at a given resolution, assigned integer indices from 0 through (npix-1). This is called the NESTED indexing scheme. (There is also a RING indexing scheme in which the indices advance from east to west and then from north to south.)

A HEALPix tile, a node in the HEALPix tree, is fully addressed by three pieces of information: the indexing scheme (RING or NESTED), the resolution level (level or equivalently nside), and the pixel index (ipix, an integer between 0 and npix-1).

The image below, reproduced from https://healpix.jpl.nasa.gov, illustrates the first 4 levels of refinement of a HEALPix grid.

The first four levels of HEALPix subdivision of the unit sphere

HEALPix interval sets

A region on the sphere can be encoded by a collection of disjoint HEALPix tiles, potentially at a mix of different resolution levels. Typically, large low-resolution tiles are used on the interior of the region, and small high-resolution tiles are used on the boundary. This is called a multi-order coverage (MOC) map. An example, reproduced from the MOCPy documentation, is shown below.

An example HEALPix multi-order coverage map

Much like MOCs, LIGO/Virgo/KAGRA gravitational-wave probability sky maps are stored as multi-resolution HEALPix data sets, but with a vector of floating-point values attached to each tile. A multi-order refinement mesh from an example sky map is shown below, reproduced from Singer & Price (2015).

Example multi-order refinement mesh from a gravitational-wave probability sky map

Reinecke & Hivon (2015) introduced HEALPix interval sets as an alternative encoding of MOCs that enables fast and simple unions, intersections, and queries. In an interval set, each HEALPix tile is described by the interval of pixel indices at some very high resolution levelmax that are descendents of that tile. In an interval set, a region is encoded as a disjoint collection of such intervals. A tile with a NESTED address given by (level, ipix) may be described as the half-open interval

[ipix 4levelmax - level, (ipix + 1) 4levelmax - level).

We use levelmax = 29 because this is the highest resolution at which pixel indices can be stored in a signed 64-bit integer. At this resolution, each pixel is scarcely 0.4 milliarcseconds across.

The interval set representation is adventageous because there are simple and fast algorithms for interval arithmetic and set operations. Interval analysis appears in a suprising variety of scientific contexts from genomics to gravitational wave data quality. Because of the many business applications of interval arithmetic, intervals are also supported in the PostgreSQL database through its range types.

Spatial primitives in healpix_alchemy

The healpix_alchemy package provides two custom column types for SQLAlchemy:

healpix_alchemy.Point

This class represents a point. A column of this type could store the positions of galaxies in a catalog. Under the hood, it is just a BIGINT.

Wherever you need to bind a Python value to a healpix_alchemy.Point, you may provide any one of the following:

healpix_alchemy.Tile

This class represents a HEALPix tile. A table containing a column of this type and a foreign key could store MOCs or gravitational-wave probability maps. Under the hood, it is just an INT8RANGE.

Wherever you need to bind a Python value to a healpix_alchemy.Tile, you may provide any one of the following:

Installation

You can install healpix_alchemy from the Python Package Index using pip:

$ pip install healpix-alchemy

Development Installation

Contributions are welcome! This package uses the Poetry packaging and dependency tool and pytest for unit tests. To install healpix_alchemy in a development environment, follow these instructions.

  1. Install Poetry by following the official Poetry installation instructions.

  2. Clone this repository:

    $ git clone https://github.com/skyportal/healpix-alchemy.git
    $ cd healpix-alchemy
  3. Initialize the Poetry-managed virtual environment with healpix_alchemy and all of its dependencies installed by running this command:

    $ poetry install

    Now, you can enter a shell inside the virtual environment by running:

    $ poetry shell
  4. To run the test suite, including the examples in this README file, run this command inside the Poetry shell:

    $ pytest

Example

First, some imports:

>>> from sqlalchemy import orm
>>> import sqlalchemy as sa
>>> import healpix_alchemy as ha

Set up tables

This example will use the SQLAlchemy declarative extension for describing table schema using Python classes.

SQLAlchemy needs to know the name for each table. You can provide the name by setting the __tablename__ attribute in each model class, or you can create a base class that generates the table name automatically from the class name.

>>> @orm.as_declarative()
... class Base:
...
...     @orm.declared_attr
...     def __tablename__(cls):
...         return cls.__name__.lower()

Each row of the Galaxy table represents a point in a catalog:

>>> class Galaxy(Base):
...     id = sa.Column(sa.Text, primary_key=True)
...     hpx = sa.Column(ha.Point, index=True, nullable=False)

Each row of the Field table represents a ZTF field:

>>> class Field(Base):
...     id = sa.Column(sa.Integer, primary_key=True)
...     tiles = orm.relationship(lambda: FieldTile)

Each row of the FieldTile table represents a multi-resolution HEALPix tile that is contained within the corresponding field. There is a one-to-many mapping between Field and FieldTile.

>>> class FieldTile(Base):
...     id = sa.Column(sa.ForeignKey(Field.id), primary_key=True)
...     hpx = sa.Column(ha.Tile, primary_key=True, index=True)

Each row of the Skymap table represents a LIGO/Virgo HEALPix localization map.

>>> class Skymap(Base):
...     id = sa.Column(sa.Integer, primary_key=True)
...     tiles = orm.relationship(lambda: SkymapTile)

Each row of the SkymapTile table represents a multi-resolution HEALPix tile within a LIGO/Virgo localization map. There is a one-to-many mapping between Skymap and SkymapTile.

>>> class SkymapTile(Base):
...     id = sa.Column(sa.ForeignKey(Skymap.id), primary_key=True)
...     hpx = sa.Column(ha.Tile, primary_key=True, index=True)
...     probdensity = sa.Column(sa.Float, nullable=False)

Finally, connect to the database, create all the tables, and start a session.

>>> engine = sa.create_engine('postgresql://user:password@host/database')
>>> Base.metadata.create_all(engine)
>>> session = orm.Session(engine)

Populate with sample data

Load the 2MASS Redshift Survey into the Galaxy table. This catalog contains 44599 galaxies.

It may take up to a minute for this to finish. Advanced users may speed this up significantly by vectorizing the conversion from SkyCoord to HEALPix indices and using SQLAlchemy bulk insertion.

>>> from astropy.coordinates import SkyCoord
>>> from astroquery.vizier import Vizier
>>> vizier = Vizier(columns=['SimbadName', 'RAJ2000', 'DEJ2000'], row_limit=-1)
>>> data, = vizier.get_catalogs('J/ApJS/199/26/table3')
>>> data['coord'] = SkyCoord(data['RAJ2000'], data['DEJ2000'])
>>> for row in data:
...     session.add(Galaxy(id=row['SimbadName'], hpx=row['coord']))
>>> session.commit()

Load the footprints of the Zwicky Transient Facility fields into the Field and FieldTile tables.

It may take up to a minute for this to finish. Advanced users may speed this up significantly by using SQLAlchemy bulk insertion.

>>> from astropy.table import Table
>>> from astropy.coordinates import SkyCoord
>>> from astropy import units as u
>>> url = 'https://raw.githubusercontent.com/ZwickyTransientFacility/ztf_information/9fd0ba8842709f42a134c88827309ccab728fcb7/field_grid/ztf_field_corners.csv'
>>> for row in Table.read(url):
...     field_id = int(row['field'])
...     corners = SkyCoord(row['ra1', 'ra2', 'ra3', 'ra4'],
...                        row['dec1', 'dec2', 'dec3', 'dec4'],
...                        unit=u.deg)
...     tiles = [FieldTile(hpx=hpx) for hpx in ha.Tile.tiles_from(corners)]
...     session.add(Field(id=field_id, tiles=tiles))
>>> session.commit()

Load a sky map for LIGO/Virgo event GW200115_042309 (S200115j) into the Skymap and SkymapTile tables.

>>> url = 'https://gracedb.ligo.org/apiweb/superevents/S200115j/files/bayestar.multiorder.fits'
>>> data = Table.read(url)
>>> tiles = [SkymapTile(hpx=row['UNIQ'], probdensity=row['PROBDENSITY']) for row in data]
>>> session.add(Skymap(id=1, tiles=tiles))
>>> session.commit()

Last, run ANALYZE to prepare the data for use:

>>> session.execute(sa.text('ANALYZE'))
<sqlalchemy.engine.cursor.CursorResult object at 0x...>

Sample Queries

What is the area of each field?

>>> query = sa.select(
...     FieldTile.id, sa.func.sum(FieldTile.hpx.area)
... ).group_by(
...     FieldTile.id
... ).limit(
...     5
... )
>>> for id, area in session.execute(query):
...     print(f'Field {id} has area {area:.3g} sr')
Field 199 has area 0.0174 sr
Field 200 has area 0.0174 sr
Field 201 has area 0.0174 sr
Field 202 has area 0.0174 sr
Field 203 has area 0.0174 sr

How many galaxies are in each field?

>>> count = sa.func.count(Galaxy.id)
>>> query = sa.select(
...     FieldTile.id, count
... ).filter(
...     FieldTile.hpx.contains(Galaxy.hpx)
... ).group_by(
...     FieldTile.id
... ).order_by(
...     count.desc()
... ).limit(
...     5
... )
>>> for id, n in session.execute(query):
...     print(f'Field {id} contains {n} galaxies')
Field 1739 contains 343 galaxies
Field 699 contains 336 galaxies
Field 700 contains 311 galaxies
Field 225 contains 303 galaxies
Field 1740 contains 289 galaxies

What is the probability density at the position of each galaxy?

>>> query = sa.select(
...     Galaxy.id, SkymapTile.probdensity
... ).filter(
...     SkymapTile.id == 1,
...     SkymapTile.hpx.contains(Galaxy.hpx)
... ).order_by(
...     SkymapTile.probdensity.desc()
... ).limit(
...     5
... )
>>> for id, p in session.execute(query):
...     print(f'{id} has prob. density {p:.5g}/sr')
2MASX J02532153+0632222 has prob. density 20.701/sr
2MASX J02530482+0555431 has prob. density 20.695/sr
2MASX J02533119+0628252 has prob. density 20.669/sr
2MASX J02524584+0639206 has prob. density 20.656/sr
2MASX J02534120+0615562 has prob. density 20.567/sr

What is the probability contained within each field?

>>> area = (FieldTile.hpx * SkymapTile.hpx).area
>>> prob = sa.func.sum(SkymapTile.probdensity * area)
>>> query = sa.select(
...     FieldTile.id, prob
... ).filter(
...     SkymapTile.id == 1,
...     FieldTile.hpx.overlaps(SkymapTile.hpx)
... ).group_by(
...     FieldTile.id
... ).order_by(
...     prob.desc()
... ).limit(
...     5
... )
>>> for id, prob in session.execute(query):
...     print(f'Field {id} probability is {prob:.3g}')
Field 1499 probability is 0.165
Field 1446 probability is 0.156
Field 452 probability is 0.154
Field 505 probability is 0.0991
Field 401 probability is 0.0962

What is the combined area of fields 1000 through 2000?

In the next two examples, we introduce healpix_alchemy.func.union() which finds the union of a set of tiles. Because it is an aggregate function, it should generally be used in a subquery.

>>> union = sa.select(
...     ha.func.union(FieldTile.hpx).label('hpx')
... ).filter(
...     FieldTile.id.between(1000, 2000)
... ).subquery()
>>> query = sa.select(
...     sa.func.sum(union.columns.hpx.area)
... )
>>> result = session.execute(query).scalar_one()
>>> print(f'{result:.3g} sr')
9.33 sr

What is the integrated probability contained within fields 1000 through 2000?

>>> union = sa.select(
...     ha.func.union(FieldTile.hpx).label('hpx')
... ).filter(
...     FieldTile.id.between(1000, 2000)
... ).subquery()
>>> prob = sa.func.sum(SkymapTile.probdensity * (union.columns.hpx * SkymapTile.hpx).area)
>>> query = sa.select(
...     prob
... ).filter(
...     SkymapTile.id == 1,
...     union.columns.hpx.overlaps(SkymapTile.hpx)
... )
>>> result = session.execute(query).scalar_one()
>>> print(f'{result:.3g}')
0.837

What is the area of the 90% credible region?

>>> cum_area = sa.func.sum(
...     SkymapTile.hpx.area
... ).over(
...     order_by=SkymapTile.probdensity.desc()
... ).label(
...     'cum_area'
... )
>>> cum_prob = sa.func.sum(
...     SkymapTile.probdensity * SkymapTile.hpx.area
... ).over(
...     order_by=SkymapTile.probdensity.desc()
... ).label(
...     'cum_prob'
... )
>>> subquery = sa.select(
...     cum_area,
...     cum_prob
... ).filter(
...     SkymapTile.id == 1
... ).subquery()
>>> query = sa.select(
...     sa.func.max(subquery.columns.cum_area)
... ).filter(
...     subquery.columns.cum_prob <= 0.9
... )
>>> result = session.execute(query).scalar_one()
>>> print(f'{result:.3g} sr')
0.277 sr

Which galaxies are within the 90% credible region?

>>> cum_prob = sa.func.sum(
...     SkymapTile.probdensity * SkymapTile.hpx.area
... ).over(
...     order_by=SkymapTile.probdensity.desc()
... ).label(
...     'cum_prob'
... )
>>> subquery = sa.select(
...     SkymapTile.probdensity,
...     cum_prob
... ).filter(
...     SkymapTile.id == 1
... ).subquery()
>>> min_probdensity = sa.select(
...     sa.func.min(subquery.columns.probdensity)
... ).filter(
...     subquery.columns.cum_prob <= 0.9
... ).scalar_subquery()
>>> query = sa.select(
...     Galaxy.id
... ).filter(
...     SkymapTile.id == 1,
...     SkymapTile.hpx.contains(Galaxy.hpx),
...     SkymapTile.probdensity >= min_probdensity
... ).limit(
...     5
... )
>>> for galaxy_id, in session.execute(query):
...     print(galaxy_id)
2MASX J02424077-0000478
2MASX J02352772-0921216
2MASX J02273746-0109226
2MASX J02414523+0026354
2MASX J20095408-4822462