onnela-lab / gptools

Gaussian processes on graphs and lattices in Stan and pytorch.
https://gptools-stan.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Grid restrictions #25

Open andrewGhazi opened 1 week ago

andrewGhazi commented 1 week ago

Hi there, I'd like to make use of this package on a spatial transcriptomics dataset and wanted to ask whether it could work with my data. The two details I'm wondering about are:

  1. My data lie on a triangular grid. Will gp_inv_rfft and friends work on any regularly spaced pattern, or do they specifically require points that lie on the corners of tiled squares?
  2. The data fall in an irregular shape that also includes holes. Am I right to assume that I'll need to pad in those irregularities?

The plot of measured points below demonstrates both of these issues. I'd appreciate any help you could provide. I started to dig into the math because it feels like both of these problems should be solvable in theory, but I thought I'd ask since I imagine it will be immediately obvious to the authors if this could be made to work or not. Thanks!

[Figure: measured spots lying on a triangular (hexagonal) grid with an irregular boundary and holes]

tillahoffmann commented 1 week ago

Hey, thanks for getting in touch.

  1. The methods require a regularly spaced, rectangular grid, but a triangular grid can be accommodated by oversampling the space (I've attached an example below—apologies for the blurriness). This will of course "waste" some resources, but we've found it to run faster than a standard implementation because of the favorable scaling with sample size. What sort of grid sizes are you working with?
  2. Assuming you use a regular grid as in the illustration below, you won't need to explicitly take care of the holes. Say the Gaussian process is $z_{ij}$ at row $i$ and column $j$ in the grid. Then wherever you have data $y_{ij}$, you can include a sampling statement $y_{ij}\sim\mathsf{SomeLikelihood}\left(z_{ij}\right)$. If you don't have data for a location, all you need to do is omit the sampling statement (see the sketch after this list).

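To make the second point concrete, below is a minimal sketch of that pattern. The kernel function gp_periodic_exp_quad_cov_rfft2 comes up again later in this thread, but the gp_inv_rfft2 transform, the include paths, the argument orders, and all data names here are assumptions to check against the gptools documentation rather than a definitive implementation.

```stan
functions {
  // gptools' Fourier helpers are assumed to be available via includes, e.g.:
  // #include gptools/util.stan
  // #include gptools/fft.stan
}

data {
  int<lower=1> n_rows;                     // rows of the padded rectangular grid
  int<lower=1> n_cols;                     // columns of the padded rectangular grid
  int<lower=0> n_obs;                      // number of observed spots
  // Column-major linear indices of the grid cells that carry data.
  array[n_obs] int<lower=1, upper=n_rows * n_cols> obs_idx;
  array[n_obs] int<lower=0> y;             // counts at the observed spots
}

parameters {
  matrix[n_rows, n_cols] z;                // white noise on the full grid
  real<lower=0> sigma;                     // marginal scale of the GP
  real<lower=0> length_scale;              // correlation length in grid units
  real<lower=0> phi;                       // negative binomial concentration
}

transformed parameters {
  // Real FFT of the periodic squared exponential kernel; assumed argument order:
  // rows, columns, marginal scale, length scales, periods.
  matrix[n_rows, n_cols %/% 2 + 1] cov_rfft2 = gp_periodic_exp_quad_cov_rfft2(
      n_rows, n_cols, sigma, [length_scale, length_scale]', [n_rows, n_cols]');
  // Transform the white noise to a zero-mean GP realization on the full grid.
  matrix[n_rows, n_cols] f = gp_inv_rfft2(z, rep_matrix(0, n_rows, n_cols), cov_rfft2);
}

model {
  to_vector(z) ~ std_normal();
  length_scale ~ lognormal(0, 1);          // placeholder priors, adjust as needed
  sigma ~ lognormal(0, 1);
  phi ~ lognormal(0, 1);
  // Only observed cells enter the likelihood; padded cells and holes are omitted.
  y ~ neg_binomial_2_log(to_vector(f)[obs_idx], phi);
}
```
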
Another option to consider might be a nearest neighbor Gaussian process (see here for details). The Fourier methods are generally quite a bit faster though (see Figure 2 on page 13 here).

I saw you work for HMS. I'm just across the road at HSPH. Happy to meet up if you're around on campus.

[Figure: example of accommodating a triangular grid by oversampling the space with a rectangular grid]

andrewGhazi commented 1 week ago

Hi there,

Ah okay, the suggestion of fill-in points to convert it to a rectangular grid makes perfect sense. I take it then that rectangular (as opposed to perfectly square) grids are okay?

The number of spots in a sample can go up to 25k, though 95% have fewer than 5k (the sample shown has 1.4k). We have a large number (500+) of correlated, ~negative binomially distributed features for each spot. My hope is to model all of these features plus some other complicating factors using joint correlated GPs, though I'm unsure if this is computationally feasible. You can see why FFT methods are appealing here given the spots lie on a nice grid. Nearest neighbor GPs are on the table too, but my guess is that they're suboptimal because there can be long-range correlations in the data -- a given tissue stratum might span the whole width of the sample, which can create long bands of highly similar expression patterns across many spots. The covariance matrix might not be all that sparse.

I'll need a week or so to play around with this, but yes a meet up would be great. My office is in Countway so I think we're right next door. I'll send an email to your HSPH address to schedule.

tillahoffmann commented 1 week ago

> I take it then that rectangular (as opposed to perfectly square) grids are okay?

Yes, in practice there are two ways to deal with the non-square grid.

  1. Use a vector of scale parameters (vector<lower=0>[2] length_scales; in the parameters block) to account for the different effective length scales and then pass the vector to the evaluation of the kernel, e.g., gp_periodic_exp_quad_cov_rfft2. Note, however, that this approach allows the length scales along the two physical axes to differ because the parameters are independent.
  2. Use a scalar length scale, construct a length scale vector based on the known aspect ratio of the grid, and then proceed as above. This ensures that the kernel is isotropic in the physical space. E.g., something like this:
```stan
parameters {
  real<lower=0> length_scale;
}

transformed parameters {
  // The row-vector literal [.., ..]' is transposed to a column vector; the factor
  // of two rescales the length scale along the more densely spaced direction.
  vector<lower=0>[2] length_scales = [2 * length_scale, length_scale]';
}
```

The factor of two accounts for the oversampling in the horizontal direction. The grid is more densely spaced in this direction, so, if we "rescaled" things to get a square grid, the correlation length would have to be longer by a factor of two.

I think that's right, but I always get myself confused about where the factor of two should go. One way to double-check would be to sample from the GP prior and make sure the prior draws look as expected.
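
One way to run that check, sketched here under the same assumed gptools signatures as in the earlier snippet, is a standalone prior-simulation program with only a generated quantities block, run with Stan's fixed-param sampler:

```stan
functions {
  // gptools' Fourier helpers assumed included, as in the earlier sketch.
}

data {
  int<lower=1> n_rows;           // rows of the grid
  int<lower=1> n_cols;           // columns; assumed oversampled by a factor of two
  real<lower=0> length_scale;    // physical correlation length to check
}

generated quantities {
  matrix[n_rows, n_cols] z;
  // Factor of two on the oversampled axis, mirroring the snippet above.
  vector[2] length_scales = [2 * length_scale, length_scale]';
  matrix[n_rows, n_cols %/% 2 + 1] cov_rfft2;
  matrix[n_rows, n_cols] f;
  for (i in 1:n_rows) {
    for (j in 1:n_cols) {
      z[i, j] = std_normal_rng();   // fresh white noise each draw
    }
  }
  cov_rfft2 = gp_periodic_exp_quad_cov_rfft2(
      n_rows, n_cols, 1, length_scales, [n_rows, n_cols]');
  f = gp_inv_rfft2(z, rep_matrix(0, n_rows, n_cols), cov_rfft2);
}
```

Plotting a few draws of f in physical coordinates should show roughly isotropic blobs; if the factor of two ends up on the wrong side, the realizations will look visibly stretched or squashed.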

> The number of spots in a sample can go up to 25k [...] My hope is to model all of these features plus some other complicating factors using joint correlated GPs, though I'm unsure if this is computationally feasible.

Yes, that scale may be tricky. How many correlated GPs are you using and what's the correlation structure?

If Stan isn't able to handle the scale, variational inference using numpyro may be an alternative. The components for FFT GPs are there (see e.g., pyro-ppl/numpyro#1762), but there's a bit of legwork to be done. Stan also implements variational inference, but it's a bit brittle.

> I'll need a week or so to play around with this, but yes a meet up would be great. My office is in Countway so I think we're right next door. I'll send an email to your HSPH address to schedule.

Sounds great!