kapelner / bartMachine

An R-Java Bayesian Additive Regression Trees implementation
MIT License
61 stars 27 forks source link

Spatial structure in the (regression) residuals #52

Open nikosGeography opened 11 months ago

nikosGeography commented 11 months ago

First of all, thank you for creating this package, the documentation is very clear and easy to follow. I apologize in advance for the long post, but I'd like to make my problem as clear as possible.

I have a response variable which is the nighttime light luminosity (called ntl in the data.frame I am using). As you can see in the image below, it has some bright spots (highlighted in red) which are areas with high brightness (outliers in the data.frame).

ntl

Also, here is the histogram of the response. It's clear that the distribution is right-skewed (and possibly there is bimodality?). histogram

Purpose of my analysis My goal is to predict the ntl from the coarse spatial resolution to a higher. This means that, I need to maintain these bright spots (the outliers) in the predicted image. Because at a later stage, I will downscale the residuals of the regression using area-to-point Kriging, I need them to be random (no spatial structure).

Analysis Following the approach found in your paper, I created this code:

options(java.parameters = "-Xmx5g")
library("bartMachine")
set_bart_machine_num_cores(3)

# set working directory
wd <- "path/"

# Projected reference system (in order to convert the residuals into a raster image)
provoliko <- "EPSG:24313"

# original df
df <- read.csv(paste0(wd, 'block.data.csv'))

# extract the x and y columns (coordinates) from the df
crds <- df[, 1:2]

# here I keep only the necessary columns for my analysis
keep <- c("ntl", "pop", "agbh", "nir", "ebbi", "ndbi", "road", "pan", "nbai", 
          "tirs")
df <- df[keep]

x <- df[, 2:10]
y <- df[, 1]

bart_machine <- bartMachine(x, y)
bart_machine

The output of the default bartMachine model is:

> bart_machine
bartMachine v1.3.4.1 for regression

training data size: n = 5658 and p = 9 
built in 11.7 secs on 1 core, 50 trees, 250 burn-in and 1000 post. samples

sigsq est for y beforehand: 76.35 
avg sigsq estimate after burn-in: 40.8314 

in-sample statistics:
 L1 = 21551.53 
 L2 = 208688.47 
 rmse = 6.07 
 Pseudo-Rsq = 0.83
p-val for shapiro-wilk test of normality of residuals: 0 
p-val for zero-mean noise: 0.98127

Using the function plot_convergence_diagnostics(bart_machine), the result is: residuals_diagnostics

Again, for a "good" model I would like to see a symmetric scatter of points around the horizontal like at zero, indicating random deviations of predictions from the observed values, but from the plot above I see that this isn't the case.

Moreover, a map of the residuals is shown below. In red, I highlighted the areas were I believe the model couldn't model well (you can compare these areas to the areas in the first image above). bart_rsds As you can see, there is clearly a spatial structure (i.e., the residuals do not show a random pattern).

My question is, is there a way tell BART to consider the outliers (i.e., bright spots in the study area) as "more important" when modelling the NTL? What are your recommendations?

Because the csv I'm using has several thousands of rows, I can share it via a link, from here. Just so you know, running a model with default parameters takes less than 30 seconds on my laptop (8 gigs of RAM, 4 cores processor (CPU: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz)).

Lastly, I tried to use a model computed by the bartMachineCV function (as well as variable selection) but the results are not better.

kapelner commented 11 months ago

Cool project! Is there a reason you can't use the x, y coordinates (or r, rho coordinates) as regressors?

On Sat, Sep 30, 2023, 10:12 Nikolaos Tziokas @.***> wrote:

First of all, thank you for creating this package, the documentation is very clear and easy to follow. I apologize in advance for the long post, but I'd like to make my problem as clear as possible.

I have a response variable which is the nighttime light luminosity (called ntl in the data.frame I am using). As you can see in the image below, it has some bright spots (highlighted in red) which are areas with high brightness (outliers in the data.frame).

[image: ntl] https://user-images.githubusercontent.com/45749751/271774959-7cfc88c9-d753-4fd3-862c-b93f7f184ac5.png

Also, here is the histogram of the response. It's clear that the distribution is right-skewed (and possibly there is bimodality?). [image: histogram] https://user-images.githubusercontent.com/45749751/271775040-dbcc26e7-23c6-496b-8e38-6738a62fc56c.png

Purpose of my analysis My goal is to predict the ntl from the coarse spatial resolution to a higher. This means that, I need to maintain these bright spots (the outliers) in the predicted image. Because at a later stage, I will downscale the residuals of the regression using area-to-point Kriging, I need them to be random (no spatial structure).

Analysis Following the approach found in your paper https://www.jstatsoft.org/article/view/v070i04, I created this code:

options(java.parameters = "-Xmx5g") library("bartMachine") set_bart_machine_num_cores(3)

set working directory

wd <- "path/"

Projected reference system (in order to convert the residuals into a raster image)

provoliko <- "EPSG:24313"

original df

df <- read.csv(paste0(wd, 'block.data.csv'))

extract the x and y columns (coordinates) from the df

crds <- df[, 1:2]

here I keep only the necessary columns for my analysis

keep <- c("ntl", "pop", "agbh", "nir", "ebbi", "ndbi", "road", "pan", "nbai", "tirs") df <- df[keep]

x <- df[, 2:10] y <- df[, 1]

bart_machine <- bartMachine(x, y) bart_machine

The output of the default bartMachine model is:

bart_machine bartMachine v1.3.4.1 for regression

training data size: n = 5658 and p = 9 built in 11.7 secs on 1 core, 50 trees, 250 burn-in and 1000 post. samples

sigsq est for y beforehand: 76.35 avg sigsq estimate after burn-in: 40.8314

in-sample statistics: L1 = 21551.53 L2 = 208688.47 rmse = 6.07 Pseudo-Rsq = 0.83 p-val for shapiro-wilk test of normality of residuals: 0 p-val for zero-mean noise: 0.98127

Using the function plot_convergence_diagnostics(bart_machine), the result is: [image: residuals_diagnostics] https://user-images.githubusercontent.com/45749751/271776302-cf80f431-e820-48e2-9827-8912641d970f.png

Again, for a "good" model I would like to see a symmetric scatter of points around the horizontal like at zero, indicating random deviations of predictions from the observed values, but from the plot above I see that this isn't the case.

Moreover, a map of the residuals is shown below. In red, I highlighted the areas were I believe the model couldn't model well (you can compare these areas to the areas in the first image above). [image: bart_rsds] https://user-images.githubusercontent.com/45749751/271777072-2b645d5a-6b19-4a9a-90a2-1d8f80c6e610.png As you can see, there is clearly a spatial structure (i.e., the residuals do not show a random pattern).

My question is, is there a way tell BART to consider the outliers (i.e., bright spots in the study area) as "more important" when modelling the NTL? What are your recommendations?

Because the csv I'm using has several thousands of rows, I can share it via a link, from here https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=drive_link. Just so you know, running a model with default parameters takes less than 30 seconds on my laptop (8 gigs of RAM, 4 cores processor (CPU: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz)).

Lastly, I tried to use a model computed by the bartMachineCV function (as well as variable selection) but the results are not better.

— Reply to this email directly, view it on GitHub https://github.com/kapelner/bartMachine/issues/52, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFAV6G5RQTQQZL2S6HKUPDX5ASEVANCNFSM6AAAAAA5NSV7GA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

nikosGeography commented 11 months ago

I have seen people adding the coordinates in their regression (for example the work from Hengl et al., 2018) claiming that by doing this they account for the spatial dependency (which might be true). In my case, I've noticed that by using the x, y ccoordinates as covatiates the predicted NTL image is smoother compared to the predicted image without these columns (although the overall R2 of the regression is enhanced when using the coordinates).

kapelner commented 11 months ago

Would it be legal (given your prediction context) to use neighboring y values?

On Mon, Oct 2, 2023, 05:03 Nikolaos Tziokas @.***> wrote:

I have seen people adding the coordinates in their regression (for example the work from Hengl et al., 2018) claiming that by doing this they account for the spatial dependency (which might be true). In my case, I've noticed that by using the x, y ccoordinates as covatiates the predicted NTL image is smoother compared to the predicted image without these columns (although the overall R2 of the regression is enhanced when using the coordinates).

— Reply to this email directly, view it on GitHub https://github.com/kapelner/bartMachine/issues/52#issuecomment-1742650702, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFAV6FBVWQD7YV2PNVVF3TX5J7NJANCNFSM6AAAAAA5NSV7GA . You are receiving this because you commented.Message ID: @.***>

nikosGeography commented 11 months ago

Can you explain a bit more what do you mean?

kapelner commented 11 months ago

Use the ntl values from neighboring pixels as features.

On Mon, Oct 2, 2023, 07:29 Nikolaos Tziokas @.***> wrote:

Can you explain a bit more what do you mean?

— Reply to this email directly, view it on GitHub https://github.com/kapelner/bartMachine/issues/52#issuecomment-1742845703, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFAV6AYZBZI2ZAGXJ7WFXLX5KQP3ANCNFSM6AAAAAA5NSV7GA . You are receiving this because you commented.Message ID: @.***>

nikosGeography commented 11 months ago

Hmm.. if I understand so you correctly you mean to pass a focal function (e.g., average) to the NTL and the resulting filtered raster to use it as a predictor along with the other predictors?

kapelner commented 11 months ago

It sounds like it's worth a try.

On Mon, Oct 2, 2023, 08:28 Nikolaos Tziokas @.***> wrote:

Hmm.. if I understand so you correctly you mean to pass a focal function (e.g., average) to the NTL and the resulting filtered raster to use it as a predictor along with the other predictors?

— Reply to this email directly, view it on GitHub https://github.com/kapelner/bartMachine/issues/52#issuecomment-1742925758, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFAV6DBD5BRHHGF6IYLQETX5KXQXAVCNFSM6AAAAAA5NSV7GCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONBSHEZDKNZVHA . You are receiving this because you commented.Message ID: @.***>

kapelner commented 10 months ago

How did this work out for you?

nikosGeography commented 10 months ago

So I'm still working on this. The approach in working on at the moment is I selected a smaller area in a way that the bright spots pixels are not considerably less than the rest of the pixels. Then I construct a model using this summer area and then I make predictions using in the whole area. The results are slightly better compared to the ones I posted here but I'm still investigating it. Also I included two new covariates.

I haven't stopped working in this issue. Feel free to close it if you want to and I post the solution when I find it.