tcarleton / stagg

Spatiotemporal Aggregation for Climate Data
Other
15 stars 0 forks source link

Issue with staggregate_bin() #37

Open wallstein opened 2 weeks ago

wallstein commented 2 weeks ago

Hi,

First of all thanks for creating such a straightforward-to-use package! I have an issue with staggregate_bin() and was hoping you could help me solve it.

Description of the task: I have daily temperature data and want to split it into bins. For simplicity only two bins: below and above 30 degree celcius. I have polygons in the form of California Counties and want to aggregate to this spatial level, weighted by population (secondary_raster). The time is either kept at daily level or aggregated to monthly.

Problem: In the dataframe returned by staggregate_bin() I would assume that the bins sum to 1 for the daily level and the number of days in a month when aggregated to the monthly level. However, the bin variables ("bin_ninf_to_30" & "bin_30_to_inf") sum below 1 (or 28/30/31 depending on the month) when on the daily (monthly) level. I attached a picture from of the staggregate_bin() output. Do you have an idea what could cause this?

Troubleshooting: I already checked that there are no missing values in the rasters in the areas corresponding to the polygons. The weights also sum to one for each polygon in the overlay_weights, secondary_weights & overlay_secondary_weights.

Any help would be greatly appreciated :)

Best, Jonas

target_geometry <- counties(state = "CA", class = "sf") %>% st_transform(crs = 4326)
environmental_raster_imputed <- focal(environmental_raster, 31, "modal", na.policy="only", expand = T)
secondary_raster_imputed <- subst(secondary_raster_resampled, NA, 0)

# Compute the overlay weights for the target_geometry and raster grid
overlay_weights <- overlay_weights(polygons = target_geometry,       # Input raster data (temperature)
                                   polygon_id_col = "spatial_ID",  # Column in target_geometry representing unique ID
                                   grid = environmental_raster_imputed)     # Environmental raster (imputed)

# Compute secondary weights (population weights) by aligning the secondary raster with the temperature raster
secondary_weights <- secondary_weights(secondary_raster = secondary_raster_imputed, # Imputed secondary raster
                                      grid = environmental_raster_imputed,                      # Environmental raster to align with
                                      extent = "full")                            # Use the full extent of the Environmental raster

# Compute the overlay weights for the polygons and the raster grid, including the population weights
overlay_secondary_weights <- overlay_weights(polygons = target_geometry,                # Polygon data for California counties
                                              polygon_id_col = "spatial_ID",           # Column in target_geometry representing unique ID
                                              grid = environmental_raster_imputed,              # Temperature raster (imputed)
                                              secondary_weights = secondary_weights) # Use the secondary (population) weights

# Aggregate the data by creating a bin for temperatures above 30 degrees Celsius
binned_output_daily <- staggregate_bin(data = environmental_raster_imputed,           # Input raster data (temperature)
                                 overlay_weights = overlay_secondary_weights, # Population-weighted overlay weights
                                 time_interval = "1 day",         # Daily interval for binning
                                 daily_agg = "none",              # No daily aggregation, keep daily values
                                 time_agg = "day",                # Aggregate results for each day
                                 bin_breaks  = c(30))             # Bin for temperatures above 30°C

Screenshot 2024-11-05 at 5 17 22 PM

tliddell4 commented 2 weeks ago

Hi Jonas,

This is a great catch and will help us improve the package, though I'm very sorry you had to discover the issue and endure whatever headache came as a result.

For whatever reason, I mistakenly wrote the code such that the lower bound and upper bound are both exclusive. We are close to issuing an update to replace all of the raster functions in the package with terra packages, and will include this correction in that update.

If you would like, I can send you a corrected staggregate_bin() function (minus the terra changes) that should work until we can finish reviewing the full update. Thank you again for catching this and for providing such a clear explanation of the problem and things you tried - it made it very straightforward to address.

Best, Tyler

Edit: I just opened up github and saw the actual numbers. This may not be the issue unless you have a lot of temperature values that are exactly 30. Is this the case?

On Tue, Nov 5, 2024 at 5:21 PM Jonas Wallstein @.***> wrote:

Hi,

First of all thanks for creating such a straightforward-to-use package! I have an issue with staggregate_bin() and was hoping you could help me solve it.

Description of the task: I have daily temperature data and want to split it into bins. For simplicity only two bins: below and above 30 degree celcius. I have polygons in the form of California Counties and want to aggregate to this spatial level, weighted by population (secondary_raster). The time is either kept at daily level or aggregated to monthly.

Problem: In the dataframe returned by staggregate_bin() I would assume that the bins sum to 1 for the daily level and the number of days in a month when aggregated to the monthly level. However, the bin variables ("bin_ninf_to_30" & "bin_30_to_inf") sum below 1 (or 28/30/31 depending on the month) when on the daily (monthly) level. I attached a picture from of the staggregate_bin() output. Do you have an idea what could cause this?

Troubleshooting: I already checked that there are no missing values in the rasters in the areas corresponding to the polygons. The weights also sum to one for each polygon in the overlay_weights, secondary_weights & overlay_secondary_weights.

Any help would be greatly appreciated :)

Best, Jonas

target_geometry <- counties(state = "CA", class = "sf") %>% st_transform(crs = 4326) environmental_raster_imputed <- focal(environmental_raster, 31, "modal", na.policy="only", expand = T) secondary_raster_imputed <- subst(secondary_raster_resampled, NA, 0)

Compute the overlay weights for the target_geometry and raster grid

overlay_weights <- overlay_weights(polygons = target_geometry, # Input raster data (temperature) polygon_id_col = "spatial_ID", # Column in target_geometry representing unique ID grid = environmental_raster_imputed) # Environmental raster (imputed)

Compute secondary weights (population weights) by aligning the secondary raster with the temperature raster

secondary_weights <- secondary_weights(secondary_raster = secondary_raster_imputed, # Imputed secondary raster grid = environmental_raster_imputed, # Environmental raster to align with extent = "full") # Use the full extent of the Environmental raster

Compute the overlay weights for the polygons and the raster grid, including the population weights

overlay_secondary_weights <- overlay_weights(polygons = target_geometry, # Polygon data for California counties polygon_id_col = "spatial_ID", # Column in target_geometry representing unique ID grid = environmental_raster_imputed, # Temperature raster (imputed) secondary_weights = secondary_weights) # Use the secondary (population) weights

Aggregate the data by creating a bin for temperatures above 30 degrees Celsius

binned_output_daily <- staggregate_bin(data = environmental_raster_imputed, # Input raster data (temperature) overlay_weights = overlay_secondary_weights, # Population-weighted overlay weights time_interval = "1 day", # Daily interval for binning daily_agg = "none", # No daily aggregation, keep daily values time_agg = "day", # Aggregate results for each day bin_breaks = c(30)) # Bin for temperatures above 30°C

Screenshot.2024-11-05.at.5.17.22.PM.png (view on web) https://github.com/user-attachments/assets/fb9c8099-0d12-4644-9d6e-9b9d319b696c

— Reply to this email directly, view it on GitHub https://github.com/tcarleton/stagg/issues/37, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYKEE2BXHTFQOEPRTUPE4EDZ7FVKPAVCNFSM6AAAAABRHYGOU6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGYZTMOBRGY3DKMA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

wallstein commented 2 weeks ago

Hi Tyler,

Thanks for getting back to me so quickly! There are no values that are exactly 30. A few values are between 29.9 and 30.1 but that wouldn't impact the bounds, right? Do you have another idea what could cause this?

Best, Jonas

tliddell4 commented 2 weeks ago

Hi Jonas,

Yeah I believe we can rule out that issue. That the incorrect sum is always the same number leads me to believe this is related to weights, but I see you've already made sure those sum to one.

My other idea is that this could somehow be related to the fact that your environmental data is daily (though it's unclear to me how). We originally wrote the package for hourly data and only recently added the start_date and time_interval arguments for increased flexibility, so it's possible this had some unforeseen impacts. However, a quick check of some of the fractions near to .3792767 you'd expect to pop up if we were accidently excluding temporal layers (9/24, 12/31, 11/30, etc.) doesn't suggest this is the issue. If this decimal jumps out at you as significant in any way I'm not thinking of let me know.

Is your raster data publicly available and if so could you point me to where I can get it? If not, could you help me mimic its structure?

Thanks, Tyler

Edit: I've done some more digging, and I think I've figured it out, though I'd like to check before taking the time to write up an explanation and solution. If I'm correct, it's only this particular polygon that's summing to that number (and the others all sum to their own number consistently), and the raster you're working with has cells with a width (xres) that, when expressed as a decimal point, has many digits following the decimal (the data I've been working with has an xres of 0.04166675, for instance). Is this the case?

wallstein commented 2 weeks ago

Hi Tyler,

Yes you're absolutely right, the sum is consistent for a polygon but changes across polygons. I didn't notice that before, great observation! And yes the resolution is exactly such a number:

environmental_raster: class : SpatRaster dimensions : 231, 249, 457 (nrow, ncol, nlyr) resolution : 0.04166668, 0.04166668 (x, y)

Let me know if it's helpful for me to send my scripts & data. Also thanks again so much with your help on this!!

Best, Jonas

wallstein commented 2 weeks ago

Hi Tyler,

I thought catching the error might be easier with my code and data so I attached it. I wrote the code in R Markdown and included the HTML output for better readability. I left out some of the raw data as it was too big, but included all the processed data that I'm using for the aggregation. Let me know if you have any questions or if something is missing.

Thanks so much for your help so far, that was already very useful!

Best, Jonas

error_replication.zip