r-tidy-remote-sensing / tidyrgee

Create tidyverse methods for dealing with GEE image and imageCollections.

bandNames getting corrupted #37

Open zackarno opened 1 year ago

zackarno commented 1 year ago

A new issue seems to have popped up. I will look for a solution, then make a hotfix and test it.

library(rgee)
library(tidyrgee)

ee_Initialize()
#> -- rgee 1.1.2.9000 ---------------------------------- earthengine-api 0.1.325 -- 
#>  v user: not_defined
#>  v Initializing Google Earth Engine: v Initializing Google Earth Engine:  DONE!
#>  v Earth Engine account: users/zackarno 
#> --------------------------------------------------------------------------------

gldas <- ee$ImageCollection("NASA/GLDAS/V021/NOAH/G025/T3H")
sm <- gldas$select("SoilMoi0_10cm_inst")
sm_2004_2009_mean <- sm$filterDate("2004-01-01","2009-12-31")$mean()
sm_2004_2009_mean$bandNames()$getInfo()
#> [1] "SoilMoi0_10cm_inst"

sm_2004_2009_mean_tidy<- as_tidyee(sm) |> 
  filter(year %in% c(2004:2009)) |> 
  summarise(stat="mean") 

# here we get the error
sm_2004_2009_mean_tidy$ee_ob$bandNames()$getInfo()
#> Error in py_call_impl(callable, dots$args, dots$keywords): ee.ee_exception.EEException: User memory limit exceeded.

Created on 2022-09-29 by the reprex package (v2.0.1)

zackarno commented 1 year ago

OK, now I am confused, as the issue appears to occur earlier... will walk through it below.

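# sm_tidy is the tidyee version of the soil moisture collection from the reprex above
sm_tidy <- as_tidyee(sm)
debugonce(summarise)
sm_2004_2009_mean_tidy <- sm_tidy |>
  filter(year %in% c(2004:2009)) |>
  summarise(stat="mean")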

This opens debug mode; I then hit Enter to step into the summarise method. Before running any code, I check whether I can print the bandNames of the first image like this:

.data$ee_ob$first()$bandNames()$getInfo()
#> Error in py_call_impl(callable, dots$args, dots$keywords) : 
#>  ee.ee_exception.EEException: User memory limit exceeded.

@joshualerickson - can you reproduce this?

zackarno commented 1 year ago

Weirdly, it works on a different IC... below I do the same process, but with CHIRPS. There is a weird lag, though, when it's performed on the tidyee_ob$ee_ob compared to just the straight rgee ImageCollection... I don't know if this is related to the above issue. Perhaps it has something to do with the _stat being appended onto the bandName by {rgee}?

# 5 - try another IC -- chirps
chirps <- ee$ImageCollection("UCSB-CHG/CHIRPS/DAILY")
chirps <- chirps$select("precipitation")
chirps_2004_2009_mean <- chirps$filterDate("2004-01-01","2009-12-31")$mean()
chirps_2004_2009_mean$bandNames()$getInfo()
#> "precipitation"

# Here it works with CHIRPS, but it does seem to
# have a considerable lag compared to above
chirps_tidy <-  as_tidyee(chirps)

chirps_2004_2009_mean_tidy<-  chirps_tidy|>
  filter(year %in% c(2004:2009)) |>
  summarise(stat="mean")

chirps_2004_2009_mean_tidy$ee_ob$bandNames()$getInfo()
#>"precipitation_mean"

joshualerickson commented 1 year ago

Yeah I'm able to reproduce the bug above. Not sure why it doesn't like the tidyrgee way? But you might be onto something with the band name concatenation. Tried with roi = ee$Geometry$Point(-115.11353, 48.1380) and still the same thing...

zackarno commented 1 year ago

@joshualerickson - nice work pinning the issue on the filter by showing that the variation of the reprex below, with the filter step removed, works:

sm_2004_2009_mean_tidy<- as_tidyee(sm) |> 
  summarise(stat="mean")

Additionally, I think you are right that the problem stems from set_idx()... furthermore, I think it is this line in set_idx():

ic_list = x$toList(x$size())

After this line is run (inside set_idx) in the example above, the images in the list cause the memory limit exceeded error when queried with getInfo(), for example:

ee$Image(ic_list$get(0))$bandNames()$getInfo()
#> Error in py_call_impl(callable, dots$args, dots$keywords) : 
#>  ee.ee_exception.EEException: User memory limit exceeded.

A quick Google search turned up this related GIS Stack Exchange post: https://gis.stackexchange.com/questions/404226/getting-unique-data-gives-user-memory-limit-exceeded-using-google-earth-engine

I think if we can modify the function to map over the IC rather than the IC list, we should be good.

zackarno commented 1 year ago

Well, I rewrote the function with a workaround to avoid this call: ic_list = x$toList(x$size())

workaround inspired by: https://gis.stackexchange.com/questions/374137/add-an-incremential-number-to-each-feature-in-a-featurecollection-in-gee

set_idx.ee.imagecollection.ImageCollection <-  function(x,idx_name="tidyee_index"){
  x <- x$sort("system:time_start")

  incr_index = ee$List$sequence(0,x$size()$subtract(1))
  sys_index = ee$List(x$aggregate_array('system:index'))

  # create key-value dictionary
  incr_sys_dict = ee$Dictionary$fromLists(sys_index, incr_index)
  ic_with_idx = x$map(
    function(img){
      # can  use dictionary as lookup to iterate through images and add incremental value
      img$set(idx_name,incr_sys_dict$get(img$get("system:index")))
    }
  )
  return(ic_with_idx)
}

but it still gives the same memory limit error!
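
To make the lookup idea in the workaround above concrete, here is the same index-assignment logic on plain R vectors. This is only a local analogy: the IDs are made up, and a named vector stands in for ee$Dictionary; nothing here touches Earth Engine or says anything about the memory error.

```r
# Hypothetical system:index values, assumed already sorted by time
sys_index <- c("A20040101_0000", "A20040101_0300", "A20040101_0600")

# Incremental index 0..n-1, mirroring ee$List$sequence(0, size - 1)
incr_index <- seq(0, length(sys_index) - 1)

# A named vector plays the role of ee$Dictionary$fromLists(sys_index, incr_index)
incr_sys_dict <- setNames(incr_index, sys_index)

# Look up each image's index by its ID, as the mapped function does per image
tidyee_index <- unname(incr_sys_dict[sys_index])
tidyee_index
```

The point of the design is that each image only needs its own system:index to find its position, so the collection never has to be turned into a list with toList().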

zackarno commented 1 year ago

Wait... actually the above solution fixes it (I think). It does seem to cause long hang-ups on downstream getInfo() calls though, hmm. I can put the above into a new branch for further exploration/testing.

zackarno commented 1 year ago

So the above doesn't really seem to fix it consistently; usually it just freezes. One time I thought it worked, but now I'm doubting what I interpreted.

Additionally, while the error was occurring in filter()/set_idx(), I don't think it's confined there. Since GLDAS in the reprex above is huge, with 66323 global images, I experimented with reducing it by year and month to get a smaller IC, but was surprised to get the same error with no filter. Reprex here:

gldas <- ee$ImageCollection("NASA/GLDAS/V021/NOAH/G025/T3H")
gldas_tidy <- as_tidyee(gldas)
gldas_monthly<- gldas_tidy |> 
  select("SoilMoi0_10cm_inst",
         "SoilMoi10_40cm_inst",
         "SoilMoi40_100cm_inst",
         "SoilMoi100_200cm_inst",
         "RootMoist_inst",
         "SWE_inst") |> 
  group_by(year,month) |> 
  summarise(stat="mean")

img_check<- gldas_monthly$ee_ob$first()
# this should work,but idk
img_check$bandNames()$getInfo()
#>Error in py_call_impl(callable, dots$args, dots$keywords) : 
#> ee.ee_exception.EEException: User memory limit exceeded.

I looked through the functions in summarise (i.e. summarise_pixels, ee_year_month_composite) and there is no set_idx, so the memory limit is being hit somewhere else.

With debugonce I saw that the IC is in good shape going into ee_year_month_composite until we create the ic_summarised object. Interestingly, and hopefully insightfully, there is no issue with the above example if we group_by(month) or group_by(year) prior to summarising... it's just the combination.
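
For anyone who wants to compare the three groupings side by side, a small helper along these lines might make it easier. It is a sketch only: try_band_names is a hypothetical name (not part of rgee or tidyrgee), and it assumes both packages are attached and ee_Initialize() has been run. It repeats the first()$bandNames()$getInfo() check from the reprex and returns the error message instead of aborting.

```r
# Sketch only: assumes rgee and tidyrgee are attached and ee_Initialize() has run.
# `try_band_names` is a hypothetical helper, not part of rgee or tidyrgee.
try_band_names <- function(tidy_ic) {
  tryCatch(
    # same check as in the reprex: band names of the first image
    tidy_ic$ee_ob$first()$bandNames()$getInfo(),
    # return the error message instead of aborting, so runs can be compared
    error = function(e) conditionMessage(e)
  )
}

# Intended usage (not run here); gldas_sel stands for gldas_tidy after the select() above:
# try_band_names(gldas_sel |> group_by(month) |> summarise(stat = "mean"))
# try_band_names(gldas_sel |> group_by(year) |> summarise(stat = "mean"))
# try_band_names(gldas_sel |> group_by(year, month) |> summarise(stat = "mean"))
```

Based on what is described above, the first two calls would be expected to return band names and the third the memory limit message, but that has not been re-verified with this helper.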