rmendels / rerddapXtracto

xtractomatic using rerddap

Lengthy extraction - correctly setup? Progress bar possibility? #20

Closed · SimonDedman closed this issue 4 years ago

SimonDedman commented 4 years ago

Hi Roy, hope you're well. I'm wondering if you might lend your thoughts to a few issues I'm having. I'm trying to extract chlorophyll for ~39,000 lat-lon points with the code below. This takes ages: the first subset (6,922 points) took a few hours and ended at roughly connection number 12,000, i.e. 1.9x the number of points, possibly due to automatic retries? The second took many hours; I went out for dinner and returned to find my machine had hung (probably unrelated, I think one of my RAM sticks is faulty). So question 1: is my approach of feeding rxtracto three vectors of x, y, t points inefficient for any reason? It works, but I'm wondering if I'm missing something that would run faster. It feels like requesting ~6,900 single values from a server should be quick, but maybe the cost is in requesting them one by one in individual HTTP calls?

Question 2: while the verbose parameter is nice, I'm wondering if this would be more useful if logged to a file? The info flies past so quickly it's functionally impossible to read. This would also allow for:

Question 3: would it be possible to add a progress bar? Typical code for this is

total <- 20
pb <- txtProgressBar(min = 0, max = total, style = 3) # create progress bar
for (i in 1:total) {
  setTxtProgressBar(pb, i) # update progress bar
  # your code
}
close(pb)

Question 4 / possible bug: I noticed that the STOP button in RStudio does nothing if pressed while an rxtracto call is running.

Thanks in advance for your thoughts. The code I'm using:

library(rerddapXtracto) # provides rxtracto()
urlbase <- "http://coastwatch.pfeg.noaa.gov/erddap/"
parameter <- 'chlorophyll'
xlen <- 0.1 
ylen <- 0.1
df_i$ChlA <- rep(NA, nrow(df_i)) # add NA chlA
df_i <- dplyr::arrange(df_i, Date) # order df_i by date
datespre <- which(df_i$Date >= "1997-09-02" & df_i$Date < "2003-01-05" & !is.na(df_i$lat) & !is.na(df_i$lon))
datespost <- which(df_i$Date >= "2003-01-05" & !is.na(df_i$lat) & !is.na(df_i$lon))
dataset <- 'erdSW2018chla8day' # 1997-09-02T00:00:00Z, 2010-12-15T00:00:00Z
dataInfo <- rerddap::info(dataset, url = urlbase)
rerddap::cache_delete_all(force = TRUE)
chl_pre <- rxtracto(dataInfo,
                    parameter = parameter,
                    xcoord = df_i[datespre,"lon"],
                    ycoord = df_i[datespre,"lat"],
                    tcoord = df_i[datespre,"Date"],
                    xlen = xlen,
                    ylen = ylen,
                    verbose = TRUE)
df_i[datespre,"ChlA"] <- chl_pre$`mean chlorophyll`
dataset <- 'erdMH1chla8day' # 2003-01-05T00:00:00Z, 2019-04-27T00:00:00Z
dataInfo <- rerddap::info(dataset, url = urlbase)
rerddap::cache_delete_all(force = TRUE)
chl_post <- rxtracto(dataInfo,
                     parameter = parameter,
                     xcoord = df_i[datespost,"lon"],
                     ycoord = df_i[datespost,"lat"],
                     tcoord = df_i[datespost,"Date"],
                     xlen = xlen,
                     ylen = ylen,
                     verbose = TRUE)
df_i[datespost,"ChlA"] <- chl_post$`mean chlorophyll`
rmendels commented 4 years ago

@SimonDedman Thanks for the feedback. I am on extended leave so it may be a while before I can get to this, but one thing that would help is if you could save, say, the first 100 entries of xcoord, ycoord and tcoord for each variable and send them to me, so I can get into a debugger and see what is happening. I would, however, question the use of the MH1 8-day chlorophyll dataset for your purposes. It is a NASA product, and unlike our 8-day composites, which are running means, their 8-day products are non-overlapping 8-day periods. I imagine a lot of time is being wasted in the code trying to figure out which time to use, and you are likely often requesting the same thing, though the code tries to catch that.
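
For example, a hypothetical way to package such a sample, reusing the df_i and datespre objects from the code above (the file name is arbitrary):

# save the first 100 coordinate triples from the pre-2003 extraction so they
# can be attached to the issue or sent by email
debug_sample <- head(df_i[datespre, c("lon", "lat", "Date")], 100)
saveRDS(debug_sample, "rxtracto_debug_sample.rds")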

I will look at the option of saving to a file (unlikely I will do this) and a progress bar (more likely). Saving to a file causes problems with CRAN: I have to get an explicit okay from the user before writing to the user's space, otherwise I can only write to the temporary directories using the standard R calls for this.

First, can I ask if you have the very latest version of rerddapXtracto? If unsure, update from CRAN. As for the hangups, our server gets hit pretty hard and our Internet speed stinks, so there are often timeouts on requests (which is why each request is tried multiple times), and often this comes from the user's side (their system will only keep a connection open for a given amount of time) rather than ours. We have found it is not uncommon, for a reasonably large number of requests, that people have to break it up, which is also why I return whatever has been gotten so far. I would turn off verbose = TRUE; it mainly helps to see whether the URLs are being formed correctly, but it slows things down a lot, and if the program times out and has to stop it returns the error message.
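
One way to break a long request up on the user's side (a rough sketch only, reusing the objects from the code earlier in the thread; the chunk size and checkpoint file name are arbitrary):

idx_list <- split(datespre, ceiling(seq_along(datespre) / 1000)) # chunks of ~1000 points
results <- vector("list", length(idx_list))
for (i in seq_along(idx_list)) {
  idx <- idx_list[[i]]
  results[[i]] <- rxtracto(dataInfo,
                           parameter = parameter,
                           xcoord = df_i[idx, "lon"],
                           ycoord = df_i[idx, "lat"],
                           tcoord = df_i[idx, "Date"],
                           xlen = xlen,
                           ylen = ylen)
  saveRDS(results, "chl_pre_partial.rds") # checkpoint after each chunk
}
df_i[datespre, "ChlA"] <- unlist(lapply(results, `[[`, "mean chlorophyll"))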

Also, I haven't done the calculation, but since xlen and ylen are not zero, each call may be downloading more data than you realize. For a very long track, if your data are in a relatively limited area, and if you know how to read netCDF files (or I can provide some help) and are comfortable with R, you might do better doing a single download of a netCDF file that covers your lat-lon-time bounds and then doing the extracts locally (or you may have to download several such files, if the requests are too big, and combine them locally). You can do such an extract using either rerddap or rxtracto_3D().
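
For example, a rough sketch of that one-big-download approach with rxtracto_3D(); the bounds here are just the range of the track, and for a very large region the request would likely have to be split into several smaller boxes as noted above:

# xcoord/ycoord/tcoord are two-element min/max bounds here, not point vectors
chl_block <- rxtracto_3D(dataInfo,
                         parameter = 'chlorophyll',
                         xcoord = range(df_i$lon, na.rm = TRUE),
                         ycoord = range(df_i$lat, na.rm = TRUE),
                         tcoord = c("2003-01-05", "2019-04-27"))
# chl_block then holds the data locally (named after the parameter, i.e.
# chl_block$chlorophyll) along with chl_block$longitude, chl_block$latitude and
# chl_block$time, so each track point can be matched to its nearest grid cell
# without any further server calls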

rmendels commented 4 years ago

@SimonDedman And I should add that I do nothing unusual in the code, so I can't answer as to why RStudio won't stop it. I find that happens a lot with other code I have in RStudio and I need to restart R; I do not run into the same problems with command-line R or, on the Mac, the built-in interface. You might set it up to run as a "Job" in RStudio and see if that lets you stop it.

SimonDedman commented 4 years ago

Thanks for the quick replies Roy. I'll try with verbose off and xlen/ylen = 0, and see how I get on. I believe I changed xlen and ylen to 0.1 after scrutinising the code, finding "0." as the default, and presuming it needed to be a positive length value. If 0 is acceptable/recommended for point extractions, I imagine that'll help things.
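
i.e. something like this for the pre-2003 chunk (a sketch of the adjusted call, same objects as my code above, just dropping verbose and the averaging box):

chl_pre <- rxtracto(dataInfo,
                    parameter = parameter,
                    xcoord = df_i[datespre, "lon"],
                    ycoord = df_i[datespre, "lat"],
                    tcoord = df_i[datespre, "Date"],
                    xlen = 0,
                    ylen = 0)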

Regarding my choice of dataset, I guess this is another example of the trouble I've had finding the 'right' datasets on ERDDAP. Cara's been very helpful, but the impression I'm left with is that selecting the right one is easy to do wrong, especially given how many datasets are in the catalogue. By 'our 8-day composites', I presume you mean NOAA NMFS ERD SWFSC, so I filtered my search by that but still get 58 results; for fish spanning the eastern Med to the western GoM, what would you suggest? I like the look of erdMH1chla1day but previously opted for 8-day since it's likely to have better coverage. Thanks again!

rmendels commented 4 years ago

Most people use xlen and ylen for one of two purposes. The first is that they feel there is error in the location, so it gives a box that covers the possible locations. The other is to get a certain amount of spatial smoothing. But yes, a value of zero should work; if not, I have a bug.
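
For a rough sense of scale (illustrative numbers, not from the thread; erdMH1chla8day is on roughly a 0.0417 degree, ~4 km grid):

grid_res <- 0.0417 # approximate grid spacing of erdMH1chla8day, in degrees
box_size <- 0.1    # the xlen/ylen used above
cells_per_side <- ceiling(box_size / grid_res) # about 3 cells per side
cells_per_side^2                               # roughly 9 grid cells per request instead of 1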

SimonDedman commented 4 years ago

Thanks Roy. It's running now with 0 for both, so hopefully the output will be correct. Another thing that occurred to me that might be useful for rxtracto, since I imagine I'm not the only user running slow extractions, would be a job-completion notification. I use the beepr package in all my scripts nowadays, either for error alerts, options(error = function() beep(9)), or on successful completion, beep(8). Notwithstanding that overuse of this makes your office sound like an annoying toddler trying to break a xylophone, it's a potentially useful option if you're so minded!
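
Laid out as code (beepr is on CRAN; the numbers just select different sounds):

library(beepr)
options(error = function() beep(9)) # play a sound if a long run errors out
# ... long-running extraction ...
beep(8)                             # and a different sound on successful completion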

rmendels commented 4 years ago

Will consider it; I welcome feature requests. It is a matter of when I have time to go in, do it, and test it. And speaking of toddlers (but not annoying ones), the reason I am on extended leave is to help take care of my granddaughter.

SimonDedman commented 4 years ago

No rush at all; enjoy your time off! FWIW, the small subset has just run successfully in 46 mins; about to start the large one, which should take ~3.5 hours, so I should have it all sorted tonight. Cheers squire!

rmendels commented 4 years ago

@SimonDedman Re: which dataset to use in an analysis. We view our job as not only making the data available but also helping people use it properly. Contact info for the CoastWatch node can be found at https://coastwatch.pfeg.noaa.gov/feedback.html. In particular, Guestbook entries are sent to, I believe, four of us.

SimonDedman commented 4 years ago

@rmendels Thanks Roy, good to know, and Cara's been very helpful when I've popped in to see her.

rmendels commented 4 years ago

@SimonDedman Cara is one of the people on the Guestbook list.