sntck / MODISTools

R package – retrieving & using MODIS data from NASA's ORNL DAAC archive

UpdateSubsets not working as expected #12

Closed ethanwhite closed 9 years ago

ethanwhite commented 9 years ago

I've been trying to use UpdateSubsets to help recover when DAAC hangs in the middle of a large download. Based on this part of the description:

returns a dataframe of those yet to be downloaded.

I expected that when some data had already been downloaded UpdateSubsets would return a shorter dataframe than that of the original data, which could then be passed to MODISSubsets to resume downloading the data. That doesn't seem to be the case for me. Here's a simple example:

> library(MODISTools)
> 
> coord_data <- data.frame(lat=(33:35), long=(-90:-88))
> coord_data$start.date <- rep(2000, nrow(coord_data))
> coord_data$end.date <- rep(2014, nrow(coord_data))
> 
> MODISSubsets(LoadDat = coord_data, Products = "MOD13Q1",
+              Bands = c("250m_16_days_NDVI"), Size = c(1,1),
+              SaveDir = ".")
Files downloaded will be written to /home/ethan/Dropbox/Research/Forecasting/BBSForecasting.
Found 3 unique time-series to download.
Getting subset for location 1 of 3...
Getting subset for location 2 of 3...
Getting subset for location 3 of 3...
Full subset download complete. Writing the subset download file...
Done! Check the subset download file for correct subset information and download messages.
> 
> coord_data <- data.frame(lat=(33:38), long=(-90:-85))
> coord_data$start.date <- rep(2000, nrow(coord_data))
> coord_data$end.date <- rep(2014, nrow(coord_data))
> unaquired_coord_data = UpdateSubsets(LoadDat = coord_data, StartDate = TRUE,
+                                      Dir = ".")
Found 6 unique time-series in original file
Found 3 subsets previously downloaded
> 
> MODISSubsets(LoadDat = unaquired_coord_data, Products = "MOD13Q1",
+              Bands = c("250m_16_days_NDVI"), Size = c(1,1),
+              SaveDir = ".")
Files downloaded will be written to /home/ethan/Dropbox/Research/Forecasting/BBSForecasting.
Found 6 unique time-series to download.
Getting subset for location 1 of 6...
Getting subset for location 2 of 6...
Getting subset for location 3 of 6...
Getting subset for location 4 of 6...
Getting subset for location 5 of 6...
Getting subset for location 6 of 6...
Full subset download complete. Writing the subset download file...
Done! Check the subset download file for correct subset information and download messages.

It appears that even though UpdateSubsets identified the 3 subsets that had been previously downloaded, running MODISSubsets on the output of UpdateSubsets results in downloading all 6 files.

coord_data and unaquired_coord_data contain the same sites; the only difference is the presence of the ID column.

> coord_data
  lat long start.date end.date
1  33  -90       2000     2014
2  34  -89       2000     2014
3  35  -88       2000     2014
4  36  -87       2000     2014
5  37  -86       2000     2014
6  38  -85       2000     2014
> unaquired_coord_data
  lat long start.date end.date                                      ID
1  33  -90       2000     2014 Lat33.00000Lon-90.00000Start2000End2014
2  34  -89       2000     2014 Lat34.00000Lon-89.00000Start2000End2014
3  35  -88       2000     2014 Lat35.00000Lon-88.00000Start2000End2014
4  36  -87       2000     2014 Lat36.00000Lon-87.00000Start2000End2014
5  37  -86       2000     2014 Lat37.00000Lon-86.00000Start2000End2014
6  38  -85       2000     2014 Lat38.00000Lon-85.00000Start2000End2014

I'm probably just misunderstanding something about how UpdateSubsets is supposed to work. Any help you can provide in pointing me in the right direction would be appreciated.

ethanwhite commented 9 years ago

After a little looking around I think the problem may be related to the IDs. The files that are initially being downloaded have names of the form:

Lat33.00000Lon-90.00000Start2014-01-01End2014-12-31___MOD13Q1.asc

It looks like all that's being done to generate the IDs is stripping off ___MOD13Q1.asc, in which case they don't match the IDs generated by UpdateSubsets, which are of the form:

Lat33.00000Lon-90.00000Start2000End2014

I think this means that when https://github.com/seantuck12/MODISTools/blob/master/R/UpdateSubsets.R#L57 is executed, the desired subsetting isn't happening.
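To make the mismatch concrete, here is a minimal base-R sketch using only the file name and the ID quoted above. It assumes, as guessed in this comment, that the ID is recovered by stripping the ___MOD13Q1.asc suffix; the variable names are illustrative and this is not the package's actual code.

```r
# File name written by the first MODISSubsets call (quoted above).
downloaded_file <- "Lat33.00000Lon-90.00000Start2014-01-01End2014-12-31___MOD13Q1.asc"

# ID generated by UpdateSubsets for the same site (quoted above).
update_id <- "Lat33.00000Lon-90.00000Start2000End2014"

# Assumed derivation: drop the product suffix to get an ID from the file name.
downloaded_id <- sub("___MOD13Q1\\.asc$", "", downloaded_file)

# The date parts are formatted differently, so an exact match fails.
id_matches <- update_id %in% downloaded_id

print(downloaded_id)
print(id_matches)
```

Because the match fails for every site, a filter that keeps rows whose ID is not among the downloaded IDs would keep all six rows, which is consistent with the output shown in the first comment.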

sntck commented 9 years ago

Hi Ethan, yes, as you've noticed, UpdateSubsets is not dealing with the presence or absence of subset IDs in a clever way, and that is causing this bug. I'm overhauling the code in a big way at the moment, and this will be one of the functions getting a makeover. In the meantime, I'll push a quick fix as soon as possible.

sntck commented 9 years ago

I've pushed a fix to the master repository. UpdateSubsets should now return a trimmed version of the input data.frame, with all subsets that have already been downloaded and saved in the specified directory removed, whether or not the subsets have ID names. Thanks for pointing out the bug; do let us know if you have any further problems.
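As a rough illustration of the behaviour described here (not the code that was actually pushed), the trimming can be sketched in base R by comparing only the coordinate prefix of each file name with the coordinates in the input. The helper trim_downloaded is hypothetical, and the five-decimal formatting is an assumption based on the file names quoted earlier in this thread.

```r
# Hypothetical helper: drop rows of `dat` whose coordinates already appear
# in a vector of downloaded file names.
trim_downloaded <- function(dat, files) {
  # Build the coordinate prefix in the format seen in this thread,
  # e.g. "Lat33.00000Lon-90.00000" (five decimals is an assumption).
  wanted <- sprintf("Lat%.5fLon%.5f", as.numeric(dat$lat), as.numeric(dat$long))
  # Keep only the coordinate part of each file name, ignoring dates and product.
  have <- sub("Start.*$", "", basename(files))
  dat[!(wanted %in% have), , drop = FALSE]
}

coord_data <- data.frame(lat = 33:38, long = -90:-85)
files <- c(
  "Lat33.00000Lon-90.00000Start2014-01-01End2014-12-31___MOD13Q1.asc",
  "Lat34.00000Lon-89.00000Start2014-01-01End2014-12-31___MOD13Q1.asc",
  "Lat35.00000Lon-88.00000Start2014-01-01End2014-12-31___MOD13Q1.asc"
)

# Only the three sites without a file on disk should remain.
remaining <- trim_downloaded(coord_data, files)
print(remaining)
```

Matching on coordinates alone sidesteps the date-format difference, but it also treats two requests for the same site with different date ranges as identical, so a real implementation would likely need to compare dates as well.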

sntck commented 9 years ago

A second reason why your example would not work would be that MODISSubsets downloaded data with StartDate = FALSE (default), whereas UpdateSubsets has StartDate = TRUE. We've since decided that having start dates as optional is a bad idea, as it is confusing and introduces the opportunity for users to unknowingly download the wrong time series. I'm in the process of deprecating this option from all functions so in future versions start dates will be compulsory.

ethanwhite commented 9 years ago

"I've pushed a fix to the master repository."

Thanks for the quick fix! It looks like it's working great.

"A second reason why your example would not work would be that MODISSubsets downloaded data with StartDate = FALSE (default), whereas UpdateSubsets has StartDate = TRUE."

Yeah, I figured that one out early this afternoon after hacking around the other issue. I agree that it's confusing, so I think the update is a good call. Thanks for coming back to point it out to me.