ropensci / software-review

rOpenSci Software Peer Review.

MODISTools onboarding #246

Closed khufkens closed 5 years ago

khufkens commented 6 years ago

Summary

Programmatic interface to the 'MODIS Land Products Subsets' web services (https://modis.ornl.gov/data/modis_webservice.html). Allows for easy downloads of 'MODIS' time series directly to your R workspace or your computer.

Package: MODISTools
Title: Interface to the 'MODIS Land Products Subsets' Web Services
Version: 1.0.0
Authors@R: person("Hufkens","Koen",
          email="koen.hufkens@gmail.com",
          role=c("aut", "cre"),
          comment = c(ORCID = "0000-0002-5070-8109"))
Description: Programmatic interface to the 'MODIS Land Products Subsets'
    web services (<https://modis.ornl.gov/data/modis_webservice.html>).
    Allows for easy downloads of 'MODIS' time series directly to your R 
    workspace or your computer.
URL: https://github.com/khufkens/MODISTools
BugReports: https://github.com/khufkens/MODISTools/issues
Depends:
    R (>= 3.4)
Imports:
    httr,
    utils,
    tidyr,
    jsonlite
License: AGPL-3
LazyData: true
ByteCompile: true
RoxygenNote: 6.0.1
Suggests:
    knitr,
    rmarkdown,
    covr,
    testthat
VignetteBuilder: knitr

https://github.com/khufkens/MODISTools

MODIS data subsets have a wide range of applications in tracking state changes of the environment, which inform, among others, ecological and hydrological models. For an in-depth rationale I refer to Tuck et al. 2014 (https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.1273).

Not to my knowledge. It is the only community contribution listed at the bottom of the ORNL page. Although tiled data processing packages exist, this package has a particular focus on subsets of these tiles (from a single pixel to a maximum extent of a 200 x 200 km window), mostly for rapid model development and localized environmental assessment. Tiled data download packages focus on wall-to-wall remote sensing image coverage, often after using subsets for model development.

None made.

Requirements

Confirm each of the following by checking the box. This package:

Publication options

This package is a refactored version of the original MODISTools package by Tuck et al. 2014. In order to limit fragmentation I kept the name, although the functionality changed (due in part to changes in the backend API). To respect the original authors' contribution I would like to keep this reference in place. I also asked approval from both the first and second authors of the original package to do so. However, they did not want to be involved in new development due to time constraints.

Detail

Yes, but I might have missed something.

Please contact the previous package author for review:

@seantuck12

karthik commented 6 years ago

Thanks for this submission @khufkens! I will follow up shortly with next steps.

khufkens commented 6 years ago

No rush, I promised both @sckott and @seantuck12 to push this back to rOpenSci in June, but the summer got in the way.

karthik commented 6 years ago

Editor checks:


Editor comments

👋 @khufkens The package looks great and my preliminary checks with goodpractice don't raise anything significant. Your test coverage is already high, and you might consider fixing the other two suggestions while I line up reviewers.


Reviewer 1: @pmnatural Due date: September 28th, 2018 Reviewer 2: @etiennebr Due date: October 4th, 2018

── GP MODISTools ───────────────────── 
It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 91% of code lines are covered
    by test cases.

    R/mt_bands.R:39:NA
    R/mt_batch_subset.R:58:NA
    R/mt_batch_subset.R:66:NA
    R/mt_batch_subset.R:130:NA
    R/mt_dates.R:37:NA
    ... and 15 more lines

  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are used
    to it, so it is easier for them to read your code if you use '<-'.

    R/mt_batch_subset.R:72:10
    R/mt_dates.R:58:8
    R/mt_read.R:33:10
    R/mt_read.R:34:10
    R/mt_read.R:35:10
    ... and 28 more lines

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are
    error prone and result in 1:0 if the expression on the right hand side is zero. Use seq_len() or
    seq_along() instead.

    R/mt_subset.R:167:23

──────────────────────────────────────────────────────────
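
To illustrate the seq_len()/seq_along() point above, a minimal example of the zero-length failure mode (illustrative snippet, not package code):

x <- integer(0)  # an empty vector
1:length(x)      # returns 1 0 -- a loop over this runs twice on empty input
seq_along(x)     # returns integer(0) -- correctly iterates zero times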
karthik commented 6 years ago

I have reached out to 3 potential reviewers on Sep 4th. Will update thread when I hear back.

khufkens commented 6 years ago

I fixed most of the notes above, but haven't pushed them. Keep me posted on progress.

karthik commented 6 years ago

Thanks for agreeing to review @pmnatural 🙏 Please reach out if you have any questions.

karthik commented 6 years ago

Tagged @etiennebr as the second reviewer.

etiennebr commented 6 years ago

Thanks @karthik and @khufkens. I'm looking forward to reviewing the package! @khufkens, are you planning to push your updates before we review the package?

khufkens commented 6 years ago

I don't think I'll update the CRAN package before review; I'm not sure how this influences the process. I've mainly been focusing on addressing the issues above, interoperability, and docs. I still have to go through this: https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html to fix some issues (e.g. create an ornl_api function, etc.).

etiennebr commented 6 years ago

Ok, I won't worry about the CRAN version; are the fixes pushed to the GitHub master branch?

khufkens commented 6 years ago

Yes, everything is in the master branch. Larger additions or refactoring are done on a separate branch (but that's not applicable now).

etiennebr commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

skip

Functionality

Estimated hours spent reviewing: 10h (including reading all the material available to new reviewers)


Review Comments

Thanks @khufkens for your contribution. I believe the package can be useful to users looking for MODIS time series at specific locations, and it really simplifies the usage of the web API. I was surprised at first that it did not provide a raster object after download, but I understand how it can simplify the work of downloading data for multiple locations and dates. I find the code generally straightforward and easy to read, and the comments are often helpful and clear. In my opinion the two most important points to improve are the documentation of the MODISTools class and the unit tests, not only to reach a higher proportion of coverage but also to test the functions more extensively and make the package easier to maintain.

I have some miscellaneous comments and improvements to suggest; I think they could make the package even better by improving the user experience and facilitating maintenance.

library(dplyr)      # mutate(), select() and %>%
library(lubridate)  # duration()
library(MODISTools) # mt_products()

products <- mt_products()

products %>% 
  mutate(frequency = gsub("-", " ", frequency)) %>%
  mutate(frequency = gsub("Day", "day", frequency)) %>% 
  mutate(frequency = gsub("Daily", "1 day", frequency)) %>% 
  mutate(frequency = gsub("Yearly", "1 year", frequency)) %>% 
  mutate(freq = duration(frequency)) %>% 
  select(freq, frequency)

#>                      freq frequency
#> 1        86400s (~1 days)     1 day
#> 2    31557600s (~1 years)    1 year
#> 3   691200s (~1.14 weeks)     8 day
#> 4       345600s (~4 days)     4 day
#> 5   691200s (~1.14 weeks)     8 day
#> 6   691200s (~1.14 weeks)     8 day
#> 7  1382400s (~2.29 weeks)    16 day
#> 8   691200s (~1.14 weeks)     8 day
#> 9   691200s (~1.14 weeks)     8 day
#> 10  691200s (~1.14 weeks)     8 day
khufkens commented 5 years ago

Hi @etiennebr, thank you for reviewing the package and for your contributions to improving it.

I've been chipping away at implementing some of your comments, but I was wondering if I could get some more information on a few of your remarks.

In particular, I was wondering how to restructure the data. You mentioned that, indeed, I can reformat things into a tidy data frame (with a duplication cost for some data, but reducing the overall complexity of the data structure - a fair trade-off given that space is probably cheaper than CPU cycles).

However, this might be at odds with your request for a true bounding box using the ll-tr offsets as an sf polygon. The only other option in this context would be to provide a function which translates the location data in the tidy data frame into an sf polygon.

Does this sound like a good middle ground - balancing simplicity and compatibility?

etiennebr commented 5 years ago

You've probably been thinking about this problem longer than I have, so you might have a better solution or have run into issues I haven't seen with my proposal. Here's one way to do it.

(I'm starting from the vignette)

library(dplyr) # as_tibble(), mutate() and %>%
library(tidyr) # unnest(), used below

mt_tidy <- function(x) {
  as_tibble(x$header) %>%
    mutate(data = list(x$data))
}
mt_tidy(subset)
#> # A tibble: 1 x 16
#>   xllcorner yllcorner cellsize nrows ncols band  units scale latitude longitude site  product
#>   <chr>     <chr>     <chr>    <int> <int> <chr> <chr> <chr>    <dbl>     <dbl> <chr> <chr>  
#> 1 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> # ... with 4 more variables: start <chr>, end <chr>, complete <lgl>, data <list>

mt_tidy(subset) %>% unnest()
#> # A tibble: 20 x 22
#>    xllcorner yllcorner cellsize nrows ncols band  units scale latitude longitude site  product
#>    <chr>     <chr>     <chr>    <int> <int> <chr> <chr> <chr>    <dbl>     <dbl> <chr> <chr>  
#>  1 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  2 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  3 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  4 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  5 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  6 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  7 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  8 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#>  9 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 10 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 11 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 12 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 13 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 14 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 15 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 16 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 17 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 18 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 19 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> 20 -9370036… 4447802.… 926.625…     1     1 LST_… Kelv… 0.02        40      -110 test… MOD11A2
#> # ... with 10 more variables: start <chr>, end <chr>, complete <lgl>, modis_date <chr>,
#   calendar_date <chr>, band1 <chr>, tile <chr>, proc_date <chr>, pixel <chr>, data <int>

It doesn't prevent an sfc column from being created (and repeated on unnest). I'm not sure, though, if all the information is available (it would be much easier if the pixel size were in degrees). You can even add a second geometry column for the requested coordinates.

I hope this helps, but don't hesitate if that's not the case. As I said above, maybe I'm overlooking some limitations.

khufkens commented 5 years ago

Thanks @etiennebr for the code contribution!

Sorry for my slow response; I've a grant deadline coming up, and at night I have a connection which is basically enough for email but not much more.

From reading through your post (and code) I think this is a good solution. It makes things easier for sure, and indeed provides the sf mapping compatibility. I'm on the slow connection, so I'll try to give things a go tomorrow.

khufkens commented 5 years ago

Hi @etiennebr,

Below are the corrections I made to the package.

Major changes include:

Other comments are below, let me know how you feel about this implementation.


Making the frequency a duration object might make it easier to use in code, or at least make it easy to parse (remove dashes and specify 1 year rather than yearly). The same goes for the calendar_date of mt_dates() and the proc_date from mt_subset().

I've implemented the conversion of the standard output to a more readable format. I've not included the conversion to seconds, as this metadata is rarely used in analysis and is implicit when downloading the time series (which is always continuous; no individual dates can be queried).

I'm biased here, but support for sf objects would be a great addition and augment compatibility with other workflows (every time a lon-lat (or x, y) is required or returned, use an sf object).

This is indeed a valuable addition, if only for visualization purposes. I included the adapted code you supplied (thanks again for this). I do use a base R implementation in order to limit direct dependencies.

I would suggest that mt_batch_subset convert (x)yllcorner to at least lon and lat (or, even better, an sf polygon :) ), rather than the current offset lon and lat. It would make it easier for the user to map metadata or, for instance, load a list of sites from a shapefile and find MODIS data with an st_intersection.

See the above. This is implemented through two functions: one to convert the xll / yll corners to lat / lon, and subsequently a function to convert this coordinate, together with other metadata, to an sf polygon. This data is not tagged onto the original data frame, to limit complexity and preserve ease of use (saving data as a flat file to disk).
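
For reference, a minimal base R sketch of such a sinusoidal-to-geographic conversion; the function name is illustrative (not the package API) and it assumes the spherical MODIS sinusoidal projection:

sin_to_ll_sketch <- function(x, y) {
  r <- 6371007.181                    # sphere radius (m) used by the MODIS grid
  lat <- y / r * 180 / pi             # latitude follows directly from the northing
  lon <- x / (r * cos(lat * pi / 180)) * 180 / pi  # easting rescaled by cos(lat)
  data.frame(lon = lon, lat = lat)
}
sin_to_ll_sketch(-9370036, 4447802)   # header values above: ~ lon -110, lat 40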

Some typos in the vignette, e.g. "Below an examples are".

Fixed the typo. Thanks for spotting this.

I think that caching (using e.g. memoise::memoise) could speed up some functions and also be lighter on the API provider.

I implemented memoise() on the metadata queries (these remain largely static between or within subset calls). A significant decrease in load time should be seen, as well as decreased overhead on the server.
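
As an illustration, memoising a static metadata call is a one-liner (the wrapper name is just for the example):

library(memoise)
library(MODISTools)
mt_products_cached <- memoise(mt_products)
mt_products_cached()  # first call queries the ORNL server
mt_products_cached()  # repeat calls are served from the in-memory cache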

I wonder if the tidy list shouldn't rather be a tidy data frame (or tibble). It could be a nested data frame, but I think it might be hard for users to manipulate nested data frames. So maybe a data frame repeating the metadata would be a good option? It repeats some values, but I think it might offer the right balance of convenience. It might also remove the need for a MODISTools class, which I believe would increase compatibility and maintainability.

In the end I decided to do away with the nested structure. The output is now a plain data frame which can easily be saved as a flat file to disk.

Consequently, the mt_read() and mt_write() functions are removed, as a simple read.table() or write.table() will work. I also removed the MODISTools class from the data retrieved by an mt_subset() query.
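
For example, a plain base R round-trip now suffices (file name illustrative):

write.table(subset, "modis_subset.csv", sep = ",", row.names = FALSE)
subset <- read.table("modis_subset.csv", sep = ",", header = TRUE)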

For some functions, code coverage is relatively low and I think it would be a great improvement (I personally find that API packages are harder to test, anyway). And I'm sure it would help discover some bugs. For instance, I found one just by looking at the coverage results: using !length(site_id %in% sites$siteid) is always FALSE unless no site_id is specified (it doesn't really check if the site_id is in the list).

I fixed the length() issue and tried to cover as much code as possible. Sadly, server time-outs cannot be tested, so those paths remain uncovered.
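
For the record, a minimal demonstration of why the original check never fired (site names made up):

sites <- data.frame(siteid = c("oak_ridge", "harvard"))
site_id <- "nowhere"                # not a listed site
site_id %in% sites$siteid           # FALSE, a length-1 logical vector
length(site_id %in% sites$siteid)   # 1, regardless of membership
!length(site_id %in% sites$siteid)  # FALSE whenever a site_id is given
!(site_id %in% sites$siteid)        # TRUE: the membership test actually intended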

I would get rid of the try() methods in the tests. In my opinion tests should fail on error with an explicit message and present an explicit expectation. For instance, in test_ancillary_function.r, rather than chaining a series of conditions such as !inherits(subset,"try-error"), you could use expect_is(subset, "MODISTools"). There is also an expect_identical that could make the comparisons explicit and produce accurate error messages, which will improve maintainability and contributions.

I updated the testing routines, removing the try() statements.
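
As a sketch of the pattern (arguments mirror the MOD11A2 example shown earlier in this thread; they are assumptions, not the exact test code):

library(testthat)
subset <- mt_subset(product = "MOD11A2", band = "LST_Day_1km",
                    lat = 40, lon = -110,
                    start = "2004-01-01", end = "2004-02-01",
                    site_name = "testsite")
# state the expectation directly instead of try() + !inherits(..., "try-error"):
expect_is(subset, "data.frame")
expect_identical(unique(subset$product), "MOD11A2")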

I have not tested this, but I anticipate that the error messages following the usage of a try will generally be unhelpful. For instance, if a user has issues with an SSL certificate, they would receive "Your request timed out or the server is unreachable". The httr error might be more helpful, and in that case it might be better not to wrap the function in a try and simply let the error surface to the user. There are also ways to pass the error message on in a custom way, for instance by using tryCatch, or geterrmessage with try. I changed the way I handle errors after reading the tidyverse guide to error messages.

I let the errors surface, without custom error messages. I think that, depending on the circumstances, these messages can be good for advanced users but often confuse beginners.

It's the first time I've observed a systematic use of default NULL parameters in the function header. I think removing the default NULL for mandatory parameters might be more robust. Relying on missing() and letting R warn the user about missing variables might be lighter on maintenance.

I now use missing() throughout, and I'm slowly converting my other packages as well. Thanks for pointing me toward this function.
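
A sketch of the pattern, using a hypothetical function (not the package API):

mt_fetch <- function(lat, lon) {
  if (missing(lat) || missing(lon)) {
    stop("both 'lat' and 'lon' are required")
  }
  c(lat = lat, lon = lon)
}
mt_fetch(lon = -110)  # Error: both 'lat' and 'lon' are required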

It wasn't obvious to me what the core mt_batch_subset function did and what exactly it returned; for instance, what internal = FALSE did. I found I had to provide a data.frame to specify the locations (it would be good to know the expected column names, lat and lon). Maybe it would be convenient to extract the preparation of this data.frame from the mt_batch_subset function, to help the user decide exactly what to download? Also, I think that sf support might offer a consistent interface for providing locations to functions; it might however be overkill for many use cases.

This is mostly a documentation issue, I think. The input can be either a data frame or a CSV file using a particular structure (which indeed was not mentioned). I'll hold off on sf integration, as converting any sf (point) object to a basic ID, lat, lon data frame should be easy.
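
For instance, a minimal sketch of that conversion (column names are assumed to match what mt_batch_subset() expects):

library(sf)
pts <- st_as_sf(data.frame(site_name = c("site_a", "site_b"),
                           lon = c(-110, -111), lat = c(40, 41)),
                coords = c("lon", "lat"), crs = 4326)
xy <- st_coordinates(pts)
batch_df <- data.frame(site_name = pts$site_name,
                       lat = xy[, "Y"], lon = xy[, "X"])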

I found it hard to specify a series of specific (non-consecutive) dates (for instance when I had other imagery available), because the package always requires ranges of consecutive dates.

This is a limitation of the API on the ORNL side, not the package itself. The use case is also rare, and given the relatively small datasets, the overhead of extracting particular dates afterward is minor.

At the time of writing, there are still some good practices that seem to apply to the package (goodpractice::gp()).

This is covered.

etiennebr commented 5 years ago

I'm sorry for not getting back to you earlier on this. @karthik what are your expectations for the next steps? I have several engagements in the next few weeks, and I would like to devote enough time to appropriately review @khufkens's work. I think I can do it before the end of the year, if that sounds right to both of you.

karthik commented 5 years ago

@etiennebr Thanks for the thoughtful response. I think the next steps would be to see if @khufkens has adequately addressed the issues that you've raised. Otherwise you can point out the ones that remain unresolved. If it's easier, you can also open issues on his repo and tag ropensci/onboarding#246 and they will appear as a list here (with automatic updates on when those issues are closed).

karthik commented 5 years ago

@pmnatural Have you had a chance to complete your review yet? Please update on your progress.

khufkens commented 5 years ago

@etiennebr I've fixed both the cell size issue and the error statements (in both mt_dates and mt_subset). I also included references to the sinusoidal projection on both the LP DAAC site and Wikipedia; both explain the projection and mention the required parameters. The build is still clean, so all should be good.

Thanks for all the feedback!

etiennebr commented 5 years ago

Good job @khufkens! @karthik I would recommend approving this package; @khufkens addressed the issues I raised.

I think there is still room for a higher test coverage rate to help the package in the long run, and sf support could make things even easier for the user, but I see these elements as goals to keep in mind while evolving the package. I recommend approving the package, given @khufkens's revisions following the review and the fact that the package is useful and usable in its current form.

khufkens commented 5 years ago

Thanks for the vote of confidence @etiennebr. I acknowledge that sf support would be nice, but I also want to make sure that there is something for people to work with now. Since the CRAN (re)release the package has had some good traffic, so the demand for the tool is there.

New features can be added incrementally without breaking things. I think that with the feedback the package is also better structured, which makes all the difference moving forward.

karthik commented 5 years ago

Thank you for your review and recommendation @etiennebr!

I'll need a few more days to look everything over, but I will make a decision early next week and advise on next steps for @khufkens.

khufkens commented 5 years ago

Thanks. Take your time, no rush. The previous version is still up, and things have only gotten better.

pmnatural commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

khufkens commented 5 years ago

Hi @pmnatural,

Below my response and references to the code changes.

I think this package has to stress, both in the need statement and in the vignettes, its use for extracting point-based values from the 'MODIS Land Products Subsets' web services of ORNL using its API. It does a great job for someone not familiar with R GIS functionality and greatly eases the job of getting point time series of MODIS Land Products.

I'm not sure which exact version you reviewed, but both the vignette and the DESCRIPTION file now reference the ORNL API. Although the focus is on point-based extractions, the querying of regional data prevents me from explicitly stating that it is a "point based" approach (the upper limit is 200 x 200 km, a sizable footprint roughly covering Belgium or the state of Massachusetts).

The statement can be read here:

Clearly state that point value extraction is for point coordinates given as latitude and longitude, or for rectangular buffers defined by a width and height in integer kilometers, as testing shows the service does not allow for non-integer distances.

https://khufkens.github.io/MODISTools/reference/mt_subset.html

By default kilometers are now rounded to prevent errors on this case.

To increase usability, the help and vignettes should start by telling the user to first check which product is wanted and then the name of the required band, as bands do not follow the ordering of the regular HDF products (or the user simply does not know them).

This has now been implemented, highlighting the steps needed to gather all required information (see the sketch below).
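
A minimal sketch of that discovery workflow as now documented (product and coordinates are illustrative):

library(MODISTools)
products <- mt_products()                                     # 1. list available products
bands <- mt_bands(product = "MOD11A2")                        # 2. list bands for a chosen product
dates <- mt_dates(product = "MOD11A2", lat = 40, lon = -110)  # 3. list available dates at a location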

The tidy data.frame brings too much metadata, which is nearly all the same for every point, even the LL coordinates. Could this be output in a different object, without the need to define a class?

On the suggestion of a previous reviewer I switched from a nested list to a tidy data frame. Given the current data size restrictions and the generally small datasets, I don't see this as an issue. Tidy data allows for quickly plotting data across bands, locations, and pixels without looping over complex list structures. It also facilitates concatenating multiple files / bands without the need for nested lists. I would argue that convenience trumps data duplication in this case (as is often true of long rather than wide or nested formats). Some of the power of this tidy data, and the ease with which spatial or other subsets and aggregated analyses can be generated, is shown in the vignette.

More recent features also include functionality to convert the tidy data frame to raster stacks (sketched below). I refer to the vignette describing this feature and to the teaching material by my colleague Prof. dr. Jan Verbesselt in his time series analysis course. A shorter version is provided in my GitHub repo. Both are too detailed to serve as general examples.
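
A minimal sketch, assuming the mt_to_raster() helper described in the vignette (arguments mirror the thread's MOD11A2 example and are illustrative):

library(MODISTools)
subset <- mt_subset(product = "MOD11A2", band = "LST_Day_1km",
                    lat = 40, lon = -110, km_lr = 10, km_ab = 10,
                    start = "2004-01-01", end = "2004-02-01",
                    site_name = "testsite")
r <- mt_to_raster(df = subset)  # one raster layer per acquisition date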

The batch download could be greatly improved with a function to import coordinate locations from a CSV file or even sf objects, or at least examples of how to do it in the help.

Support for both data frames and CSV files is already in place, as mentioned in the documentation. An example is also included in the documentation and tested for in unit checks.

There is an example of a time series graph for a point location; it would be very useful to have other examples comparing data from different times, or batch downloads of rectangular areas representing different cover types, e.g. boxplots.

A new spatial analysis is provided in the vignette, centred on the Arcachon Bay in south-west France (a random example of a rather varied landscape in a small area).

I hope this addresses all the issues raised!

Kind regards, Koen

pmnatural commented 5 years ago

Final approval (post-review)

Estimated hours spent reviewing:

karthik commented 5 years ago

Congrats @khufkens , your submission has been approved! 🎉 Thank you for submitting and @etiennebr and @pmnatural for thorough and timely reviews. To-dos:

[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/technotes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

karthik commented 5 years ago

Please also add this badge to your README


[![rOpenSci Peer Review](https://badges.ropensci.org/246_status.svg)](https://github.com/ropensci/software-review/issues/246)
stefaniebutland commented 5 years ago

Congratulations @khufkens 🎉. We would indeed love to have a post about MODISTools.

Guidelines for submitting a blog post or tech note are here: https://github.com/ropensci/roweb2#contributing-a-blog-post. We currently have publication slots available June 11 or later.

Happy to answer any questions

etiennebr commented 5 years ago

Good job @khufkens! 🎉

khufkens commented 5 years ago

Thanks all for the feedback! I can do a short blog post, but for something more substantial I'll need some time. If time is no objection I can do this after June 4th.

I also get this error upon transferring the repo; I'm not sure how to proceed.

[Screenshot: GitHub error message, 2019-05-20]
stefaniebutland commented 5 years ago

I can do a short blog post, but for something more substantial I'll need some time.

I can be 100% flexible on the publication date, since what's most important is that you get what you want out of it. Short or substantial is your choice, based on the message you want to convey. Do you want to propose a date to submit a draft?

khufkens commented 5 years ago

Hi @stefaniebutland, can we peg it at mid-June then? This should give me some time to frame things properly, with some graphs etc.

karthik commented 5 years ago

Should be fixed now @khufkens

khufkens commented 5 years ago

Thanks!

khufkens commented 5 years ago

Hi @karthik, I was notified that the website link for the package is broken. I forgot to change the link to one on the rOpenSci account.

It currently reads (which leads nowhere): https://khufkens.github.io/MODISTools/

but should be https://ropensci.github.io/MODISTools/

Is there someone at your end who can update this so that people can access the examples, etc.?