openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
721 stars 38 forks source link

[REVIEW]: rmap: An R package to plot and compare tabular data on customizable maps across scenarios and time #4015

Closed whedon closed 2 years ago

whedon commented 2 years ago

Submitting author: !--author-handle-->@zarrarkhan<!--end-author-handle-- (Zarrar Khan) Repository: https://github.com/JGCRI/rmap Branch with paper.md (empty if default branch): Version: v1.0.5 Editor: !--editor-->@hugoledoux<!--end-editor-- Reviewers: @CamilleMorlighem, @maczokni Archive: 10.5281/zenodo.7085969

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/4cdf462f70681bc335ddebf5868b249c"><img src="https://joss.theoj.org/papers/4cdf462f70681bc335ddebf5868b249c/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/4cdf462f70681bc335ddebf5868b249c/status.svg)](https://joss.theoj.org/papers/4cdf462f70681bc335ddebf5868b249c)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@CamilleMorlighem & @maczokni, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @hugoledoux know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Review checklist for @CamilleMorlighem

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @maczokni

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 2 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @CamilleMorlighem, @maczokni it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 2 years ago

Wordcount for paper.md is 1074

whedon commented 2 years ago
Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.13 s (246.4 files/s, 106346.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
R                               19           1404           2413           7675
Rmd                              1            367            512            781
Markdown                         6             62              0            200
TeX                              1             18              0            189
YAML                             5             33              0            159
-------------------------------------------------------------------------------
SUM:                            32           1884           2925           9004
-------------------------------------------------------------------------------

Statistical information for the repository '8ee264c2b9dd4723350c4830' was
gathered on 2021/12/20.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Khan                             2           442            442          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s10113-021-01775-1 is OK
- 10.1088/1748-9326/ac046c is OK
- 10.21105/joss.00054 is OK
- 10.32614/RJ-2011-006 is OK
- 10.1007/b106573 is OK
- 10.5194/gmd-12-677-2019 is OK
- 10.5334/jors.292 is OK

MISSING DOIs

- 10.18637/jss.v084.i06 may be a valid DOI for title: tmap: Thematic Maps in R
- 10.32614/rj-2018-009 may be a valid DOI for title: Simple features for R: standardized support for spatial vector data.

INVALID DOIs

- 0.1029/2020EF001970 is INVALID
CamilleMorlighem commented 2 years ago

Hello ! I started the review. I have a few comments already:

  1. About the software paper, I think everything is there. I just noticed a spelling mistake line 17 in the summary (in the article proof) "rmap is desgined" and there are also the few missing/invalid DOIs already noticed by whedon in a previous comment.
  2. I can see in the user guide that the type of license is BSD_2_clause but it does not appear in the license file on the repository, maybe it should be specified there as well ?
  3. About the community guidelines, it is very clear in the user guide how third parties can contribute to the software and report issues or problems but I could not find information about seeking support.
zarrarkhan commented 2 years ago

Hi @CamilleMorlighem,

Thanks for the very useful feedback and comments. Here are the responses to your points:

  1. We fixed the typo on line 17 and also fixed the Missing and incorrect DOIs. The updated document can be found here: http://res.cloudinary.com/hkvhhndao/image/upload/v1640793026/zeldl1tq79mt9xnk82us.pdf or in the repo: https://github.com/JGCRI/rmap/tree/main/paper
  2. We updated the BSD_2_clause license file. Now when you click on the repo LICENSE file (https://github.com/JGCRI/rmap/blob/main/LICENSE) it should show the proper BSD_2_clause license.
  3. We added a support tab in the user guide and setup a discussions page for rmap in the main repo. If users do not want to submit an issue then they can start a discussion instead on the discussions page. We the developers can respond there as well as have community members respond there as well.
CamilleMorlighem commented 2 years ago

Hi @zarrarkhan,

Thank you for your responses !

I have tested the documentation and functionality of the package. The documentation is very clear and structured and very user friendly! I just have the following comments for improvement:

  1. The information link (https://confluence.pnnl.gov/confluence/display/JGCRI/GCAM+Shape+Files) in the help section of the maps mapGCAMBasins and mapGCAMReg32 is not working for me (I reach an error page).
  2. I could not find a statement of need in the user guide or in the documentation on the repository.
  3. I couldn't find either a list of dependencies, but when installing it proposed me to install/update required packages so maybe that's enough?
  4. I found some typos in the user guide:
    • In the section "Subset existing shape", I think you meant unique(shapeSubset@data$subRegion) and not unique(shapeSubset@data$states)
    • In section "Read data from a shapefile", I think you meant rgdal::readOGR(dsn=exampleFolder,layer=exampleFile,use_iconv=T,encoding='UTF-8') instead of rgdal::readOGR(dsn=examplePolyFolder,layer=examplePolyFile,use_iconv=T,encoding='UTF-8').
    • In subsection "Gridded Data Multi-Year Diff", both code snippets return NULL: mapx$map_param_xDiffAbs_2015_1990_KMEANS mapx$map_param_xDiffPrcnt_2015_1990_KMEANS
    • Same with the following code snippets in section "Gridded Data Multi-Scenario Diff":
      mapx$map_param_DiffAbs_scenario3_scenario1_KMEANS mapx$map_param_DiffPrcnt_scenario3_scenario1_KMEANS

About the functionality, everything worked fine and I don't have any performance claims. I only have the following comments:

  1. I did not get the effect of the argument scaleRangeDiffAbs (or scaleRangeDiffPrcnt) in the map function . I could not see what changed in the results when changing the values for these arguments or when simply removing them.
  2. I thought that one very useful function, that is not described in the documentation, is the map_find_df function to find a suitable map for your dataframe. I think it might be very useful when you don't know the package and want to use it for a specific purpose, so maybe it is worth highlighting this function in the user-guide?

Feel free to ignore any comment if you think that's not relevant!

whedon commented 2 years ago

:wave: @maczokni, please update us on how your review is going (this is an automated reminder).

whedon commented 2 years ago

:wave: @CamilleMorlighem, please update us on how your review is going (this is an automated reminder).

zarrarkhan commented 2 years ago

@CamilleMorlighem Thanks for the detailed review and very helpful comments. Addressed as follows:

Improvements

  1. All data sources referring to the restricted (https://confluence.pnnl.gov/confluence/display/JGCRI/GCAM+Shape+Files) have been updated to point to the publicly available zenodo link: https://zenodo.org/record/4688451#.YdMNTmjMJPY
  2. We added statement of need to the repo and user guide landing pages based on the text in the paper.
  3. Dependencies can be found in the https://github.com/JGCRI/rmap/blob/main/DESCRIPTION file and those mentioned in the imports list are required and will be automatically installed when users install the package so it shouldn't be something a user needs to worry about outside the package.
  4. Typos in user guide fixed (Thanks for catching these!):

Functionality

  1. Thanks for noting this. The code needed to be updated to implement the changes. I think with the updated changes (now that it is actually working) you can see the usefulness of this feature (i.e. highlighting different ranges for each type of map as needed). See here: https://jgcri.github.io/rmap/articles/vignette_map.html#scale-range
  2. Added section on map_find_df to user_guide here: https://jgcri.github.io/rmap/articles/vignette_map.html#map-find
maczokni commented 2 years ago

Hello! I've hit a bit of a stumbling block when trying to follow along with some data that isn't the one in the examples (I understand the idea to be that users can upload their own .csv files and this package will allow them to easily map this?)

First note: if the column is not called "subRegion" then the error Error in rmap::map_find_df(mydata) : data must have subRegion columns is immediately thrown. Instead of requiring users to rename their column, could this not be a parameter to specify in the function itself? E.g.: map_find_df(data = mydata, regions = subRegion). I don't think it's in line with the contribution to make things easy to ask people to rename their columns!

But where I'm stuck: once I try to map my data, it will not do this. I have renamed to values, and have checked and it is numeric. Here's my code

library(dplyr)
library(geodaData)
library(rmap)
# get own data to try
ncovr <- geodaData::ncovr

mydata <- ncovr %>% select(NAME, HR60)

map_chosen <- rmap::map_find_df(mydata)
# Error in rmap::map_find_df(mydata) : data must have subRegion columns.

mydata <- ncovr %>% select(NAME, HR60) %>% rename(subRegion = NAME)
map_chosen <- rmap::map_find_df(mydata)
rmap::map(map_chosen)
# Here the value column is not used, and no error is thrown, I just get a meaningless map

mydata <- ncovr %>% select(NAME, HR60) %>% rename(subRegion = NAME, value = HR60)
map_chosen <- rmap::map_find_df(mydata)
rmap::map(map_chosen)
#  OK it might not be an issue with naming, it just doesn't work...

rmap::map(map_chosen, fillColumn = value)
# Error in rmap::map_plot(color = color, lwd = lwd, legendType = legendType,  : 
#  object 'value' not found

# the example from the vignette works
data = data.frame(subRegion = c("CA","FL","ID","MO","TX","WY"),
                  value = c(5,10,15,34,2,7))
rmap::map(data)

# both the example and my own data's value columns contain numeric variables
class(mydata$value)
class(data$value)

# From the lack of error messages and the lack of further detail in the vignette, I cannot get it to work. 

Any thoughts?

zarrarkhan commented 2 years ago

Hi @maczokni.

Thanks for your review and comments. Please see our responses below:

# Update rmap with latest changes
devtools::install_github("JGCRI/rmap")
library(rmap)
  1. Required Column names (subRegion and Value): We have highlighted how users can identify their own column names in rmap using the subRegCol and valueCol arguments. (see end of this section: https://jgcri.github.io/rmap/articles/vignette_map.html#inputs)

  2. Example data: Thanks for trying rmap with your example data. So we updated a few things and have some comments on this:

    • ncovr <- geodaData::ncov gives an sf object with geometries. rmaps was intially unable to process this because it was not a pure R dataframe. Code has been updated to convert an sf object to a pure dataframe.
    • mydata <- ncovr %>% select(NAME, HR60) results in a dataframe with a list of counties and corresponding data. If you sort by NAME you will see multiple entries for the same subRegion. rmap (or any other mapping package for that matter) would not know how to deal with this repeated data.
      dplyr::select(NAME,HR60) %>%
      as.data.frame() %>%
      dplyr::arrange(NAME) %>%
      head(10)
  3. With the updates you can run your example as follows: Note that in general, county data is problematic because there are multiple counties with the same name in different states. In order to be more specific you can append your county names with an underscore and relevant State initials.

    
    library(dplyr)
    library(geodaData)
    library(devtools)
    library(rmap)

ncovr <- geodaData::ncovr

Subset first 10 rows to avoid repeated subRegions

mydata <- ncovr %>% dplyr::select(NAME,HR60) %>% head(10); mydata

Will give you the relevant plot but multiple counties

rmap::map(mydata, subRegCol = "NAME", # Set your subRegion column valueCol = "HR60") # Set your value column

In the following extended example you can see all the multiple counties labelled.

rmap appends the State Initials for you as it recognizes the counties. Although all the counties are given the same value (which is probably not the intention here).

You can also see some of the other features of rmap at play here

such as underLayer, zoom, labels etc.

rmap::map(mydata, subRegCol="NAME", valueCol="HR60", labels = T, labelSize = 3, labelRepel = T, underLayer = rmap::mapUS49, zoom=-2)



Please let us know if this helps clarify some of the issues you were having and if you have additional concerns/comments.
maczokni commented 2 years ago

Hi @zarrarkhan thanks for the update - sorry just getting back to this now but I have a major concern - you are right there are duplicates, as there are USA counties which share the same name. This means that the county names is not actually an appropriate column on which to join. This means that if I do want to join the full dataset (rather than the 10 subset you used) then I cannot do this with the package - is that right? And this is not just a quirk of the dataset I'm using to test, your own data (the inbuilt map: mapUS49County has duplicate values of the subRegion column):

# devtools::install_github("JGCRI/rmap")
library(rmap)
library(geodaData)
library(dplyr)
library(sf)

# get data for testing
data("ncovr")
# check any duplicate names in the county inbuilt map
thing <- rmap::mapUS49County
length(thing@data$subRegion)
length(unique(thing@data$subRegion))
as.data.frame(table(thing@data$subRegion)) %>% filter(Freq > 1)

Considering this is a primary function of the package, do you think you could take some more time to think about this problem? As I'm sure you are familiar, usually for each geography you will look for unique codes in the dataframes on which you can confidently join. Here in the UK we might use for example local authority codes or output area codes. In the US it seems to be state and county codes for example in this case, which is in the ncovr dataset as 'FIPS'.

I had a play with the data of the mapUS49County inbuilt map as an example, and by pasting the state and county code you can create a FIPS column to join.

# combine state code and county code for FIPS
thing <- rmap::mapUS49County
thing@data$FIPS <- paste0(thing@data$STATEFP, thing@data$COUNTYCODE)
joined <- left_join(ncovr, thing@data, by = c("FIPS" = "FIPS"))

# check how many would not be joined
nrow(joined %>% filter(is.na(source)))

# check these ones
not_joined <- joined %>% filter(is.na(source))

When we do that only 3 counties are not joined - i.e. only 3 would not be linked to the inbuilt map. You might want to consider printing a warning to users when this is the case, maybe with the name of the ones which did not join.

However this is only one case for one data set - I think it might be useful to think about some use cases of external data, how do these codes commonly look like, and have a few join columns on which users' data can be joined to the inbuilt geometries.

Hope this makes sense, let me know if not I can try to elaborate more!

zarrarkhan commented 2 years ago

@maczokni Thanks for your comments and notes. We have addressed your comments as follows:

Please let us know if this addresses some of your concerns or if you have additional comments.

hugoledoux commented 2 years ago

@CamilleMorlighem I see that the authors fixed your issues (statement of needs among others, see there: https://github.com/openjournals/joss-reviews/issues/4015#issuecomment-1004272488) but you didn't tick the boxes above. Can you let us know if you are now satisfied please?

CamilleMorlighem commented 2 years ago

@hugoledoux thank you for noticing it !

@zarrarkhan thank you for these improvements and answers to my comments. I think I'm done with the review. Rmap is a very nice package! I'm probably gonna use it in the future.

zarrarkhan commented 2 years ago

@hugoledoux Thanks for following up. Wondering if there are any additional comments we can address or get an update from the reviewers on the comments we addressed in https://github.com/openjournals/joss-reviews/issues/4015#issuecomment-1019419000.

hugoledoux commented 2 years ago

@maczokni could you check the updates from the authors (https://github.com/openjournals/joss-reviews/issues/4015#issuecomment-1019419000)? And tell us if you're satisfied?

maczokni commented 2 years ago

My apologies for the delay. At this moment, I'm afraid that I'm still not clear what the instructions then are for joining data. My understanding is that the USP of this package is that it provides an easy way to map things for someone who is not super familiar with GIS. This would require some work on the part of the package to have a set of values on which it could join geometries and do some searching in place of the sure. One commonly used for the USA for example is FIPS. The authors suggest "The datasets also contain additional columns such as FIPS and other details that were part of the original datasets to ensure unique entries." - however you cannot join using FIPS, or if you can I don't see this detailed on the documentation. If I try to specify the FIPS column as the "subRegCol", it does not work:

# devtools::install_github("JGCRI/rmap")
library(rmap)
library(geodaData)
library(dplyr)
library(sf)

# get data for testing
data("ncovr")

mydata <- ncovr %>% select(NAME, STATE_NAME, FIPS, HR60) %>% st_drop_geometry()

map_chosen <- rmap::map(mydata, subRegCol = 'FIPS', valueCol = 'HR60')
[1] "Starting map..."
[1] "Map data table written to /cloud/project/outputs//map_param.csv"
Error in get(paste(gsub("subReg", "map", gsub("Alt", "", subRegChosen)),  : 
  object 'mapGCAMLanddf' not found

The authors suggest: " Each of our datasets provides the source for the underlying data which can be compared to user data if needed. For example the county data is from "https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html"." and that "Some degree of data cleaning and checks is expected with user data."

For me to join the ncovr data (just one example) I have to get a lookup table from state name to state abbreviation, join it to my data set, then create a new column which has the same strings as the subRegCol expects, and then I can request a map, however I still run into an error:

state_lookup <- data.frame(state_name = c("Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","Delaware",
                                          "Florida","Georgia","Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana",
                                          "Maine","Maryland","Massachusetts", "Michigan","Minnesota","Mississippi","Missouri","Montana",
                                          "Nebraska","Nevada","New Hampshire", "New Jersey","New Mexico","New York","North Carolina",
                                          "North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania","Rhode Island","South Carolina","
                                          South Dakota","Tennessee","Texas","Utah","Vermont","Virginia","Washington","West Virginia",
                                          "Wisconsin","Wyoming"), 
                           state_code = c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY",
                                     "LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
                                          "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"))

mapthis <- dplyr::left_join(mydata, state_lookup, by = c("STATE_NAME" = "state_name"))

mapthis$subRegCol <- paste0(mapthis$NAME, "_",mapthis$state_code)

map_chosen <- rmap::map(mapthis, subRegCol = 'subRegCol')

# [1] "Starting map..."
# Coordinate system already present. Adding new coordinate system, which will replace the existing one.
# Error in `check_required_aesthetics()`:
#   ! geom_polygon requires the following missing aesthetics: x and y
# Run `rlang::last_error()` to see where the error occurred.
# There were 32 warnings (use warnings() to see them)
# Warning messages:
#   1: Unknown or uninitialised column: `lon`.
# 2: In min(datax1$lon) : no non-missing arguments to min; returning Inf
# 3: Unknown or uninitialised column: `lon`.
# 4: In min(x, na.rm = na.rm) : no non-missing arguments to min; returning Inf
# 5: In max(x, na.rm = na.rm) : no non-missing arguments to max; returning -Inf
# ...etc...

So this is where my sticking point is, that if the USP is that this is so much easier than getting your own shapefile, this process should be more smooth. Now maybe I'm missunderstanding, and in that case I'm happy to hear different, but at the moment, but if the advice is that the user has to go check out the source data anyway, I don't see why not to simply get the shapefile from the US Census bureau and use tmap to make a quick map.

EDIT

Actually the tigris package makes this even easier:

tg_counties <- tigris::counties()
mapthis <- dplyr::left_join(tg_counties, mydata, by = c("GEOID" = "FIPS"))
tmap::qtm(mapthis, fill = "HR60")

I understand there is further functionality, and I am sure there is more added value there, but at the moment just struggling to get to that point...! The solutions like selecting only 10 of the 3000+ counties required aren't really useful as they aren't realistic use cases.

Happy to be pointed out that I'm missing something super obvious, but just really would like to get a working example with a real world data set before I can sign off on this.

zarrarkhan commented 2 years ago

@maczokni

Thank you for your response. Here is a working example with your dataset after removing out-of-date FIPS as we discussed above. I tried your counter-examples and ran into some issues for which I have shared more details below.

Working Example with your data

data("ncovr")
mydata <- ncovr %>% select(NAME, STATE_NAME, FIPS, HR60) %>% st_drop_geometry() %>%
  dplyr::left_join(rmap::mapUS52County@data, by="FIPS") %>%
  dplyr::mutate(value=HR60) %>%
  dplyr::filter(!is.na(subRegion)) # Remove non-existent FIPS 51560 (Clifton Forge), FIPS 12025 (Dade)
rmap::map(data=mydata)

Missing FIPS in ncovr data You are using a specific dataset "ncovr" which has FIPS for counties that do not exist any more as we described in #4015: (Source: https://www.ddorn.net/data/FIPS_County_Code_Changes.pdf)

Your Counter Example 1

download.file("https://www2.census.gov/geo/tiger/GENZ2018/kml/cb_2018_us_county_20m.zip","counties.zip")
unzip("counties.zip")
counties_geojson <- sf::st_read("cb_2018_us_county_20m.kml")
mapthis <- dplyr::left_join(counties_geojson, mydata, by = c("GEOID" = "FIPS"))
tmap::qtm(mapthis, fill = "HR60")
Error: Join columns must be present in data.
x Problem with `GEOID`.

Your Counter Example 2

tg_counties <- tigris::counties()
mapthis <- dplyr::left_join(tg_counties, mydata, by = c("GEOID" = "FIPS"))
tmap::qtm(mapthis, fill = "HR60")

Updates made and additional comments

Working Example with your data after updates

# devtools::install_github("JGCRI/rmap") # To update to latest version
library(rmap)
library(geodaData)
library(dplyr)
library(sf)

data("ncovr")
mydata <- ncovr %>% select(NAME, STATE_NAME, FIPS, HR60) %>% st_drop_geometry() %>%
  dplyr::left_join(rmap::mapUS52County@data, by="FIPS") %>%
  dplyr::mutate(value=HR60)
rmap::map(data=mydata)

Please let us know if the example above is working for you and if you have any additional comments or issues.

maczokni commented 2 years ago

Hiya I'll give this a go later thanks but it's still just one example of the counties and it still requires the user to do data cleaning, so still not sure about the target audience being non GIS users.

However in the meantime I wanted to share two concerns with the code:

1 - the package functions producing character vectors are created using print() rather than messages using message()? It's going to be an issue for users because it defeats things like suppressMessages() and knitr chunk options when the package is used in a Rmarkdown document. It would be even better if you could use rlang::inform() because it has nice formatting for their long messages while still playing nicely with functions that expect base::message().

2 - you should consider adopting sf for the internal workings rather than sp. Roger Bivand (who wrote the sp package in the first place) recommends at https://doi.org/ghnwg3 that new packages use sf rather than sp and says that packages that continue to depend on SP will have to "hope" it continues to be maintained. I also note that rmap depends on rgdal and rgeos, both of which will be retired by the end of 2023 (as noted in bold type at the top of the NEWS.md pages of both packages) and neither of which are required if you use sf instead.

zarrarkhan commented 2 years ago

@maczokni

Thank you for the helpful comments and suggestions. We have addressed these in commit da9c41574820679fae9e0cb4871cf014edf2213c with responses below:

maczokni commented 2 years ago

I'm really not trying to be difficult, but I believe the package requires extensive testing anyway. I just tried again with a different data set to map Local Authority areas in London and this is the result:

map_param_KMEANS

Here's the code:

library(rmap)
library(tidyverse)
library(readxl)
library(janitor)

# read in new test data
download.file("https://data.london.gov.uk/download/gcse-results-by-borough/a8a71d73-cc48-4b30-9eb5-c5f605bc845c/gcse-results.xlsx", 
              destfile = "gcse-results.xlsx")
gcse_results <- read_xlsx("gcse-results.xlsx", sheet = "2020-21") 

# clean up test data
colnames <- paste0(gcse_results[1,], gcse_results[2,])
colnames <- gsub("NA", "", colnames)
names(gcse_results) <- colnames
gcse_results <- gcse_results %>%
  clean_names() %>% 
  slice(4:36) %>% 
  mutate(number_of_pupils_at_the_end_of_key_stage_4 = as.numeric(number_of_pupils_at_the_end_of_key_stage_4))

# try to map using Local Authority name
my_map <- rmap::map(gcse_results,
          subRegCol = "area",
          valueCol = "number_of_pupils_at_the_end_of_key_stage_4")

Again I know you can manually find a better fitting map, but then once again I'm not certain the contribution over finding a map manually in other ropensci packages like rnaturalearth and using tmap:

library(rnaturalearth)
library(tmap)

london_boroughs <- ne_states(country = "United Kingdom", returnclass = "sf") %>% 
  filter(region == "Greater London")

gcse_results_sf <- left_join(london_boroughs, gcse_results, 
                             by = c("name" = "area")) 

qtm(gcse_results_sf, fill = "number_of_pupils_at_the_end_of_key_stage_4")

which returns:

tmap_london

Or if I want the legend fixed:

qtm(gcse_results_sf, fill = "number_of_pupils_at_the_end_of_key_stage_4",  fill.title="Pupils at KS4") 

tmap_w_title

In sum: I think that there is some more extensitve and thorough testing required than something that should be done by reviewers. I'm just spot checking here and keep coming across issues - so at this point I would advise some more thorough testing of the package overall by the authors, before I would be happy to sign off on this.

zarrarkhan commented 2 years ago

@maczokni of course you are not being difficult and on the contrary we appreciate you taking the time and effort to give your detailed feedback. We will look at the issues you have pointed out and a more through test of the existing maps and get back to you.

zarrarkhan commented 2 years ago

@maczokni @hugoledoux

Thank you for your earlier comments. We have tried to address all of your comments and made the following changes. Please, pull in the latest changes from main to test these and let us know if you have any additional comments.

Load Libraries

library(rmap); library(geodaData); library(dplyr); library(sf); library(tidyverse)
library(readxl); library(janitor)

Example 1

data("ncovr")
mydata <- ncovr %>%
  dplyr::select(NAME, STATE_NAME, FIPS, HR60) %>%
  sf::st_drop_geometry() %>%
  dplyr::left_join(rmap::mapUS52County, by="FIPS")
rmap::map(data=mydata, valueCol = "HR60", legendTitle = "HR60")

map_param_KMEANS

Example 2

download.file("https://data.london.gov.uk/download/gcse-results-by-borough/a8a71d73-cc48-4b30-9eb5-c5f605bc845c/gcse-results.xlsx", destfile = "gcse-results.xlsx")
gcse_results <- readxl::read_xlsx("gcse-results.xlsx", sheet = "2020-21")

# clean up test data
colnames <- paste0(gcse_results[1,], gcse_results[2,])
colnames <- gsub("NA", "", colnames)
names(gcse_results) <- colnames
gcse_results <- gcse_results %>%
  clean_names() %>%
  slice(4:36) %>%
  mutate(number_of_pupils_at_the_end_of_key_stage_4 = as.numeric(number_of_pupils_at_the_end_of_key_stage_4))

# Plot
my_map <- rmap::map(gcse_results,
                    subRegCol = "area",
                    valueCol = "number_of_pupils_at_the_end_of_key_stage_4",
                    legendTitle = "Pupils at KS4")

map_param_KMEANS

Example 3

# Test Covid data
# Our World in Data JHU https://github.com/owid/covid-19-data/tree/master/public/data
# State vaccination data: https://github.com/owid/covid-19-data/raw/master/public/data/vaccinations/us_state_vaccinations.csv
covid_data <- read.csv(url("https://github.com/owid/covid-19-data/raw/master/public/data/vaccinations/us_state_vaccinations.csv"))%>%
  tibble::as_tibble() %>%
  dplyr::select(subRegion=location,date,value=people_vaccinated_per_hundred) %>%
  dplyr::mutate(subRegion = dplyr::if_else(subRegion=="New York State","New York",subRegion)) %>%
  dplyr::filter(date == max(date),
                subRegion %in% rmap::mapUS49$subRegionAlt); covid_data

rmap::map(covid_data,
          title=paste0("People Vaccinated per hundered ",max(covid_data$date)),
          legendTitle = "People per 100")

map_param_KMEANS

Examples from Documentation: https://jgcri.github.io/rmap/articles/vignette_map.html

zarrarkhan commented 2 years ago

Hi @hugoledoux,

Just wanted to check in to see if there were any updates since we submitted our modifications in https://github.com/openjournals/joss-reviews/issues/4015#issuecomment-1074078464 two weeks ago.

Thanks!

hugoledoux commented 2 years ago

@maczokni could you please have one look at the updated material here?

hugoledoux commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

zarrarkhan commented 2 years ago

@hugoledoux Thank you for generating the proof. We have read the proof and corrected a few typos we found.

In addition we see that the referencing style seems inconsistent but it is not clear how to adjust this. For example: In the article proof linked above on line 22 we have the following:

publications (Khan et al., 2021; Wild, Khan, Clarke, et al., 2021; Wild, Khan, Zhao, et al., 2021)

Which refer to the following references:

The first citation only references a single author while the rest list three authors.

Apart from this we have checked the proof and pushed all changes in commit: https://github.com/JGCRI/rmap/commit/6730a68880e6673584d67b45e66820ee1fb86d9a

Please let us know what we can do next.

Thanks!

zarrarkhan commented 2 years ago

Hi @hugoledoux, Just checking in to see if we can provide any more information or help move things ahead with the review. Thanks for your management of the process! Best Zarrar

hugoledoux commented 2 years ago

Sorry @zarrarkhan I am on holidays. I will be back in 2w.

I was hoping that @maczokni would finalise her review in the meantime, if we haven't heard from her then I'll see how to proceed.

Hope you understand.

zarrarkhan commented 2 years ago

@hugoledoux No worries. That sounds good. Enjoy the vacations. Best Zarrar

maczokni commented 2 years ago

Hi apologies for delay, I will try to get to this this week.

maczokni commented 2 years ago

Hi, I'm having trouble downloading the new version:

> devtools::install_github("JGCRI/rmap")

Downloading GitHub repo JGCRI/rmap@HEAD
Error in utils::download.file(url, path, method = method, quiet = quiet,  : 
  download from 'https://api.github.com/repos/JGCRI/rmap/tarball/HEAD' failed

I also tried with remotes instead of devtools but no luck. I can download other repos no problem with the same method (e.g. devtools::install_github("mpjashby/sfhotspot") downloads the package no problem).

Here's my session info:

R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13   magrittr_2.0.3    usethis_2.1.6     devtools_2.4.3    pkgload_1.2.4     R6_2.5.1          rlang_1.0.2      
 [8] fastmap_1.1.0     tools_4.2.0       pkgbuild_1.3.1    sessioninfo_1.2.2 cli_3.3.0         withr_2.5.0       ellipsis_0.3.2   
[15] remotes_2.4.2     rprojroot_2.0.3   lifecycle_1.0.1   crayon_1.5.1      brio_1.1.3        processx_3.5.3    purrr_0.3.4      
[22] callr_3.7.0       fs_1.5.2          ps_1.7.0          curl_4.3.2        testthat_3.1.4    memoise_2.0.1     glue_1.6.2       
[29] cachem_1.0.6      compiler_4.2.0    desc_1.4.1        prettyunits_1.1.1

I suspect it will be to do with the package size and therefore time it takes to download. If I download via browser the tarball from the URL, the file is 120.9 (!!!) MB. Please see here about size and limitations for packages on cran: https://thecoatlessprofessor.com/programming/r/size-and-limitations-of-packages-on-cran/

I am happy to continue the review having installed from the package archive file I've downloaded, but it might be something to consider to reduce the size (again referring to the cran guidance).

zarrarkhan commented 2 years ago

@maczokni Thanks for this comment. We are aware of the size issue and mention it in our paper. We also had a previous version with the data separated from the package to meet CRAN guidelines. However, since all the functions in rmap depend on having access to the built-in maps it would require the map sets to be downloaded each time. Having all download during the intial install made this a one-time effort. We realize that making this choice means we won't be publishing rmap to CRAN. From the paper:

"While built-in maps increase the size of the package, having direct access to these allows for automated searching and quick deployment of relevant shapefiles without the need to download any data."

The total package size has not changed size since you first reviewed the package so it is strange that you are getting this download error now. We will check further to investigate these restrictions and can re-consider separating the data from the main package.

Best

Zarrar

maczokni commented 2 years ago

Thanks for this! It might be that I initially downloaded at work with better internet access. I am not an expert in packaging data into packages, but is there not a solution to appeal to a database? The package sizes for rnaturalearth and rnaturalearthdata are both much smaller (<1mb and 3mb) and from what I understand perform roughly the same functionality of finding a relevant shapefile from a query? It might introduce a separate step I guess (e.g. find all which meet some higher-level category e.g. ask user to specify country first, get those relevant geometries, and then match on the relevant "subRegion" column?). Anyway like I said this is not my area of expertise, so I'm not sure what solution to suggest.

Another thing: when you summon the help file for the package it just links to the github repo and github pages. The link for the cheatsheet is broken (it says https://github.com/JGCRI/rmap/blob/main/rmapCheatsheet.pdf instead of https://jgcri.github.io/rmap/cheatsheet.pdf)

Overall, I remain unconvinced that this is easy for those who are new to GIS, or easier than existing solutions. I don't think (at least to me) it is clear enough from the documentation how to go about applying this to my own data, which I believe is meant to be the strength of this package. I say this after running into more issues, this time trying to evoke the automatic "Multi-Year-Class-Scenario-Facet". The example in the documentation produces a dataframe with repeat values in the subRegion column, but when I try with the gcse data we used above I get an error about this (presumably when the map is being summoned):

library(rmap)
library(tidyverse)
library(readxl)
library(janitor)
library(sf)

# read in new test data
download.file("https://data.london.gov.uk/download/gcse-results-by-borough/a8a71d73-cc48-4b30-9eb5-c5f605bc845c/gcse-results.xlsx", 
              destfile = "gcse-results.xlsx")
gcse_results <- read_xlsx("gcse-results.xlsx", sheet = "2020-21") 
gcse_results_2 <- read_xlsx("gcse-results.xlsx", sheet = "2019-20") 

# clean up test data
colnames <- paste0(gcse_results[1,], gcse_results[2,])
colnames <- gsub("NA", "", colnames)
names(gcse_results) <- colnames
gcse_results <- gcse_results %>%
  clean_names() %>% 
  slice(4:36) %>% 
  mutate(number_of_pupils_at_the_end_of_key_stage_4 = as.numeric(number_of_pupils_at_the_end_of_key_stage_4))

colnames <- paste0(gcse_results_2[1,], gcse_results_2[2,])
colnames <- gsub("NA", "", colnames)
names(gcse_results_2) <- colnames
gcse_results_2 <- gcse_results_2 %>%
  clean_names() %>% 
  slice(4:36) %>% 
  mutate(number_of_pupils_at_the_end_of_key_stage_4 = as.numeric(number_of_pupils_at_the_end_of_key_stage_4))

gcse_results_joined <- rbind(gcse_results %>% mutate(year = 2020), 
                          gcse_results_2 %>% mutate(year = 2019))

rmap::map(data = gcse_results_joined,
           subRegCol = "area",
           valueCol = "number_of_pupils_at_the_end_of_key_stage_4",
           ncol=4,
           background = T)

Starting map...
Map data table written to /Users/user/Desktop/teaching/crime_mapping_discussions/outputs//map_param.csv
Error in rmap::map(data = gcse_results_joined, subRegCol = "area", valueCol = "number_of_pupils_at_the_end_of_key_stage_4",  : 
  Input data data has multiple values. Please check your data.

I assumed we could get around this by extracting the map first, and using this. To do this try map_find() function. This function has not been changed like the rmap() function, and so still requires a renaming of the column to subRegion:

> map_find(gcse_results_joined, subRegCol = "area")
Error in map_find(gcse_results_joined, subRegCol = "area") : 
  unused argument (subRegCol = "area")
> map_find(gcse_results_joined)
Error in map_find(gcse_results_joined) : data must have subRegion columns.

But even after creating a "subRegion" column in my data, this map doesn't help me:

gcse_results_joined$subRegion <- gcse_results_joined$area

thing <- map_find(gcse_results_joined)

rmap::map(data = gcse_results_joined,
          underLayer = thing
          background = T)

Specifying the value column summons the same error:

> rmap::map(data = gcse_results_joined,
+           underLayer = thing,
+           background = T,
+           valueCol = "number_of_pupils_at_the_end_of_key_stage_4")
Starting map...
Map data table written to /Users/user/Desktop/teaching/crime_mapping_discussions/outputs//map_param.csv
Error in rmap::map(data = gcse_results_joined, underLayer = thing, background = T,  : 
  Input data data has multiple values. Please check your data.

Various specifications didn't help. In the end the only solution was to rename my data in the image of the data specified in the example:

data = data.frame(subRegion = gcse_results_joined$area,
                  year = gcse_results_joined$year,
                  value = gcse_results_joined$number_of_pupils_at_the_end_of_key_stage_4)

rmap::map(data = data,
          ncol=4,
          background = T)

So I think I'm misunderstanding the process here, and I would say I'm pretty comfortable with GIS and R, having published, and actively teaching in this space, but feel quite lost following the documentation with my own data. It is possible that to someone from a different background these processes are more intuitive, perhaps there is a need for this in the domain of the authors, where people have a certain approach/ understanding which is required to use this package. In that case I think this should be specified in the documentation.

Basically the behaviour I would expect would be something like:


rmap::map(data = gcse_results_joined,
          subRegion = area,
          year = year,
          value = number_of_pupils_at_the_end_of_key_stage_4,
          ncol=4,
          background = T)

My comment once again is that if the audience is people who have no GIS knowledge, and for some reason don't want to use rnaturalearth+tmap solutions to map their data, then:

1 - all functions should allow to specify where the relevant columns are in the own data (e.g. subRegion = "col", etc) 2 - the documentation should be improved with some clear instruction of how to apply the functions to an external data sets.

zarrarkhan commented 2 years ago

Thanks for the detailed comments. Please see responses below:

-Package Size: We did consider (and tested) hosting all the maps in a separate data repository. However this required the maps to be downloaded each time a function was called and for now we decided to provide all maps with the initial installation. rnaturalearth comes with some pre-loaded maps (ne_countries, ne_states) but does also use ne_download for other maps and resolutions (https://www.rdocumentation.org/packages/rnaturalearth/versions/0.1.0/topics/ne_download)

-Help File and Links: We updated the help file with a short package description and also checked that all the links work.

  1. Github: https://github.com/JGCRI/rmap
  2. Webpage: https://jgcri.github.io/rmap/
  3. Cheatsheet: https://jgcri.github.io/rmap/cheatsheet.pdf

-Your Example: Your example ran into a case which we had not come across in our tests yet i.e. both an x column and a year column present in the data. rmap plots the time dimension using an x column and if the data comes with a year column it assumes this as the x column. In your case, you had both an x column and a year column and so the x column was used. Since the x column was all NAs it lead to the duplicated rows message. We have updated the package so this situation is handled with an appropriate message to inform the user. We also updated the script so the user can specify the key arguments directly and renamed them to be more intuitive as you suggested (directly using subRegion, value, x, instead of subRegCol, valueCol). We use x in case the time dimension is something other than year, such as week or month etc. This should work now:

library(rmap)
library(tidyverse)
library(readxl)
library(janitor)
library(sf)

# read in new test data
download.file("https://data.london.gov.uk/download/gcse-results-by-borough/a8a71d73-cc48-4b30-9eb5-c5f605bc845c/gcse-results.xlsx", 
              destfile = "gcse-results.xlsx")
gcse_results <- read_xlsx("gcse-results.xlsx", sheet = "2020-21") 
gcse_results_2 <- read_xlsx("gcse-results.xlsx", sheet = "2019-20") 

# clean up test data
colnames <- paste0(gcse_results[1,], gcse_results[2,])
colnames <- gsub("NA", "", colnames)
names(gcse_results) <- colnames
gcse_results <- gcse_results %>%
  clean_names() %>% 
  slice(4:36) %>% 
  mutate(number_of_pupils_at_the_end_of_key_stage_4 = as.numeric(number_of_pupils_at_the_end_of_key_stage_4))

colnames <- paste0(gcse_results_2[1,], gcse_results_2[2,])
colnames <- gsub("NA", "", colnames)
names(gcse_results_2) <- colnames
gcse_results_2 <- gcse_results_2 %>%
  clean_names() %>% 
  slice(4:36) %>% 
  mutate(number_of_pupils_at_the_end_of_key_stage_4 = as.numeric(number_of_pupils_at_the_end_of_key_stage_4))

gcse_results_joined <- rbind(gcse_results %>% mutate(year = 2020), 
                          gcse_results_2 %>% mutate(year = 2019))

rmap::map(data = gcse_results_joined,
           subRegion = "area",
           value = "number_of_pupils_at_the_end_of_key_stage_4")

-All functions should allow custom columns: We have adjusted the script so that users can specify their own columns for all the key arguments (value, subRegion, scenario, class, x) for all functions. We add a section in the documentation on this: https://jgcri.github.io/rmap/articles/vignette_map.html#customcolumns

-Advantages of using rmap: So the true utility of rmap comes in comparing data across scenarios, classes and time. The examples below using your data demonstrate some of these capabilities which would not be so simple using existing software:

Multiple ggplots based on data provided saved as a list which can be edited later

rmap::map(data = gcse_results_joined,
           subRegion = "area",
           value = "number_of_pupils_at_the_end_of_key_stage_4") -> mapx

mapx$map_param_KMEANS
mapx$map_param_MEAN_KMEANS
mapx$map_param_MEAN_KMEANS + ggplot2::theme_dark()

map_param_KMEANS map_param_MEAN_KMEANS map_param_MEAN_KMEANS_dark

Compare across time

rmap::map(data = gcse_results_joined,
           subRegion = "area",
           value = "number_of_pupils_at_the_end_of_key_stage_4",
           xRef = 2019,
           save =F, show =F) -> mapx

mapx$map_param_KMEANS_xDiffPrcnt

See the percentage change from 2019 to 2020

map_param_KMEANS_xDiffPrcnt

Compare Different Classes (In this case hypothetical genders)

gcse_results_joined_classes <- gcse_results_joined %>%
  dplyr::mutate(gender = "girls") %>%
  dplyr::bind_rows(gcse_results_joined %>%
                     dplyr::mutate(gender="boys",
                                   number_of_pupils_at_the_end_of_key_stage_4 =
                                     number_of_pupils_at_the_end_of_key_stage_4*runif(nrow(gcse_results_joined))))

rmap::map(data = gcse_results_joined_classes,
               class = "gender",
               subRegion = "area",
               xRef = 2019,
               value = "number_of_pupils_at_the_end_of_key_stage_4", save=F, show=F) -> mapx

mapx$map_param_KMEANS
mapx$map_param_KMEANS_xDiffPrcnt

map_param_KMEANS

See how the value changed for each gender between 2019 to 2020

map_param_KMEANS_xDiffPrcnt

Compare Scenarios and classes (In this case hypothetical income_classes)

gcse_results_joined_scenario <- gcse_results_joined %>%
  dplyr::mutate(income_level = "high_income") %>%
  dplyr::bind_rows(gcse_results_joined %>%
                     dplyr::mutate(income_level="low_income",
                                   number_of_pupils_at_the_end_of_key_stage_4 =
                                     number_of_pupils_at_the_end_of_key_stage_4*runif(nrow(gcse_results_joined))))

rmap::map(data = gcse_results_joined_scenario,
          scenario = "income_level",
          subRegion = "area",
          xRef = 2019,
          scenRef = "low_income",
          value = "number_of_pupils_at_the_end_of_key_stage_4",
          save=F,show=F) -> mapx

mapx$map_param_KMEANS
mapx$map_param_KMEANS_DiffPrcnt

map_param_KMEANS

See how the value changed for each gender and income class between 2019 and 2020

map_param_KMEANS_xDiffAbs

zarrarkhan commented 2 years ago

@hugoledoux @maczokni Hi, Just wanted to check in on any further feedback related to our response (https://github.com/openjournals/joss-reviews/issues/4015#issuecomment-1163520937) submitted 22 June. Thanks!

maczokni commented 2 years ago

Hi All,

I read your response about the main contribution, but now I wonder if the paper needs updating then? You state in the comment above "So the true utility of rmap comes in comparing data across scenarios, classes and time" That is only 1 out of 4 points raised in the original paper. Would it be worth to focus the package and paper to this contribution?

If you must keep the table to map part, for the point "State of the field: Do the authors describe how this software compares to other commonly-used packages?" I would like to see discussion v rnaturalearth and other approaches to detail in what way this compares. I would like to be convinced in the paper that there is an added value here.

My other sticking point was "Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines" specifically the point: "enables some new research challenges to be addressed or makes addressing research challenges significantly better (e.g., faster, easier, simpler)." The data table to map being faster/easier/simpler was something I was not convinced about, but with the long back and forth we are at a point where I no longer want to be carrying out extensive testing on behalf of the authors. On the other hand if the key contribution is now this ease for the comparison around the different scenarios, this should be emphasised more.

At time of submission this package was not robustly tested enough. I've given a hefty amount of my time to testing it and making recommendations which I think were quite substantial. At this point I'm not willing to give more time to keep finding new test cases, but would like to be convinced that the authors themselves had gone through the trouble of doing this. Every time it fell over for me was a case where this could/should have been flagged by extensive internal testing. I would really like the authors to show they have done this themselves, instead of relying on the reviewers (actually I'm not sure the other reviewer did test this? so reviewer) to do so.

So to sum up: on the two outstanding points above in the paper (1 - State of the field, and 2 - Substantial scholarly effort) - if the authors make a case in updates to the paper that argue for these points, and demonstrate some significant testing, I will then tick these boxes and pass back to the editor @hugoledoux whether he is happy for it to go ahead.

hugoledoux commented 2 years ago

thanks @maczokni for the update, this is clear and I agree with what you wrote.

Reviewers are expected indeed to test the package, but reviewers are indeed not debuggers. That being said, there are often bugs in software and one can't expect all software to be perfect.

@CamilleMorlighem a point was raised about testing, and since you did the review a while ago (before changes to the package were made) could you do them again perhaps?

zarrarkhan commented 2 years ago

@hugoledoux @maczokni

Thank you for your latest comments and the time you put into your thorough review. Your comments are addressed below and in the latest version of the paper draft: https://res.cloudinary.com/hkvhhndao/image/upload/v1657942383/ew2vgphg2dfqi6qy5n0t.pdf which can also be seen at: https://github.com/JGCRI/rmap/blob/main/paper/paper.md


Comment 1: If you must keep the table to map part, for the point "State of the field: Do the authors describe how this software compares to other commonly-used packages?" I would like to see discussion v rnaturalearth and other approaches to detail in what way this compares. I would like to be convinced in the paper that there is an added value here.

Response: The table to map is one of several functionalities of rmap that makes it easy to use. We agree that it is a minor improvement and so remove it from the statement of need section as its own paragraph. We leave discussion of this functionality in the paper in other sections.


Comment 2: My other sticking point was "Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines" specifically the point: "enables some new research challenges to be addressed or makes addressing research challenges significantly better (e.g., faster, easier, simpler)." The data table to map being faster/easier/simpler was something I was not convinced about, but with the long back and forth we are at a point where I no longer want to be carrying out extensive testing on behalf of the authors. On the other hand if the key contribution is now this ease for the comparison around the different scenarios, this should be emphasised more.

Response: Yes, one of the key contributions is the comparison around different scenarios and time periods. This is already emphasized in the paper as follows:

- In the title of the paper: "rmap: An R package to plot and compare tabular data on customizable maps across scenarios and time". - Abstract (lines 13 to 14): Additionally rmap automatically detects and produces comparison maps if the data has multiple scenarios or time periods as well as animations for time series data - Statement of Need (lines 54 to 59): Difference maps: Existing packages do not produce difference maps to compare across scenarios or time periods. rmap provides this functionality by automatically recognizing multiple scenarios and time periods to produce difference maps across these dimensions. An important aspect of spatial data is exploring the difference between two scenarios or time periods and rmap makes this a seamless process. - And as a Separate Section called Compare Scenarios (lines 84 -100): With example code and figures.

Spatial comparisons across scenarios and time are very valuable in multiple fields and this is one of the features of rmap that has already been significantly used in the following published papers (and also in several additional papers in-progress and submission):


zarrarkhan commented 2 years ago

@hugoledoux can you please review our response and let us know if anything else is needed? Thanks Best Zarrar

CamilleMorlighem commented 2 years ago

Hi all,

Following comments from @hugoledoux and @maczokni, I have redone old tests of the package I did a while ago and new tests considering the updates of the package. See the following examples:

Load libraries

library(stringr)
library(dplyr)
library(rmap)

Example 1 : Senegalese population per region

# Get population data 
download.file("https://data.humdata.org/dataset/c686be28-1858-44e9-a44b-598ce32ca20b/resource/85ba6660-fad2-4e49-b9b3-9ecb754d5700/download/sen_admpop_adm1_2020.csv", 
              destfile = "sen_admpop_adm1_2020.csv", mode ="wb")
reg_pop = read.csv("sen_admpop_adm1_2020.csv")

# Clean data : missing accent on the "e"
reg_pop = reg_pop %>%
    mutate(ADM1_FR = str_replace_all(ADM1_FR, c("Thies" = "Thiès", "Kedougou" = "Kédougou", "Sedhiou" = "Sédhiou")))

map(reg_pop, 
    subRegion = "ADM1_FR", 
    value = "Total", 
    legendType = "pretty", 
    title = "Population per region in Senegal")

Note that the region name of three regions is misspelt (Thiès, Kédougou, Sédhiou) in the downloaded data with missing acute and grave accents, although the ADM1_FR column should contain the French names of the regions (and hence with accents).

map_param_PRETTY

Example 2 : Covid-19 cases per African country

# Get covid-19 data
download.file("https://data.humdata.org/dataset/9151988e-ae09-416c-9408-6df93bebf3a2/resource/ab898c39-2322-4bf0-a9a7-5897f5605bc1/download/africa_covid19_daily_deaths_national.csv",
              destfile = "africa_cov_cases.csv", mode ="wb")
cases_africa = read.csv("africa_cov_cases.csv", sep = ";")  

# Clean data : aggregate cases, remove empty rows and correct region names
cases_per_country = cases_africa %>% 
    mutate(sum = rowSums(cases_africa[,4:length(cases_africa)], na.rm = T)) %>%
    subset(ISO != "") %>%
    mutate(COUNTRY_NAME = str_replace_all(COUNTRY_NAME, c("Sao Tome and Principe" = "São Tomé and Principe", "Eswatini" = "eSwatini")))

map(cases_per_country, 
    subRegion = "COUNTRY_NAME", 
    value = "sum", 
    title = "Covid-19 cases per African country")

Note here that the names of eSwatini and São Tomé and Principe were misspelt in the downloaded data (respectively Eswatini and Sao Tome and Principe).

map_param_KMEANS

Example 3 : Infectious diseases in California

Plot different years

# Get infectious disease data 
download.file("https://data.chhs.ca.gov/dataset/03e61434-7db8-4a53-a3e2-1d4d36d6848d/resource/75019f89-b349-4d5e-825d-8b5960fc028c/download/odp_idb_2020_ddg_compliant.csv", 
              destfile = "infect_disease_us_county.csv", mode ="wb")

i_disease = read.csv("infect_disease_us_county.csv") ; colnames(i_disease)

# Clean data : keep years 2018-2020 and cases of Malaria, Typhus, Dengue and Lyme diseases
i_4_disease <- i_disease %>% 
    subset(grepl("Malaria|Typhus|Dengue|Lyme", Disease) & grepl("Total", Sex) & grepl("2018|2019|2020", Year)) %>% 
    filter(County != "California")  %>% # csv contains a row with sum of cases for all counties in California 
    mutate(County = paste0(County, "_CA")) # add proper state abbreviation to match built-in map counties 

malaria_map = map(i_4_disease[i_4_disease == "Malaria", ], 
                  subRegion = "County", 
                  value = "Cases", 
                  save = F, 
                  x = "Year", 
                  title = "Malaria cases")

malaria_map$`map_param_KMEANS`

map_param_KMEANS

Plot different diseases

disease_map = map(i_4_disease[i_4_disease$Year == "2020", ], 
                  subRegion = "County", 
                  value = "Cases", 
                  save = F, 
                  class = "Disease", 
                  ncol = 4, 
                  title = "Cases of several infectious diseases")

disease_map$`map_param_KMEANS`

map_param_KMEANS

Compare across years

malaria_map_diff_2018 = map(i_4_disease[i_4_disease == "Malaria", ], 
                            subRegion = "County", 
                            value = "Cases", 
                            save = F, 
                            xRef = "2018", 
                            underLayer = mapUS49County, 
                            title = "Difference with 2018 malaria cases")

malaria_map_diff_2018$map_param_KMEANS_xDiffAbs
malaria_map_diff_2018$map_param_KMEANS_xDiffPrcnt

map_param_KMEANS_xDiffAbs map_param_KMEANS_xDiffPrcnt

For some reason, malaria_map_diff_2018$map_param_KMEANS_xDiffPrcnt does not display values for some counties. I think this might be because extreme values of the legend classes are not comprised in any classes ? (e.g. Counties where exactly 0% change occurs are not represented in any class). I think this issue needs to be resolved. Adding square brackets around the extreme values of the classes could help specify in which class the extreme values are included.

After retesting the package, from my point of view, I think it is the responsibility of the user to make sure its data are complete, up-to-date and correct (e.g. make sure all Region names are properly written) and match the “subregions” of the rmap built-in maps. In any case, if the user had to use another solution than rmap, and join its data for example with the attribute table of an existing shapefile, he/she would still need to clean its data beforehand. Following @maczokni comments, maybe a section of the user guide could highlight the need for cleaning data and matching them with the built-in maps subRegions?

As for the novelty of the package, as raised by @maczokni, I think its true novelty lies indeed in the ability to easily compare across year/scenarios/categories. On another note, compared to existing packages like rnaturalearth and tigris, everything can be done here using only the package rmap, while the two former solutions require combining it with other packages such as tmap. Another advantage of rmap (that is maybe not highlighted enough in the paper) is that it creates ggplot objects which can easily be combined with others to change the map layout, for example, and I think ggplot might be more known among non-GIS users than tmap.

Except for the issue with the legend classes, I think rmap worked as expected in the examples above. Let me know what you think @hugoledoux and @maczokni.

hugoledoux commented 2 years ago

okay, I re-read the (very detailed and good) comments from the reviewers (thanks @CamilleMorlighem and @maczokni) and I think we can go ahead and try to finalise the submission.

I would like the 3 points raised above to be fixed:

  1. add a section about how to clean the data and match them with the maps subregions; this doesn't need to be in the paper but in the docs it would be necessary I believe, both reviewers struggle with this
  2. extreme values not being displayed for some datasets: can you look into that?
  3. in the "Statement of Need", perhaps add what @CamilleMorlighem mentions about ggplot objects?

Once this is done, I am happy to process further with acceptance of the submission

zarrarkhan commented 2 years ago

@hugoledoux @CamilleMorlighem @maczokni Thank you all for the great feedback and comments. Yes, we will address all these points and update the submission. Best Zarrar

zarrarkhan commented 2 years ago

@hugoledoux

Please see each of the points mentioned above addressed in the latest version of the paper.

  1. add a section about how to clean the data and match them with the maps subregions; this doesn't need to be in the paper but in the docs it would be necessary I believe, both reviewers struggle with this

Section on Cleaning Data added to user guide explaining how subRegions will need to be matched to built-in map names along with an example of how rmap automatically identifies the missing regions for you and how you can then go in and find the correct spelling.

  1. extreme values not being displayed for some datasets: can you look into that?

@CamilleMorlighem Thanks for catching this. The issue here actually was that the % difference calculation uses the base year value to calculate the % difference i.e. Percent Diff = (Value in Year_i - Value in Year_Ref)*100/(Value in Year_Ref). In the examples you provided or in general when there was a 0 value in the Ref year the result was coming out to be Infinite ('Inf'). This was then resulting in an NaN value being filtered out and thus those counties being removed. We have fixed this to behave better as follows:

If we redo your example with legendSingleValue = 0 and showNA=0 you get the following. This time rmap gives the warning about the Inf values and we can see "NAs" vs. '0' values in the maps.

library(stringr)
library(dplyr)
library(rmap)

# Get infectious disease data
download.file("https://data.chhs.ca.gov/dataset/03e61434-7db8-4a53-a3e2-1d4d36d6848d/resource/75019f89-b349-4d5e-825d-8b5960fc028c/download/odp_idb_2020_ddg_compliant.csv",
              destfile = "infect_disease_us_county.csv", mode ="wb")

i_disease = read.csv("infect_disease_us_county.csv") ; colnames(i_disease)

# Clean data : keep years 2018-2020 and cases of Malaria, Typhus, Dengue and Lyme diseases
i_4_disease <- i_disease %>%
  subset(grepl("Malaria|Typhus|Dengue|Lyme", Disease) & grepl("Total", Sex) & grepl("2018|2019|2020", Year)) %>%
  filter(County != "California")  %>% # csv contains a row with sum of cases for all counties in California
  mutate(County = paste0(County, "_CA")) # add proper state abbreviation to match built-in map counties

malaria_map_diff_2018 = map(i_4_disease[i_4_disease == "Malaria", ],
                            subRegion = "County",
                            value = "Cases",
                            save = F, show=F,
                            xRef = "2018",
                            underLayer = mapUS49County,
                            labels=T, legendSingleValue = 0)

malaria_map_diff_2018$map_param_KMEANS
malaria_map_diff_2018$map_param_KMEANS_xDiffAbs
malaria_map_diff_2018$map_param_KMEANS_xDiffPrcnt

map_param_KMEANS map_param_KMEANS_xDiffAbs map_param_KMEANS_xDiffPrcnt

  1. in the "Statement of Need", perhaps add what @CamilleMorlighem mentions about ggplot objects?

We mention the point about ggplot objects in the Statement of Needs as follows:

"Post-process customization: Existing packages do not produce output objects that can be saved and then customized. Customization of the maps in existing packages is limited to package specific functionality and arguments. rmap produces ggplot objects in which every element (axis, grids, titles, colors, line widths, facets) can all be customized after the map has been produced. This allows users to capitalize on existing knowledge of the widely used ggplot2 package and its arguments."

hugoledoux commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left: