ropensci / software-review

rOpenSci Software Peer Review.

Submission: skynet #214

Closed FilipeamTeixeira closed 5 years ago

FilipeamTeixeira commented 6 years ago

Summary

Type: Package
Package: skynet
Version: 1.0.2
Title: Generates Networks from BTS Data
Authors@R: person("Filipe", "Teixeira", email = "Filipe.MarquesTeixeira@Ugent.be", role = c("aut", "cre"))
Maintainer: Filipe Teixeira <Filipe.MarquesTeixeira@Ugent.be>
URL: https://github.com/FilipeamTeixeira/skynet
Description: A flexible tool that allows generating bespoke
    air transport statistics for urban studies based on publicly available
    data from the Bureau of Transport Statistics (BTS) in the United States
    <https://www.transtats.bts.gov/databases.asp?Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0>.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
Collate:
    'PowerLaw.R'
    'createNodes.R'
    'data_ODsample.R'
    'data_airports.R'
    'data_carriers.R'
    'data_metro.R'
    'disparityfilter.R'
    'findAirport.R'
    'fromto.R'
    'convertRaw.R'
    'netImport.R'
    'netDir.R'
    'netUnd.R'
    'netInt.R'
    'netPath.R'
    'skynet.R'
    'nodeStats.R'
    'import_t100.R'
    'import_db1b.R'
    'findCarrier.R'
    'plotMap.R'
    'skynet_example.R'
    'bootnet.R'
Depends:
    R (>= 3.1.2)
Imports:
    data.table,
    igraph,
    dplyr,
    ggplot2,
    ggrepel,
    stringr,
    maps
Suggests:
    knitr,
    rmarkdown,
    testthat,
    kableExtra
VignetteBuilder: knitr

Geospatial data and network analysis, because the package works with airline networks and allows their generation and analysis.

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail

noamross commented 6 years ago

Thank you for your submission, @FilipeamTeixeira! A couple of quick questions before we proceed:

1) You say you are currently updating the package to incorporate other sources. We need a relatively stable version so that reviewers don't have to work with a moving target. Can you let us know the status of this and when you will complete this update?

2) Just checking, do you know if these data are available for download via stable URLs or an API? The submission would be stronger if it enabled download as well as data processing.

3) I see you checked off that the package has a research application but not that you want to submit to JOSS. Was this a mistake? Do you or don't you want to submit to JOSS? If you do, a 200-500 word paper.md is required.

Finally, we'll also need you to add code coverage to your CI and include a code coverage badge in your readme, which you can do with usethis::use_coverage().
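As a quick sketch (assuming you go with Codecov as the coverage service), the setup is just:

# run once in the package directory: sets up covr/Codecov and adds a badge to the README
usethis::use_coverage(type = "codecov")

# check coverage locally before pushing
covr::package_coverage()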

FilipeamTeixeira commented 6 years ago

Dear @noamross, thank you for your remarks.

  1. Skynet was built to use exclusively data from the Bureau of Transport Statistics, more specifically the DB1B and T100 datasets, so the final and stable version uses those datasets. However, as some researchers asked for the possibility of importing data from other sources, I've gradually been building new import functions to accommodate other airline data. The update will only add a new import function and does not affect the functioning of the package as it is: it simply processes csv files from other data sources, rather than the DB1B and T100, into a format skynet can work with.

  2. I've updated the readme file to show the links to those datasets. The vignettes already had them, but hopefully this will be clearer now. Unfortunately the BTS doesn't have an API, and web scraping is rather unreliable in this case. I'm not sure I understood the data processing remark, as skynet allows importing csv files from the links mentioned in the readme file.

  3. I've submitted it to another journal but it is pending review.

Added coverage badge as well.

Thank you.

noamross commented 6 years ago

Thanks for your responses, @FilipeamTeixeira, that's helpful.

Editor checks:


Editor comments

Below is the output from goodpractice::gp(), which summarizes things that should be fixed before proceeding to review. I've noted that some are optional, but they are notes for yourself and reviewers to consider. The main issue is that your test coverage is low (<25%). We don't require 100% coverage, but we do require coverage of the major areas, and from the coverage report I see that many functions and files are not tested. We generally expect at least 70% coverage or so, and then have reviewers comment on whether they think edge cases are adequately tested.

── GP skynet ───────────────────────────────────────

It is good practice to

✖ write unit tests for all functions, and all package code in general. 22% of code lines are covered by test cases.

R/bootnet.R:21:NA
R/bootnet.R:22:NA
R/bootnet.R:23:NA
R/bootnet.R:24:NA
R/bootnet.R:25:NA
... and 656 more lines

✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting services provide bug trackers for free, https://github.com, https://gitlab.com, etc. ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are used it and it is easier to read your code for them if you use '<-'.

R/convertRaw.R:34:7
R/convertRaw.R:36:7
R/convertRaw.R:38:7
R/convertRaw.R:40:7
R/createNodes.R:91:5
... and 15 more lines

✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try to make your lines shorter than 80 characters. (Optional, to be considered by author and reviewers based on code readability.)

R/bootnet.R:26:1
R/convertRaw.R:47:1
R/convertRaw.R:48:1
R/convertRaw.R:60:1
R/convertRaw.R:61:1
... and 83 more lines

✖ not import packages as a whole, as this can cause name clashes between the imported packages. Instead, import only the specific functions you need. (optional, to be considered by author and reviewers)

✖ fix this R CMD check NOTE: Namespace in Imports field not imported from: ‘maps’. All declared Imports should be used.

✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml. These were most likely included in error. See section ‘Package structure’ in the ‘Writing R Extensions’ manual. (You should add an .Rbuildignore file for these; see https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Building-package-tarballs)

✖ avoid 'T' and 'F', as they are just variables which are set to the logicals 'TRUE' and 'FALSE' by default, but are not reserved words and hence can be overwritten by the user. Hence, one should always use 'TRUE' and 'FALSE' for the logicals.

R/PowerLaw.R:NA:NA

──────────────────────────────────────────────


Reviewers: @toph-allen @berdaniera Due date: 2018-05-23

FilipeamTeixeira commented 6 years ago

Thank you @noamross. Your comments have been quite helpful. I've listed all the remarks fixed with - [x] and the ones partially fixed or with comments with - [ ].

noamross commented 6 years ago

Thanks for the (fast!) update. Things look good and I'll proceed to find reviewers.

Also, NBD, but it looks like one of your tests trips your own deprecation warning:

test_import_db1b.R:18: warning: (unknown)
netImport function is deprecated, please use `import_db1b()`, or `import_t100()`.
FilipeamTeixeira commented 6 years ago

Bug fixed.

noamross commented 6 years ago

Now that your package has passed the editor-check stage, you can add our review status badge to your README if you would like:

[![](https://badges.ropensci.org/214_status.svg)](https://github.com/ropensci/onboarding/issues/214)
noamross commented 6 years ago

Reviewers assigned!

Reviewers: @toph-allen @berdaniera Due date: 2018-05-23

Aaron and Toph may be interested in the (alpha) pkgreviewr package recently developed by one of our community members to help structure the process: https://github.com/ropenscilabs/pkgreviewr

noamross commented 6 years ago

👋 Hello @toph-allen and @berdaniera, just a friendly reminder that your reviews are due next Wednesday, May 23.

FilipeamTeixeira commented 6 years ago

Hi @noamross. Any further news on this? I'm not sure if the reviewers noticed that they were supposed to review the code by today. Thanks

noamross commented 6 years ago

Hi @FilipeamTeixeira. I'm eager to see the reviews, too! I've been in touch with both reviewers and while they are a bit behind they are on it.

berdaniera commented 6 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 3


Review Comments

Overall, I think this is a valuable package. My impression is that this is important for consistently wrangling and formatting a complex dataset and quickly generating network analyses for that data. Streamlining this preprocessing (skynet's goal) has value for anyone interested in the airline data.

The primary functions are clean and simple. I hope the comments below will be helpful for clarifying some of the documentation. For the functions, can you talk about the choices for using snake case, camel case, and dot naming?

The main recommendations I have are about documentation for getting the data locally first. At the very least, I'd suggest a short numbered guide with instructions for how to download data. When exploring the functions, I was able to get everything to work with the example data. Before I looked at the vignette (which explains which variables I needed) I tried to download data from the weblink but did not get the necessary variables (which import_db1b asks for directly with dplyr::select), which gave me an error.

In a broader scope, I agree with @noamross's suggestion to investigate a direct way to download the data without leaving R. The initial investigation below might be helpful in that. I'm willing to contribute to this portion of the code if we decide that it is in fact possible (it will require some more research).

Initial investigation of direct data access

The web form at https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=289 creates a POST link that submits a request to their server, with the main part being a SQL string, e.g.: +SELECT+ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID+FROM++T_DB1B_TICKET+WHERE+Quarter+=1+AND+YEAR=1993

That submission creates a temporary file that gets stored here https://transtats.bts.gov/ftproot/TranStatsData/ and the return of the POST contains the url for the file, which is then downloaded with a GET request.

Given that sequence, I think it is POSSIBLE that you could create a function that uses httr to POST the request with a specific set of parameters (pre-specifying the required variables, user defined year and quarter, etc.) to download the data directly.

The only hangup would be if the POST requires cookie credentials from the website. But, I think that could also be obtained in a first step.

Maybe at least think about if this is a desirable function for the package. It would increase the safety of the functions (no errors from users getting the wrong data) and make the process of getting the data that much easier on the user.
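As a very rough sketch of what that could look like with httr (the form-field name sqlstr, the response format, and the download step below are all untested assumptions that would need to be checked against the actual request the form sends):

library(httr)

# hypothetical helper -- endpoint and field names are guesses, not a tested recipe
download_db1b_ticket <- function(year, quarter) {
  sql <- paste0(
    "SELECT ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID",
    " FROM T_DB1B_TICKET WHERE Quarter=", quarter, " AND YEAR=", year
  )
  resp <- POST(
    "https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=289",
    body = list(sqlstr = sql),
    encode = "form"
  )
  txt <- content(resp, "text")
  # the response should contain the ftproot/TranStatsData/ URL of the generated file
  file_url <- regmatches(txt, regexpr(
    "https://transtats\\.bts\\.gov/ftproot/TranStatsData/[^\"']+", txt))
  GET(file_url, write_disk(tempfile(fileext = ".zip"), overwrite = TRUE))
}

If the POST turns out to need a session cookie, grabbing it with an initial GET on the form page would be the first thing to try.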

Package metadata files

For the readme, I'd suggest adding a short annotated example of downloading data and reading it in, instead of just a link to the files. E.g., do I download all of the data for a given year/quarter? Do I download the default fields from BTS? That step may be familiar to people who use these data but could be a challenge for newcomers.

Online documentation

The online documentation (https://FilipeamTeixeira.github.io/skynet/) is clear and provides a basic overview of the main functions.

The Overview page could give some additional detail to motivate me to use this package. What does the package enable? What are some of the main challenges that arise from this BTS data? E.g., data wrangling and preparation for network analysis?

The example code on the overview page requires me to have an 'OD_2011Q1' object in my environment, which I did not already have. I'd recommend adding a line to that example to show the data download and import.

The Get Started page is a great feature of this documentation. On that page, the Import Datasets instructions say to use netImport(), but when I do that I get a warning message:

netImport function is deprecated, please use `import_db1b()`, or `import_t100()`.

This seemed to work:

import_db1b(skynet_example("Coupon_2001Q1.csv"), skynet_example("Ticket_2001Q1.csv"))
make.netDir(OD_2001Q1, disp = TRUE, alpha = 0.05)

On the Get Started page in the Create Networks section I tried to modify the above function to group by carriers (with carrier = TRUE) and got an error about not yet being able to do this operation with the disparity filter. After checking out the help for ?make.netDir I inferred that I could remove disp = TRUE from the code, but it could be explicit in the documentation by showing a code snippet for the carriers line: make.netDir(OD_2001Q1, carrier = TRUE) and adding a note that it doesn't work yet with the disparity filter.

Functionality

The example code

import_db1b(skynet_example("Coupon_2001Q1.csv"), skynet_example("Ticket_2001Q1.csv"))
make.netDir(OD_2001Q1, disp = TRUE, alpha = 0.05)

# a minimal example code:
make.netDir(OD_2001Q1)

# a fully documented example with the default parameters specified:
make.netDir(OD_2001Q1, 
            disp = FALSE, alpha = 0.003, 
            cap = FALSE, pct = 10, 
            carrier = FALSE, metro = FALSE)

For the import_db1b documentation, I suggest explicitly stating in the description that this is for importing both the Coupon and Ticket data from the same year and quarter.

Group by carrier

On the Get Started page in the Create Networks section I tried to modify the above function to group by carriers (with carrier = TRUE) before I looked at the help documentation and got an error about not yet being able to do this operation with the disparity filter.

After checking out the help for ?make.netDir I saw that I could remove disp = TRUE from the code. This could be explicit in the documentation by showing a code snippet with the carrier operation and adding a note that it doesn't work yet with the disparity filter.

Another suggestion is in the main example code to use all of the defaults, either displaying all of the arguments or just using the minimal default example (like I did above).

# does not work:
make.netDir(OD_2001Q1, disp = TRUE, alpha = 0.05, carrier = TRUE)

# works:
make.netDir(OD_2001Q1, carrier = TRUE)

Filter edge weight

In the help for make.netDir I'd suggest sorting the arguments by their use; disp with alpha, cap with pct.

make.netDir(OD_2001Q1, cap = TRUE)

Create map

The documentation to create maps assumes that I have an object called list, which I think is the saved network object. The first command (from the doc) does not work without creating the object first.

# does not work:
netMap(list$netDir)

# works:
my_network <- make.netDir(OD_2001Q1, cap = TRUE)
netMap(my_network$netDir)

The maps are very nice, good work on that!

Additional functions

my_airport <- findAirport("RDU")

findCarrier("Southwest")

findCarrier("WN")

fitPlaw(my_network$gDir)

# some extra exploration
nodeStats(OD_2001Q1) %>%
  dplyr::filter(airport == "RDU")

fromto_network <- make.netDir(OD_2001Q1)
fromto.stat(fromto_network$gDir, "ATL", orig = "from")

These functions work fine for me. For the power law fit, I wonder if it makes sense to export the Alpha and R^2 results as list objects that can be stored as output instead of just printing.

Code inspection

Inline code documentation is complete and clear, which will make future contributions easy.

Suggest using message or warning instead of print to communicate with the user (e.g., for the return message in fitPlaw or make.path).
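A hypothetical sketch of how those two points could combine (the function and argument names here are illustrative, not the current fitPlaw() internals): use message() for the printed summary and return the statistics invisibly, so the result can also be captured in a variable.

# illustrative only -- not the actual fitPlaw() implementation
fit_power_law <- function(alpha, r_squared) {
  message("Alpha: ", signif(alpha, 3), "  R^2: ", signif(r_squared, 3))
  invisible(list(alpha = alpha, r_squared = r_squared))
}

fit <- fit_power_law(1.9, 0.87)  # prints the summary, but also returns a usable list
fit$alpha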

noamross commented 6 years ago

Thank you for your review, @berdaniera! A quick note on your comments about importing data. Site-scraping functions like this are useful, but they can be unreliable, as data providers may change their back-end processing. I leave it up to you whether to implement this, @FilipeamTeixeira, but should you choose to, you should make sure that the function documentation and maybe a message indicate that the function depends on an undocumented/unsupported API and may be unstable. You should also check that it doesn't violate the terms of service, and have your HTTP requests set a user-agent like "skynet R package, developed by your@email". BTS does not appear to have a robots.txt file, which would be the clearest way to explicitly allow or disallow this sort of use.
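For example (a minimal sketch using httr; the contact string is just the placeholder from above), a package-wide default user agent could be set once and applied to every request:

library(httr)

# identify the package in every request it makes
set_config(user_agent("skynet R package, developed by your@email"))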

FilipeamTeixeira commented 6 years ago

Thank you for your thorough review @berdaniera. I'll go through the comments as soon as possible, but I just wanted to first reply to the comment about importing data. Initially I had such a feature, but they changed something on their website and it suddenly stopped working. There are a couple of packages (R and Python) online with the same aim, and they all stopped working after a while. A remark I once got from the Bureau of Transport Statistics was that they weren't making any API available for this data, meaning that scraping wasn't allowed either. I believe it might be related to the commercial databases providing this data for a hefty fee.

Anyway, I stopped investing time in finding a solution. It might be something to try again if there's an easy way of doing it, but if not I'll have to leave it as it is.

For the rest I'll go through it as soon as possible.

FilipeamTeixeira commented 6 years ago

For the functions, can you talk about the choices for using snake case, camel case, and dot naming?

The rationale behind it was: dot naming for main functions (e.g. make.netDir), with the camel case at the end indicating the type of network being created; camel case for all secondary functions (e.g. createNodes); and snake case for import functions (e.g. import_db1b). If you believe that it doesn't make sense, please let me know. When I started writing this package I never thought it would grow this big, but I wanted to make sure that its functions would end up making some sense.

The Overview page could give some additional detail to motivate me to use this package. What does the package enable? What are some of the main challenges that arise from this BTS data? E.g., data wrangling and preparation for network analysis?

This is a tricky one. I currently have a paper in review, which I would like to wait to be published before adding more information. Then it will be possible to add the full paper and as much information as possible. I'm also presenting this at an air transport conference soon, meaning that I can add even more after that.

For the readme, I'd suggest adding a short annotated example of downloading data and reading it in, instead of just a link to the files. E.g., do I download all of the data for a given year/quarter? Do I download the default fields from BTS?

For the import_db1b documentation, I suggest explicitly stating in the description that this is for importing both the Coupon and Ticket data from the same year and quarter.

Done

The example code on the overview page requires me to have an 'OD_2011Q1' object in my environment, which I did not already. I'd recommend adding a line to that example to show the data download and import.

When importing a csv using import_db1b, for example, it creates an object with a name derived from those files. This means that when importing Coupon 2011Q1 and Ticket 2011Q1, it will generate an OD_2011Q1 object.

The Get Started page is a great feature of this documentation. On that page, the Import Datasets instructions say to use netImport(), but when I do that I get a warning message

There was an issue with the documentation which has been fixed meanwhile.

Group by carrier On the Get Started page in the Create Networks section I tried to modify the above function to group by carriers (with carrier = TRUE)

I'm not really sure what you are referring to here. The limitation with the carrier and disparity filter is merely a mathematical challenge. Having multiple carriers means that the origin-destination matrix will have parallel edges. However, the disparity filter wasn't designed to work with parallel edges. I didn't design the filter, as I've mentioned in the documentation.

Filter edge weight In the help for make.netDir I'd suggest sorting the arguments by their use; disp with alpha, cap with pct.

Done

Create map The documentation to create maps assumes that I have an object called list, which I think is the saved network object.

I'm not sure I understood your question. As per the documentation, the create map function asks for a data frame. The example with list reflects the fact that the make.net* functions always create a list with three objects.

These functions work fine for me. For the power law fit, I wonder if it makes sense to export the Alpha and R^2 results as list objects that can be stored as output instead of just printing.

I've been wanting to improve this function to reflect more what's needed in network analysis. For now if I understood your comment correctly, I've fixed what you've mentioned.

Once again thank you all and if you have any remarks or if I missed something please let me know. Current version is 1.0.4.

toph-allen commented 6 years ago

Geez, sorry I'm so tardy with my review! I'll have it finished by the end of Saturday.

I'm really looking forward to taking a look at this package! I have a more-than-passing familiarity with the underlying DB1B and T100 datasets, and am interested to see how they're surfaced here.

noamross commented 6 years ago

My apologies to @toph-allen for mis-tagging him and to @toph for bothering him unnecessarily!

toph-allen commented 6 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing:

2.5-3


Review Comments

Overview

(I wrote this overview after I wrote most of the comments that follow)

The skynet package performs useful functions on some flight datasets which are tricky to work with by themselves. The DB1B and T100 datasets are complex, and having a package which helps wrangle them into a more usable format is a definite value.

I think there could be improvements made to the package's documentation, and to its initial setup process.

Installation

Installed with no problem.

Importing the Files — Documentation & Code

DB1B and T100 are really confusing datasets, especially when you're first dealing with them.

The README.md file on the skynet repo does a good job of orienting the user to set up the package and referring them to the place they can download the files.

However, I ran into some holdups getting the right files downloaded for import. The README.md recommends downloading files with a specific set of variables included, because the full datasets are very large.

The table on line 58 in README.md (line 55 of README.Rmd) lists the suggested variables to include. However, these variable names are not a direct match for the "Field Name" or the "Description" fields of the DB1B download form.

For example:

  • "Destination" is one of the recommended fields, and this corresponds to the variable whose "Field Name" is "Dest" and "Description" is "Destination Airport Code".
  • The DB1BTicket variable "FarePerMile" is referred to as "Yield" in the README.md file and in the error messages displayed, and I was only able to figure out which variable needed to be checked by cross-referencing with the package source code.

For these reasons, it'd be best to use the specific "Field Name", e.g. "ItinID", "MktID", etc. displayed on the BTS download pages.

I'm unclear if the recommended names for the T100 dataset are a requirement for the package to function correctly or just a recommendation for clarity.

There is an error on line 63 of import_db1b.R. The bulk fare indicator, which is exported by BTS as BULK_FARE in the CSV header, is specified in the import function as BULKFARE, which causes an error. In addition, the function expects a variable named DISTANCE_FULL. I think this is referring to the DISTANCE BTS variable. You should probably double-check all the variable names, to be sure.

The import_db1b() function (and its T100 equivalent) should include a more explicit description of the object that they generate.

Comments on function style

I would recommend making some changes to the behavior of the function import_db1b(). If other functions in the package behave similarly, I would recommend changing them too.

First, I'd recommend changing the names of the function's arguments and the way they're handled.

The function requires two arguments, x and y, each a CSV file. The function searches for "ticket" and "coupon" in the file names to determine which is which. Using the file names as implicit indicators of their contents, and having two named arguments which are actually interchangeable, goes against common coding style and might be confusing to users.

In my opinion, it would be clearer to users and more robust to different file names if the function required two arguments named coupon and ticket, each corresponding to the expected file — especially since both are required.

Second, I'd recommend changing the way the function returns its data. The function uses the filename of the input coupon dataset to determine the name of an object, and assigns that object. This object is assigned in an environment which is defined in import_db1b.R, presumably when the package is loaded.

The expected behavior would be for the function to return the created object, so that it can be assigned like:

db1b <- import_db1b("data-raw/71336240_t_DB1B_COUPON.csv", "data-raw/71336240_t_DB1B_TICKET.csv")

This gives the user more control over the name and location of the object that's created by the import process.

The import_t100() function follows this second pattern, and I'd also recommend changing it there.

Network Generation Functions

I'm less familiar with iGraph, so my comments here are a little less detailed.

Using the functions included, I was able to replicate the vignette's network map from the DB1B datasets successfully.

All the functions which begin with make. process the imported data to generate various different objects. Some of these are clear from the function documentation, but some aren't, and they could generally use a little more detailed description. In general, I would recommend including the following information:

Miscellaneous

FilipeamTeixeira commented 6 years ago

@toph-allen Thank you for your comments and thorough review.

In particular, I echo the previous reviewer's comments about clarity in getting the user set up with the correct data files downloaded. Ideally, as the other reviewers have noted, the package would include a way to download the data directly in R. Failing that, clearer documentation would be helpful.

As I've mentioned before, the BTS changes their website structure often, which I assume is related to the fact that they don't allow scraping. It might be counterproductive to create a function that will be unusable after a couple of weeks.

I found that there were some variable names expected by import_db1b() which differed from the BTS names for downloaded files, and some other non-standard ways that the function operates, which could combine to cause confusion and frustration for a user who's unfamiliar with the package or the data. In particular, I think that the import function's use of argument names, and its assigning rather than returning the final object, should be changed.

Thank you for your feedback on this. However, the variables to be used are well described in the ReadMe file and in the help section for each function. There's also an argument zip for when users don't want to manually select their variables.

I echo the previous reviewer's comments about some of the naming of objects in example code being somewhat confusing (e.g. an object called list in the netMap() function documentation). I think that the documentation would be improved by some additional information, as described above, below, and echoing the comments from other reviewers.

I'm not sure about this comment or which naming you are referring to. Some of the issues previously mentioned by the other reviewer have been fixed.

However, I ran into some holdups getting the right files downloaded for import. The README.md recommends downloaded files with a specific set of variables included, because the full datasets are very large.

The table on line 58 in README.md (line 55 of README.Rmd) lists the suggested variables to include. However, these variable names are not a direct match for the "Field Name" or the "Description" fields of the DB1B download form.

For example:

"Destination" is one of the recommended fields, and this corresponds to the variable whose "Field Name" is "Dest" and "Description" is "Destination Airport Code". The DB1BTicket variable "FarePerMile" is referred to as "Yield" in the README.md file and the error messages displayed, and I was only able to figure out which variable needed to be checked by cross-referencing with the package source code. For these reasons, it'd be best to use the specific "Field Name", e.g. "ItinID", "MktID", etc. displayed on the BTW download pages.

I don't think that I fully understood these remarks. Both the DB1B and the T100 datasets use non-standard naming. This means that if you select the variables yourself, they will have one name, but if you download the full file (zipped) they will be named differently. The reason why I chose a different variable name was to homogenise this and other issues. As I am working on a way to import other datasets (e.g. Sabre, OAG), it only makes sense that the variables have a standard name closer to all the others. Also, long variable names might generate confusion and they are not really clear for those working with statistics.

I'm unclear if the recommended names for the T100 dataset are a requirement for the package to function correctly or just a recommendation for clarity.

File size is not an issue as they are quite small, so you can always download all the variables.

There is an error on line 63 of import_db1b.R. The bulk fare indicator, which is exported by BTS as BULK_FARE in the CSV header, is specified in the import function as BULKFARE, which causes an error. In addition, the function expects a variable named DISTANCE_FULL. I think this is referring to the DISTANCE BTS variable. You should probably double-check all the variable names, to be sure.

Again, this is described in the ReadMe file. There are differences between the zipped file and the pre-selected variables file. The variables are named differently, and this might be the reason for your confusion. When importing the zipped file, which uses the zip = TRUE argument, the variables are named Distance and BulkFare; however, when selecting the variables yourself, they are called DISTANCE_FULL and BULK_FARE respectively. So there's no error there. I lower-cased everything as well to make it easier to import into SQL databases.

The import_db1b() function (and its T100 equivalent) should include a more explicit description of the object that they generate.

Thank you for your comment, but this is explained in the Vignettes.

The function requires two arguments, x and y, each a CSV file. The function searches for "ticket" and "coupon" in the file names to determine which is which. Using the file names as implicit indicators of their contents, and having two named arguments which are actually interchangable, goes against common coding style and might be confusing to users.

In my opinion, it would be clearer to users and more robust to different file names if the function required two arguments named coupon and ticket, each corresponding to the expected file — especially since both are required.

Changed.

Second, I'd recommend changing the way the function returns its data. The function uses the filename of the input coupon dataset to determine the name of an object, and assigns that object. This object is assigned in an environment which is assigned in import_db1b.R, presumably when the package is loaded.

When working with temporal data it is easier this way. However, the name is now created not based on the file but on whether it is DB1B, in which case it will start with OD, or T100, in which case it will start with T100. The rest is derived from the year and quarter variables. The reason why I forced the name this way was that uniform naming seemed to be helpful according to the researchers working with this data, who gave me feedback on this.

Details on the type of object returned by the function. For example, make.netDir() returns a list containing a data frame, an iGraph object, and summary statistics. However, this information isn't detailed in the function's documentation, only in the package's vignette. If the fields in the returned objects aren't immediately obvious from their names, I would include a list of the variable types and how they are computed. For iGraph objects, which may be less widely known than objects like data frames, you might want to include a link to the iGraph package documentation online. This is less crucial, though — users can look this up themselves if need be — unless you link to a specific, relevant part of the iGraph documentation.

Thank you for your comments. However, I'm not sure if I understood them correctly. I've added some extra information on the named objects generated, but I'm not 100% sure, based on your comments, whether it will be enough.

I'd include a link in the documentation for import_db1b() and import_t100() either to the pages where the user can download the files, or to the documentation which includes more detailed instructions on downloading the files with the requisite variables.

It's in the ReadMe file.

The README.md recommends naming the T100 CSV files to T100 year mkt and T100 year seg, but the function documentation recommends naming them T100 2011Q1mkt.csv and T100 2011Q1seg.csv. Ideally, the filenames shouldn't matter, and the files should explicitly be passed as one type or another.

The README.md should include community guidelines.

Fixed.

The usage of the skynet_example() function is not clear from its documentation. I don't know if this function is entirely necessary — example data could be included as .RData files in the package's data directory.

This was required and requested by CRAN. As for the package, I do understand your remarks, but I do want this package to follow CRAN reviewers' comments as well, so for now it has to be there.

The netMap() function's help page is titled PlotMap — perhaps leftover from a previous name of that function, judging from the source file's name.

Fixed.

The map function uses curves, but these aren't great circle arcs — perhaps this caused unwieldy plots for some map projections, though, so this isn't as big a deal. The plots produced are clear.

Thank you for your comments here. I actually didn't want them to be great circle arcs. The reason why they are curved is to make the map easier to read. This was feedback from researchers using the package for their research.

My initial thought was that the functions might be clearer and more consistent if they all used snake case for names — I wasn't sure if there was a specific rationale for using that versus dots versus camel case. I see that the other reviewer has raised this, and while I think that consistent snake case would be more in line with what I'd have expected, it's not something which I'd quibble with too much.

Thank you for your comments here. I've explained the rationale behind the naming above, in my comments to the other reviewer.

Thank you @toph-allen, @berdaniera and @noamross for all the comments and for all the help making this package better. Meanwhile I've pushed all the changes to my GitHub, and skynet is currently on version 1.1.0.

noamross commented 6 years ago

Thank you @toph-allen for your review and @FilipeamTeixeira for your response. Let me comment on a few things here now that both reviews are in:

@toph-allen and @berdaniera, please let us know how well @FilipeamTeixeira's changes address your comments.

FilipeamTeixeira commented 6 years ago

@noamross Just a quick reply to your comments.

1. Indeed, that would require a major change, which I might include in a next major release.
2. I've been thinking of "fixing" this by working with classes. That would eliminate the need for specific naming for several objects. However, this would only come in a major release as well.
3. Fixed. I forgot that function indeed.
4. I'm not sure what this refers to.

Thank you once again for your constructive and detailed comments.

noamross commented 6 years ago

Hello, @FilipeamTeixeira, regarding (4), a specific example I'm referring to is @toph-allen's suggestion

I'd include a link in the documentation for import_db1b() and import_t100() either to the pages where the user can download the files, or to the documentation which includes more detailed instructions on downloading the files with the requisite variables.

Yes, this information is in the README, but the concept of "multiple points of entry" is that the user may be looking for information in the function documentation and should find it there, either by repeating the information or by linking to another source.

Regarding 1, 2, 3, and other issues brought up by reviewers: having a consistent and fully documented API where functions behave as commonly expected is pretty core to our standards, so these major-version changes will be needed before we're able to accept. We at times accept packages that only address some review items via a "roadmap" for a future version (see the skimr review for an example), but this tends to be limited to cases of extensions or, at times, re-factoring of internals for the purpose of enabling extensions.

Specifically with (2) - classes seem like a likely solution and S3 classes are relatively simple to implement, though I'm still not sure what the problem is that assigning to the environment solves, based on your feedback from users. Perhaps if you describe the use-case in a bit more detail we could suggest a solution.

FilipeamTeixeira commented 6 years ago

Dear @noamross. Thank you for your comments and for clarifying my questions. For (4) I've fixed this issue. I wasn't fully understanding what exactly the remark was, but I've managed to understand the goal now. Regarding (1), I'm afraid that it won't happen anytime soon. The main reason here is that this package's paper has been submitted, and to change every function in the paper now would be sloppy. There is still some consistency in the functions and what they are meant to do, and I do intend to change it in the near future, probably still this year when I add machine learning support. I hope you all understand this, and that I don't really have an option here.

About (2), I'm also not fond of assigning objects like that, but the issue is simple. Most users will use this package for longitudinal analysis, meaning that they won't work with merely one year but with multiple. Some users mentioned that it would be easier to name the imported objects automatically, so they could incorporate them in an easy way into loops. If the object syntax is always the same, except for the suffix which shows the year and quarter, it is easy then to run any sort of analysis for, let's say, 10 years. From my understanding classes wouldn't solve this, as you would still need to name each object. So, for example, if I wanted to run an average degree centrality function on all of my networks and get confidence intervals, it is easy to do if the only thing changing is the suffix of my target object(s). I am aware that you can do this from the import, but this standardises it and makes it more consistent. However, if you have any other suggestion I'm willing to look into it, as I don't doubt that you all have more experience than I do.

(3) has been fixed, so there's nothing to point here.

Once again thank you for your remarks and for your help with this package.

noamross commented 6 years ago

@FilipeamTeixeira Regarding the API and review timing issue, one option would be to put this issue on hold until your other review is done. We generally don't recommend putting a package in review in two places at once, for just this reason. There are times where, because a package is in widespread use or tightly linked to other tools, it makes sense to avoid changing the API for compatibility reasons. However, I don't think that is the case here - we would rather help get the package to a point where it is more usable for more people.

Re: (2) Making function input or output depend on file or object naming is pretty fragile and incompatible with common functional programming strategies. A much more common and robust way of doing this type of workflow is to create a list of objects and then iterate over that list using a function such as lapply() or purrr::map(). This approach is broken by assigning the object to the environment. If the date of the object is critical, a better way for it to be retained is as an attribute or as part of the list that makes up the imported object (along with the igraph object and data frame). Then, if you are iterating over several objects, it is easy to extract the date from each one. S3 classes help this along with, for instance, print methods that highlight important parts of the object such as the date, or extractor functions like date(object). If your audience isn't familiar with lists or iterating functions like lapply(), you can demonstrate them in the vignette, and another option is to create a helper function that, say, imports a folder full of data and creates a list of objects.
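To make that concrete, here is a minimal sketch of the list-based workflow, assuming import_db1b() returns the imported object rather than assigning it (the file layout, the period attribute, and the degree summary are only illustrations):

library(purrr)

coupon_files <- list.files("data-raw", pattern = "Coupon_.*\\.csv$", full.names = TRUE)
ticket_files <- sub("Coupon", "Ticket", coupon_files)

# one list holding every quarter, with the period kept as an attribute
networks <- map2(coupon_files, ticket_files, function(coupon, ticket) {
  od <- import_db1b(coupon, ticket)
  attr(od, "period") <- sub(".*_(\\d{4}Q\\d)\\.csv$", "\\1", coupon)
  od
})

# iterate over all quarters without caring what the objects are called
mean_degree <- map_dbl(networks, function(od) {
  g <- make.netDir(od)$gDir
  mean(igraph::degree(g))
})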

@toph-allen and @berdaniera, it would be helpful if you could let us know how well @FilipeamTeixeira's changes address your comments at this stage. Even if we pause the review, it would be good to record those now so we have a record of the current state.

FilipeamTeixeira commented 6 years ago

@noamross I do understand and I also don't want to start a discussion here about this as it would be counterproductive. However, I have the feeling that I'm probably not properly conveying my message and the users' needs.

The need for correctly named objects is to avoid having to create lists. Surely with lapply() or with purrr::map() it is possible to run a function based on the S3 classes, but I'm still not really sure how it would optimise this. Running a function would then have to scan through the list, check the S3 class of each object, return it, and apply a function on it if it matches the requirements. Now it only has to get the object name and run the function. I'm all for optimising my code, but I would really like to understand how something will do that, as we shouldn't do something just because it has been standard. What might work for some might not work for others, and there are plenty of examples like that in R.

As for naming the functions, I do understand as well. However, this decision was made after talking to people working with the package. I would like to reiterate and stress that this package is rather specific when it comes to its target audience, meaning that we cannot compare it with others. Also, dplyr uses both snake case and dot naming, so this example exists in numerous packages.

If indeed this will be paused, it is a pity, but I regret to say that it will stay this way for a while, as the users are my priority. If there are any remarks on the function naming, and any arguments that would explain a standardisation of those functions, I'm happy to reconsider the changes. Another alternative is to create aliases for the functions, as can be found in numerous packages, at the risk of increasing complexity.

In any case I'm truly thankful for all the help you guys gave so far and for all the time invested in this. So even if this goes nowhere it was already rather productive for me.

toph-allen commented 6 years ago

@FilipeamTeixeira Thanks for your responses to my review. I'll attempt to respond concisely to them, and to clarify and contextualize some of my original comments.

For context, I have some familiarity with the DB1B and T100 datasets — I'm part of a team at work which is creating models using these datasets. This is our first time working with these datasets, and we wound up creating some internal scripts to do some of what skynet accomplishes. I can definitely see a role for a package like skynet in helping users work more quickly and efficiently with these datasets.

I see that you made some updates to the code in response to some of the discussion here. Thanks for being responsive to our feedback. I think that some of the changes, especially with regard to things like argument handling, will be beneficial to new users. @noamross's comments on the consistency of arguments across packages are good explanations as to why.

"Destination" is one of the recommended fields, and this corresponds to the variable whose "Field Name" is "Dest" and "Description" is "Destination Airport Code". The DB1BTicket variable "FarePerMile" is referred to as "Yield" in the README.md file and the error messages displayed, and I was only able to figure out which variable needed to be checked by cross-referencing with the package source code. For these reasons, it'd be best to use the specific "Field Name", e.g. "ItinID", "MktID", etc. displayed on the BTW download pages.

I don't think that I fully understood these remarks. Both the DB1B and the T100 datasets use non-standard naming. This means that if you select the variables yourself, they will have one name, but if you download the full file (zipped) they will be named differently. The reason why I chose a different variable name was to homogenise this and other issues. As I am working on a way to import other datasets (e.g. Sabre, OAG), it only makes sense that the variables have a standard name closer to all the others. Also, long variable names might generate confusion and they are not really clear for those working with statistics.

I see what you're saying here — since the datasets all use different names for certain values, you're using a generic name. In that case, it might be helpful to give some examples of what the variables might be named — but this wasn't a high priority comment, in my mind.

There is an error on line 63 of import_db1b.R. The bulk fare indicator, which is exported by BTS as BULK_FARE in the CSV header, is specified in the import function as BULKFARE, which causes an error. In addition, the function expects a variable named DISTANCE_FULL. I think this is referring to the DISTANCE BTS variable. You should probably double-check all the variable names, to be sure.

Again, this is described in the ReadMe file. There are differences between the zipped file and the pre-selected variables file. The variables are named differently, and this might be the reason for your confusion. When importing the zipped file, which uses the zip = TRUE argument, the variables are named Distance and BulkFare; however, when selecting the variables yourself, they are called DISTANCE_FULL and BULK_FARE respectively. So there's no error there. I lower-cased everything as well to make it easier to import into SQL databases.

When this happened, I was actually using the custom-selected variable file. The following code outputs the first line of each file; you can see that the variables are named differently from what the function expects. Do you think this is a recent change to the BTS website, like you mentioned?

$ head -n 1 71336240_T_DB1B_TICKET.csv
"ITIN_ID","COUPONS","ORIGIN_AIRPORT_ID","ORIGIN_AIRPORT_SEQ_ID","ORIGIN_CITY_MARKET_ID","ROUNDTRIP","ITIN_YIELD","PASSENGERS","ITIN_FARE","BULKFARE","DISTANCE_FULL","DISTANCE_GROUP","ITIN_GEO_TYPE",
$ head -n 1 71336240_T_DB1B_COUPON.csv
"ITIN_ID","MKT_ID","SEQ_NUM","COUPONS","YEAR","ORIGIN_AIRPORT_ID","ORIGIN_AIRPORT_SEQ_ID","ORIGIN_CITY_MARKET_ID","QUARTER","ORIGIN","DEST_AIRPORT_ID","DEST_AIRPORT_SEQ_ID","DEST_CITY_MARKET_ID","DEST","TRIP_BREAK","COUPON_TYPE","OPERATING_CARRIER","DISTANCE","GATEWAY",

@noamross's comments about making documentation more available and comprehensive, and about returning an object from the import functions for assignment, convey my thoughts on those subjects very well. I had no idea that the temporal nature of the dataset was the reason for the assignment and file naming. (Our team was actually doing the same thing with these datasets — using file names to represent different quarters, and pulling that info from the file name, and I had us instead use the year and quarter columns from the datasets to pull that info, for the principles that have been outlined elsewhere in the thread.)

I'll keep an eye on this and am happy to help out more as needed.

noamross commented 6 years ago

Many of our packages have a specific audience, but as I hope the case of @toph-allen above shows, there are many potential users out there who may use your package for reasons you have not anticipated. Our purpose in standardizing is to make sure this larger audience is served and that your package can be useful to a broader audience, and that your work gets exposure to them. We do not want standards for standards' sake. Of course every package has idiosyncrasies driven by speciality uses. Yet we want to ensure the exceptions help a user understand the package and keep it compatible with other workflows they may have. (In dplyr, for instance, dot-naming specifically identifies S3 conversion methods such as as.xxx which is a very common pattern across packages and base R)

Aliasing names and deprecating functions over time is a strategy we recommend! My co-editor Scott has written a guide to this: https://ropensci.org/technotes/2017/01/05/package-evolution/
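For instance, a minimal sketch of the alias-plus-deprecation pattern (make_net_dir is purely an illustrative new name):

# new, preferred name
make_net_dir <- function(...) {
  # ... actual implementation ...
}

# old name kept as a thin wrapper that warns and forwards
make.netDir <- function(...) {
  .Deprecated("make_net_dir")
  make_net_dir(...)
}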

FilipeamTeixeira commented 6 years ago

@noamross Thank you for your comments. I'll try to add some aliases as soon as possible. As for @toph-allen 's comments, I believe that there might be some confusion here. He actually supported assigning names to the objects for the sake of temporal analysis.

FilipeamTeixeira commented 6 years ago

@noamross function naming was corrected according to your specifications, except for one minor internal function which I will patch in the next version. I kept all the older functions, but with a message stating that they are deprecated. As for the object created by assignment, I hope that @toph-allen's comment made the reason behind that approach clearer.

I believe that all the requirements have been fulfilled unless there's something else I might be missing.

Once again thank you.

toph-allen commented 6 years ago

Just to eliminate any confusion: in my review, I wasn't requesting an argument to let the users name the objects which are then assigned in-function to the global environment. Rather, I think it would be better to follow convention and return the object at the end of the function with return(object), so that the user can assign it when they call the function, as in processed_data <- import_db1b(x = "foo", y = "bar").

FilipeamTeixeira commented 6 years ago

@toph-allen So what was your point with the temporal data and the naming? Because there's no other way of running temporal data (in this case), except for asking the users to name every single object, which I find counterproductive. As we can't "scan" the entire environment for classes or even arguments, I'm not sure what the alternative is.

I had no idea that the temporal nature of the dataset was the reason for the assignment and file naming. (Our team was actually doing the same thing with these datasets — using file names to represent different quarters, and pulling that info from the file name, and I had us instead use the year and quarter columns from the datasets to pull that info, for the principles that have been outlined elsewhere in the thread.)

What made you change between methods? I can give the user the option through an argument. That's the only option I see.

FilipeamTeixeira commented 6 years ago

@noamross @berdaniera @toph-allen To reach a middle ground, I've added the option to not have the import functions automatically assign objects. This way the user can choose. I assume that this satisfies both parties.

R CMD check results 0 errors | 0 warnings | 0 notes

toph-allen commented 6 years ago

@FilipeamTeixeira Taking a look at the code, I see that a user can now provide the argument auto = FALSE to have the function return the created object instead of assigning it. That behavior is the one I'd expect, coming to the package with no prior knowledge, and it's good that it's available. 👍

To respond to your point about temporal data:

If I'm understanding it right, the motivation for having the assignment happen automatically in the function is that it makes it easier for your users to then loop over a vector of filenames for different time periods and automatically create the objects (for example, CSVs for 2017 Q1–Q4)?

If I were dealing with multiple files, and the function returned an object for each file, I might use one of the map() family of functions, from the purrr package, to either create a list of objects, or map assignment myself.

Or, I'd create a single data frame for all time periods, and use group_by() %>% summarize() to create summaries for different time periods.

And having the option to return the object enables both of these workflows.
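For illustration, a sketch of those two options (the year, quarter, and passengers columns, and the idea that import_db1b(..., auto = FALSE) returns a data frame, are assumptions on my part):

library(dplyr)
library(purrr)

coupon_files <- list.files("data-raw", pattern = "Coupon_.*\\.csv$", full.names = TRUE)

# option 1: a list of per-quarter objects
quarters <- map(coupon_files, function(coupon) {
  import_db1b(coupon, sub("Coupon", "Ticket", coupon), auto = FALSE)
})

# option 2: one combined data frame, summarized by period
all_quarters <- bind_rows(quarters)
all_quarters %>%
  group_by(year, quarter) %>%
  summarize(total_passengers = sum(passengers))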

noamross commented 6 years ago

Thanks, all. @berdaniera and @toph-allen, could you at this point look over the full changes and let us know if they cover all of your original concerns?

FilipeamTeixeira commented 6 years ago

@noamross Sorry to bother but any news on this? Thank you.

berdaniera commented 6 years ago

@FilipeamTeixeira @noamross I still need to look over all of the changes. I've set aside time this afternoon for it. Look for a reply from me by the end of the day.

berdaniera commented 6 years ago

Thank you for replying to my review comments @FilipeamTeixeira. I am happy with the changes that you have made. My primary concerns in the original review were around the function naming, basic examples, and documentation for new users. I understand the desire to wait on further overview documentation until some text is published. The changes in the functions address my original concerns and the expansion of the online documentation will be helpful for new users to get data. The update on importing data looks good to me too.

Overall, the author has responded to my review and made changes to my satisfaction. I recommend approving this package.

toph-allen commented 6 years ago

I concur. I think that the changes @FilipeamTeixeira has made address my concerns and improve the package overall, making it easier to begin using. Exposing the package to more people and workflows will help it continue moving in a direction where it can be broadly useful.

noamross commented 6 years ago

Approved! Thanks @FilipeamTeixeira for submitting and @toph-allen and @berdaniera for your reviews!

To-dos:

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.

FilipeamTeixeira commented 6 years ago

@noamross Thank you for your help and for accepting the package.

I just have a practical and possibly dumb question: is it possible to keep the package both on my repo and on the ropensci repo, even if mirrored or so? I'm asking because some people are used to checking my GitHub for the package and I'm not sure how to re-point them to the new link.

Thank you

noamross commented 6 years ago

@FilipeamTeixeira GitHub automatically sets up redirects when packages are transferred, so users visiting the original link to your repo will find it.

Note you can also still pin the repository to your GitHub profile.

FilipeamTeixeira commented 6 years ago

@noamross thank you a lot again for all the help. About the blog post you mentioned, I would love to contribute, so just let me know how I can do it.

Once again thank you.

stefaniebutland commented 6 years ago

About the blog post you've mentioned I would love to contribute.

@FilipeamTeixeira Thank you for your patience - it was up to me to get back to you about your interest in contributing a blog post, but I'm about 19 days late!

This link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length: https://ropensci.org/tags/review/. Technotes are here for more short form posts: https://ropensci.org/technotes/.

Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post. Right now I have openings in the blog calendar for late August and September. When do you think you'd be ready to submit a draft?

Happy to answer any questions.