ropensci / parzer

Parse geographic coordinates
https://docs.ropensci.org/parzer

parzer methods for numeric input #34

Closed robitalec closed 3 years ago

robitalec commented 3 years ago

Thanks for the neat package.

I was exploring this as an option for aggregating different sources of data with lat/long columns to generate consistent coordinates columns.

Most of my lat/long inputs, however, are numeric, and I'm mostly interested in the checks that parzer applies to the input. Unfortunately, parzer's methods seem quite slow on numeric input compared with simpler approaches. Here is a simplified example that only checks that lat/long values fall within the expected ranges:

library(parzer)
library(data.table)

N <- 1e4

DT <- data.table(
    X = runif(N, -120, -100),
    Y = runif(N, 50, 62)
)

system.time({
    DT[, parsex := parse_lon(X)]
    DT[, parsey := parse_lat(Y)]
})
#>    user  system elapsed 
#>   4.929   0.019   4.968

system.time({
    DT[!between(X, -180, 360), parsex := NaN]
    DT[!is.nan(parsex), parsex := X]
    DT[!between(Y, -90, 90), parsey := NaN]
    DT[!is.nan(parsey), parsey := Y]
})
#>    user  system elapsed 
#>   0.004   0.000   0.003

Created on 2021-06-24 by the reprex package (v2.0.0)

After I saw that most of the methods in the docs show various character inputs, I realized this might be by design. So I'm wondering:

  1. If parzer is designed mostly for parsing messy character input and not for numeric input, can I submit a PR highlighting this in the README or docs? It could also be useful to include a list of the checks that parzer applies, so that folks can learn what to look out for in numeric data (e.g. valid ranges of lat and long, inverted lat/long, etc.).

  2. If parzer is also well suited to numeric input, maybe there are simpler methods we could pass numerics to that avoid the less efficient character method? This feels like probably more work than it's worth: most users understand how to check and manipulate numeric columns to confirm that values are within some range, etc.
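For reference, the kind of numeric checks mentioned in the two points above can be sketched in a few lines of base R. This is only an illustration, not parzer's implementation: `check_range()` is a hypothetical helper, the bounds are the ones used in the timing example above, and the swapped-coordinate test is just one possible heuristic.

```r
# Hypothetical helper (not part of parzer): keep values inside
# [lower, upper] and replace everything else, including NA, with NaN.
check_range <- function(x, lower, upper) {
  stopifnot(is.numeric(x))
  ok <- !is.na(x) & x >= lower & x <= upper
  ifelse(ok, x, NaN)
}

# Toy input: the second pair has lon and lat exchanged.
lon <- c(-110.5, 55, -181, NA)
lat <- c(55.2, -110, -90, 45)

lon_checked <- check_range(lon, -180, 360)
lat_checked <- check_range(lat, -90, 90)

# Assumed heuristic for inverted lat/lon: the latitude is out of range,
# but the pair would be valid if the two values were exchanged.
maybe_swapped <- is.nan(lat_checked) &
  !is.nan(check_range(lon, -90, 90)) &
  !is.nan(check_range(lat, -180, 360))
```

Because everything here is vectorised arithmetic on numeric vectors, it avoids any string parsing, which is presumably where most of the time goes in the character-oriented path.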

Thanks!

AlbanSagouis commented 3 years ago

Thanks for the comment! Just the beginning of an answer: yes, parzer was developed for coordinates stored in character strings.

This page comparing packages that deal with coordinates might help you, or help document what makes parzer unique (https://ropensci.github.io/CoordinateCleaner/articles/Comparison_other_software.html).

robitalec commented 3 years ago

Great resource, thank you for sharing it. Ya, the more I wrote out this issue and thought about parzer, the more I realized its real strength is in parsing messy characters. I can manage the relatively short list of simple checks for numeric coordinates with a couple of data.table calls. From the linked table, I'm mostly focused on the first few: missing coordinates, duplicates, 0/0, identical lon/lat, within a study region, and possibly outside of the CRS.
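A rough sketch of those first few checks is below. It uses base R data frames so it runs without extra packages, but the same logical conditions carry over to data.table. The column names, the toy rows, and the study-region bounding box (borrowed from the simulated data earlier in this thread) are all assumptions for illustration, and the CRS check is left out.

```r
# Toy data; column names are assumptions for illustration.
coords <- data.frame(
  lon = c(-110.2, NA, 0, 55.0, -110.2, -150.0),
  lat = c(55.1, 60.0, 0, 55.0, 55.1, 58.0)
)

# Assumed study region: the box used to simulate data earlier in the thread.
region <- list(lon = c(-120, -100), lat = c(50, 62))

# Rows with both coordinates present; used to keep NA out of comparisons.
complete <- !is.na(coords$lon) & !is.na(coords$lat)

flags <- data.frame(
  missing      = !complete,
  duplicate    = duplicated(coords),  # later copies of an earlier row
  zero_zero    = complete & coords$lon == 0 & coords$lat == 0,
  same_lon_lat = complete & coords$lon == coords$lat,
  outside      = complete &
    (coords$lon < region$lon[1] | coords$lon > region$lon[2] |
     coords$lat < region$lat[1] | coords$lat > region$lat[2])
)

# Keep only rows that raise no flag at all.
clean <- coords[!Reduce(`|`, flags), ]
```

Keeping the flags as separate columns, rather than filtering straight away, makes it easy to see why a given row was dropped.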

Thanks!

sckott commented 3 years ago

Thanks for the issue @robitalec and thanks @AlbanSagouis for answering.

Echoing @AlbanSagouis's point that it's designed for character input.

I agree on updating the README or a vignette to highlight that the pkg is designed for character data, and to give a list/explanation of the various checks that are done. If you are willing to do a PR, or at least get one started, that'd be great.

robitalec commented 3 years ago

Thanks @sckott and @AlbanSagouis.

I've opened a PR now. At the moment, it doesn't include a list of the checks that are done. Relevant to numeric input, I found range checks for -90, 90 and -180, 360. But I'm not sure it's worth listing them explicitly. Let me know what you think.