ropensci / parzer

Parse geographic coordinates
https://docs.ropensci.org/parzer

updated Scrub for efficiency #30

Closed AlbanSagouis closed 3 years ago

AlbanSagouis commented 3 years ago

Description

This new version of scrub(), written in base R only, is much faster than the previous one. Instead of searching for a long list of special characters, the gsub() call matches anything other than digits, letters, '.', ',', ' ', and '-'.
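To illustrate the idea, here is a minimal sketch of a negated character class in gsub(). It is not the code from this PR: the function name scrub_sketch, the exact pattern, and the use of perl = TRUE are assumptions, and replacing each unmatched character with an apostrophe is inferred from the expect_equal() test quoted in this description.

```r
# Hypothetical sketch, not parzer's actual scrub() implementation.
# Every character that is NOT an ASCII letter, a digit, '.', ',',
# a space, or '-' is replaced by a single apostrophe.
scrub_sketch <- function(x) {
  # perl = TRUE keeps the A-Z / a-z ranges independent of the locale;
  # the trailing '-' inside the brackets is a literal hyphen.
  gsub("[^A-Za-z0-9., -]", "'", x, perl = TRUE)
}

# Two backticks, a masculine ordinal indicator, a prime and a double prime,
# written with \u escapes so the example does not depend on file encoding.
scrub_sketch("``\u00ba\u2032\u2033")

# Characters in the allowed set pass through unchanged.
scrub_sketch("-45.5, 120 N")
```

A single negated class means the regex engine makes one pass per string, rather than one pass per special character, which is presumably where most of the speed-up comes from.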

I did not profile the parsing of a long string, so I don't know whether scrub() accounts for a large share of the total parsing time, but it is still an improvement.

Example

> microbenchmark::microbenchmark(scrub(rep(c(test_lats, test_lons),10)), scrub2(rep(c(test_lats, test_lons),10)), times = 1000)
Unit: microseconds
                                     expr    min      lq      mean  median      uq    max neval
  scrub(rep(c(test_lats, test_lons), 10)) 3144.3 3197.60 3399.5188 3320.95 3488.30 4466.2  1000
 scrub2(rep(c(test_lats, test_lons), 10))  397.0  413.55  450.9085  423.65  459.45  922.1  1000

I updated the tests on scrub(), which might not be best practice, but it makes sense: in expect_equal(scrub("``º′″"), "'''''") we can expect 5 characters, not four, and the parsing functions handle it well.

sckott commented 3 years ago

Thanks, having a look.