ropensci / parzer

Parse geographic coordinates
https://docs.ropensci.org/parzer

unicode #10

Open sckott opened 5 years ago

sckott commented 5 years ago

Noam suggested maybe linking to the stringi C++ library: https://github.com/gagolews/ExampleRcppStringi

sckott commented 5 years ago

Ben Raymond's script for dealing with Unicode issues in coordinates: https://gist.github.com/raymondben/41428babea1e6f1ae0fb1db041f37dfd

jeroen commented 4 years ago

Can you describe the problem a bit?

sckott commented 4 years ago

notes from twitter conversation:

possible things used for degree signs

° U+00B0 DEGREE SIGN
⸰ U+2E30 RING POINT
◦ U+25E6 WHITE BULLET
∘ U+2218 RING OPERATOR
˚ U+02DA RING ABOVE
⁰ U+2070 SUPERSCRIPT ZERO
ᵒ U+1D52 MODIFIER LETTER SMALL O
º U+00BA MASCULINE ORDINAL INDICATOR

Also, there are a lot of things that might get used for minutes. Like `΄'ˈˊᑊˋꞌᛌ𖽒𖽑‘’י՚‛՝`'′׳´ʹ˴ߴ‵ߵʻʼ᾽ʽ῾ʾ᾿ (found via https://unicode.org/cldr/utility/confusables.jsp?a=%CA%B9&r=None)

Most correct would probably be:

′ U+2032 PRIME
‵ U+2035 REVERSED PRIME
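
One way to act on the degree-sign lookalikes listed in these notes, sketched in plain C++ with no Unicode library: treat the input as UTF-8 bytes and replace each lookalike's byte sequence with the canonical DEGREE SIGN. This is only an illustration, not what parzer does; `normalize_degree_signs` and `replace_all` are hypothetical names, and the sketch assumes the string handed over from R is valid UTF-8. Byte-level replacement is safe under that assumption because a complete UTF-8 sequence cannot occur at a non-boundary position inside other valid UTF-8 text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Canonical DEGREE SIGN (U+00B0) as UTF-8 bytes.
static const std::string kDegree = "\xC2\xB0";

// Replace every occurrence of `from` in `s` with `to` (hypothetical helper).
static void replace_all(std::string& s, const std::string& from,
                        const std::string& to) {
  std::size_t pos = 0;
  while ((pos = s.find(from, pos)) != std::string::npos) {
    s.replace(pos, from.size(), to);
    pos += to.size();  // continue after the replacement
  }
}

// Map the degree-sign lookalikes from the notes above onto U+00B0.
// Assumes `s` is valid UTF-8; no validation is attempted.
std::string normalize_degree_signs(std::string s) {
  static const std::vector<std::string> lookalikes = {
      "\xE2\xB8\xB0",  // U+2E30 RING POINT
      "\xE2\x97\xA6",  // U+25E6 WHITE BULLET
      "\xE2\x88\x98",  // U+2218 RING OPERATOR
      "\xCB\x9A",      // U+02DA RING ABOVE
      "\xE2\x81\xB0",  // U+2070 SUPERSCRIPT ZERO
      "\xE1\xB5\x92",  // U+1D52 MODIFIER LETTER SMALL O
      "\xC2\xBA",      // U+00BA MASCULINE ORDINAL INDICATOR
  };
  for (const std::string& l : lookalikes) {
    replace_all(s, l, kDegree);
  }
  return s;
}
```

The same table-driven approach could be extended to the minute-mark lookalikes, mapping them to PRIME (U+2032) or a plain ASCII apostrophe, whichever the parser expects.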

sckott commented 4 years ago

@jeroen thx for dropping in. the highest level issue is I just don't know where to start with unicode on the C++ side. if the issue was dealing with it on the R side, I'd have a better sense of what to do, but I want all the work to be done on the C++ side ideally.

An example: someone shared this latitude string on Twitter, and this package couldn't parse it at the time they shared it.

43º21' N

What's supposed to be a degree sign (°) is instead the "masculine ordinal indicator" (º).

As far as I can tell, there's no way to parse non-ASCII characters on the C++ side unless you use some kind of Unicode library?

Right now I use a hack on the R side https://github.com/ropensci/parzer/blob/master/R/zzz.R#L20 to parse known characters, but I'm sure the list of potential characters we'd have to deal with is quite long, as shown in my comment above.

Noam suggested a while back potentially linking to the stringi C API, but I'm not sure if that's a good approach or if there's anything better.
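
On the question of whether non-ASCII input can be inspected in C++ without a Unicode library: a `std::string` holding UTF-8 can be decoded into code points by hand, which is enough to tell º (U+00BA) from ° (U+00B0) in the `43º21' N` example. The following is a minimal sketch under the assumption that the input is UTF-8; `decode_utf8` is a hypothetical name, and the decoder does not check continuation bytes, overlong forms, or surrogates, so it is not a substitute for a real library such as ICU (which stringi wraps).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 string into Unicode code points (hypothetical helper).
// Stray bytes that cannot start a sequence become U+FFFD; continuation
// bytes are not validated, so malformed input gives unspecified results.
std::vector<char32_t> decode_utf8(const std::string& s) {
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    char32_t cp;
    std::size_t extra;  // number of continuation bytes to read
    if (b < 0x80) {
      cp = b; extra = 0;                 // 1-byte (ASCII)
    } else if ((b & 0xE0) == 0xC0) {
      cp = b & 0x1F; extra = 1;          // 2-byte sequence
    } else if ((b & 0xF0) == 0xE0) {
      cp = b & 0x0F; extra = 2;          // 3-byte sequence
    } else if ((b & 0xF8) == 0xF0) {
      cp = b & 0x07; extra = 3;          // 4-byte sequence
    } else {
      cp = 0xFFFD; extra = 0;            // not a valid lead byte
    }
    ++i;
    for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i) {
      cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    }
    out.push_back(cp);
  }
  return out;
}
```

With code points in hand, the parser could compare against a set of known lookalikes instead of matching raw bytes; whether that is preferable to linking against stringi is a separate design question.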