skoval / deuce

R package for web scraping of tennis data

Encoding: Latin1 vs. UTF-8 #6

Open kvanallen opened 4 years ago

kvanallen commented 4 years ago

The package currently declares Latin1 as its encoding. I am not sure why, since I have only worked with UTF-8, which is my default. Is there a specific reason for using Latin1 over UTF-8?

skoval commented 4 years ago

According to the R core team's guidance in Writing R Extensions:

If the DESCRIPTION file is not entirely in ASCII it should contain an ‘Encoding’ field specifying an encoding. This is used as the encoding of the DESCRIPTION file itself and of the R and NAMESPACE files, and as the default encoding of .Rd files. The examples are assumed to be in this encoding when running R CMD check, and it is used for the encoding of the CITATION file. Only encoding names latin1, latin2 and UTF-8 are known to be portable. (Do not specify an encoding unless one is actually needed: doing so makes the package less portable. If a package has a specified encoding, you should run R CMD build etc in a locale using that encoding.)

I think I originally assumed that the latin1 setting would set the locale encoding when the package was loaded and so avoid some possible warnings from the uses of 'readLines'. But it appears to affect only how the package's own files are handled. The recommendation is to omit the field to keep the package more portable, and I think it could safely be removed.
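One way to check whether the Encoding field is actually needed is to look for non-ASCII bytes in the package files. The following is only a rough sketch: `has_non_ascii` is a hypothetical helper, not something in deuce, and it relies on the fact that `iconv` returns NA for elements it cannot convert to ASCII.

```r
# Hypothetical helper (not part of deuce): TRUE if any line of the file
# contains a byte outside the ASCII range.
has_non_ascii <- function(path) {
  # Read the lines as-is, without re-encoding them
  lines <- readLines(path, warn = FALSE)
  # iconv() yields NA for any element that cannot be converted to ASCII
  any(is.na(iconv(lines, from = "latin1", to = "ASCII")))
}
```

If this returns FALSE for DESCRIPTION, NAMESPACE and every file under R/, the field can presumably be dropped without any other change.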

That would leave the handling of special characters to the individual functions (i.e., wherever readLines or read_html is called), as needed.
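As a minimal sketch of what that per-function handling could look like, assuming the source encoding is known in advance: `read_lines_utf8` is a hypothetical helper, not an existing deuce function, and the "latin1" default is only an assumption for illustration.

```r
# Hypothetical helper (not part of deuce): read lines whose encoding is
# known and return them converted to, and marked as, UTF-8.
read_lines_utf8 <- function(path, from = "latin1") {
  # Read the raw lines without relying on the session locale
  lines <- readLines(path, warn = FALSE)
  # Convert explicitly from the declared source encoding to UTF-8
  iconv(lines, from = from, to = "UTF-8")
}
```

For HTML, read_html in xml2 takes an encoding argument of its own, so the same idea applies there without a separate conversion step.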