ramnathv / slidify

Generate reproducible html5 slides from R markdown
http://www.slidify.org

Having problem with Chinese Characters in Windows environment #329

Open hetong007 opened 10 years ago

hetong007 commented 10 years ago

Chinese characters are usually encoded as UTF8 on Linux/OS X, but as GBK on Windows. Slidify currently has trouble telling UTF8 and GBK apart.
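To make the difference concrete, here is a minimal base-R sketch (not taken from slidify) that encodes the author name from the YAML header both ways. It assumes the local iconv build recognises the name "GBK"; the \u escapes keep the script itself ASCII-only, so it does not depend on how the script file is saved.

```r
# The author name from the YAML header, written with \u escapes so that
# this script does not depend on its own file encoding.
name <- "\u4f55\u901a"

# The same two characters become two different byte sequences.
utf8_bytes <- charToRaw(enc2utf8(name))
gbk_bytes  <- iconv(name, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

print(utf8_bytes)  # three bytes per character in UTF-8
print(gbk_bytes)   # two bytes per character in GBK

# Decoding the UTF-8 bytes as if they were GBK is what produces garbled text.
# sub = "?" replaces any bytes that are not valid GBK instead of returning NA.
garbled <- iconv(list(utf8_bytes), from = "GBK", to = "UTF-8", sub = "?")
print(garbled)
```

The last step mimics what happens when a UTF8 file is read on a system that assumes code page 936.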

One can clone my repo Douban_Folksonomy to reproduce the following result. A properly generated HTML version (built under Ubuntu 12.04) is available here. I am using Windows XP, but the same problem occurs on Windows 7 as well.

Here are the first few lines in my 'index.Rmd' file:

---
title       : 豆瓣网标签的整理和分析
subtitle    : 
author      : 何通
job         : 豆瓣算法组实习生
framework   : io2012        # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js  # {highlight.js, prettify, highlight}
hitheme     : tomorrow      # 
widgets     : [bootstrap]            # {mathjax, quiz, bootstrap}
mode        : selfcontained # {standalone, draft}
--- #ending

## 什么是标签?

---

On Windows, if my 'index.Rmd' file is encoded as UTF8, the function slidify throws an error, and the message shows unrecognized Chinese characters.

 > slidify('index.Rmd')

processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code

output file: index.md

Error in substring(string, start, end) : 
  invalid multibyte string at '<90>
<73>ubtitle    : 
author      : 浣曢€<9a>
job         : 璞嗙摚绠楁硶缁勫疄涔犵敓
framework   : io2012        # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js  # {highlight.js, prettify, highlight}
hitheme     : tomorrow      # 
widgets     : [bootstrap]            # {mathjax, quiz, bootstrap}
mode        : selfcontained # {standalone, draft}
--- #ending

## 浠€涔堟槸鏍囩锛<9f>

---

>- 璞嗙摚鐢靛奖涓殑鏍囩
  - ![](pics/what_is_folksonomy2.png)
>- 璞嗙摚闊充箰涓殑鏍囩
  - ![](pics/what_is_folksonomy3.png)
>- 璞嗙摚闃呰涓殑鏍囩
  - ![](pics/what_is_folksonomy4.png)

---
## 浠€涔堟槸鏍囩

>- 鐢ㄦ埛涓诲姩鐢熸垚
>- 瀵规枃瀛楀唴瀹逛笉鍔犻檺鍒
>- 鏄鐗╁搧鏈夌泭鐨勮ˉ鍏呰鏄庝俊鎭
>- 鑻辨枃閲岀О杩欐牱鐨勪笢瑗垮彨鍋<9a>**folksonomy**(folk+taxonomy)锛屽苟涓嶆槸*tag*

---

## 鏍囩鏃犲涓嶅湪

闄や簡璞嗙摚锛屽叾瀹炶繕鏈夊緢澶氬湴鏂瑰嚭鐜颁簡鏍囩锛<9a>

>- 鏂版氮寰崥涓殑鏍囩
  - ![](pics/

The characters shown are obviously different from the originals, and of course nobody could understand the garbled version.
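The "invalid multibyte string" message is what base R string functions report when the bytes they receive are not valid in the encoding R assumes for them. As a rough sketch of the general remedy (this is not slidify's own code, and the temporary file only stands in for index.Rmd), the encoding can be declared when the file is read, so the strings are marked as UTF-8 instead of being treated as native-encoding text:

```r
# Write a one-line UTF-8 file byte by byte, so the current locale plays no part.
text <- "\u4f55\u901a"
path <- tempfile(fileext = ".Rmd")
writeBin(c(charToRaw(enc2utf8(text)), as.raw(0x0a)), path)

# Declare the encoding while reading: the bytes are kept unchanged and the
# resulting strings are marked as UTF-8.
marked <- readLines(path, encoding = "UTF-8", warn = FALSE)

print(Encoding(marked))   # the declared encoding mark
print(validUTF8(marked))  # whether the bytes are well-formed UTF-8
```

String functions such as substring() can then work on characters rather than on misinterpreted bytes.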

If I switch to GBK for the Chinese characters, the function slidify works:

> slidify('index.Rmd')
processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code
output file: index.md
[1] "index.html"

But the HTML contains unrecognized characters:

Improper HTML. Compare it with the proper version: proper HTML

ramnathv commented 10 years ago

Can you print out your sessionInfo() so that I can see what versions of packages you are using?

hetong007 commented 10 years ago

Here comes:

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Chinese_People's Republic of China.936 
[2] LC_CTYPE=Chinese_People's Republic of China.936   
[3] LC_MONETARY=Chinese_People's Republic of China.936
[4] LC_NUMERIC=C                                      
[5] LC_TIME=Chinese_People's Republic of China.936    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
loaded via a namespace (and not attached):
[1] tools_3.0.2

And the result was generated with slidify 0.3.3

ramnathv commented 10 years ago

It seems to work with the latest version of slidify. I checked online using the slidify playground at http://slidify.github.io/playground. Make sure to remove the line with mode before you paste it to the playground.

You can install the latest version of slidify and slidifyLibraries by running

devtools::install_github(c('slidify', 'slidifyLibraries'), 'ramnathv')

Before you slidify your deck, make sure to delete the libraries folder in your slide deck directory.

hetong007 commented 10 years ago

I hit the same problem after installing the latest version using your code.

Since Linux/OS X handles Chinese without trouble, I guess the success of the slidify playground is not surprising.

But is the slidify playground running under a Windows environment? I suspect the way it deals with UTF8 and GBK is the main problem.

ramnathv commented 10 years ago

You are right. I believe the issue is a combination of Windows + Encoding. Let me see if I can test under Windows and get back on this.

hetong007 commented 10 years ago

Most Chinese users are suffering from it because Windows is still the most popular OS in China. A lot of users would benefit from fixing this issue :)

ramnathv commented 10 years ago

Can you try this @hetong007? It runs the index.Rmd through knitr directly, before passing it on to Slidify. This solution has fixed some problems with encoding, and I wanted to check if it has any effect on this problem.

slidify(knit("index.Rmd", encoding = 'GBK'), knit_deck = FALSE)

hetong007 commented 10 years ago

I used that code on the GBK file. The result remains exactly the same.

I also tried slidify(knit("index.Rmd", encoding = 'UTF8'), knit_deck = FALSE) on the UTF8 version. Not working either.

ramnathv commented 10 years ago

Okay. Let me try to isolate the problem here. If you run knit2html on your Rmd file, are the characters displayed correctly? Let us first try to make it work with knitr and then focus on how to get slidify working with it.

hetong007 commented 10 years ago

knit2html is not working correctly under Windows. I got error messages.

This is what I got from running it on the GBK version:

> knit2html('index.Rmd')

processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code

output file: index.md

Error in sub("#!r_highlight#", highlight, html, fixed = TRUE) : 
  invalid multibyte string at '<9f><<2f>title>

#!r_highlight#

#!mathjax#

<style type="text/css">
body, td {
   font-family: sans-serif;
   background-color: white;
   font-size: 12px;
   margin: 8px;
}

tt, code, pre {
   font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}

h1 { 
   font-size:2.2em; 
}

h2 { 
   font-size:1.8em; 
}

h3 { 
   font-size:1.4em; 
}

h4 { 
   font-size:1.0em; 
}

h5 { 
   font-size:0.9em; 
}

h6 { 
   font-size:0.8em; 
}

a:visited {
   color: rgb(50%, 0%, 50%);
}

pre {   
   margin-top: 0;
   max-width: 95%;
   border: 1px solid #ccc;
   white-space: pre-wrap;
}

pre code {
   display: block; padding: 0.5em;
}

code.r, code.cpp {
   background-color: #F8F8F8;
}

table, td, th {
  border: none;
}

blockquote {
   color:#666666;
   margin:0;
   padding-left: 1em;
   border-left: 0.5em #EEE solid;
}

hr {
   height: 0px;
   border-bottom: none;
   border-top-width: thin;
   border-top-style: dotted;

This is what I got from running it on the UTF8 version:

> knit2html('index.Rmd')

processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code

output file: index.md

Error in substring(u, so, so + ml - 1L) : 
  invalid multibyte string at '<9f><<2f>h2>

<hr/>

<blockquote>
<ul>
<li>璞嗙摚鐢靛奖涓殑鏍囩

<ul>
<li><img src="pics/what_is_folksonomy2.png" alt=""/></li>
</ul></li>
<li>璞嗙摚闊充箰涓殑鏍囩

<ul>
<li><img src="pics/what_is_folksonomy3.png" alt=""/></li>
</ul></li>
<li>璞嗙摚闃呰涓殑鏍囩

<ul>
<li><img src="pics/what_is_folksonomy4.png" alt=""/></li>
</ul></li>
</ul>
</blockquote>

<hr/>

<h2>浠€涔堟槸鏍囩</h2>

<blockquote>
<ul>
<li>鐢ㄦ埛涓诲姩鐢熸垚</li>
<li>瀵规枃瀛楀唴瀹逛笉鍔犻檺鍒<b6></li>
<li>鏄鐗╁搧鏈夌泭鐨勮ˉ鍏呰鏄庝俊鎭<af></li>
<li>鑻辨枃閲岀О杩欐牱鐨勪笢瑗垮彨鍋<9a><strong>folksonomy</strong>(folk+taxonomy)锛屽苟涓嶆槸<em>tag</em></li>
</ul>
</blockquote>

<hr/>

<h2>鏍囩鏃犲涓嶅湪</h2>

<p>闄や簡璞嗙摚锛屽叾瀹炶繕鏈夊緢澶氬湴鏂瑰嚭鐜颁簡鏍囩锛<9a></p>

<blockquote>
<ul>
<li>鏂版氮寰崥涓殑鏍囩

<ul>
<li><img src="pics/folksonomy_is_everywhere1.png" alt=""/></li>
</ul></li>
<li>缁熻涔嬮兘涓殑鏍囩

<ul>
<li><img src="pic

ramnathv commented 10 years ago

You need to explicitly pass the encoding to knit2html using knit2html('index.Rmd', encoding = "GBK").

hetong007 commented 10 years ago

Sorry, but the result still remains the same :(

ramnathv commented 10 years ago

Okay. Can you save your Rmd file and provide me a link to it? Don't copy paste it as I want to ensure that it is saved with the correct encoding. Since you are having trouble using knit2html as well, @yihui may have some idea as to what might be messing things up. Also print your sessionInfo() so that we know the versions of all packages that were loaded in your R Console.

hetong007 commented 10 years ago

@yihui is not a Windows user, maybe he chose to ignore those errors before :(

Here is a repo I just created with the Rmd files index-GBK.Rmd and index-UTF8.Rmd. Also, sessionInfo.txt has the result from sessionInfo().

ramnathv commented 10 years ago

Well knitr has lots of Windows users and I have seen @yihui do a lot of encoding related work. If there is an R expert on encoding, my money will be on @yihui :)

hetong007 commented 10 years ago

Chinese programmers suffer from encoding related problems every day. Thank you and good luck! :)

yihui commented 10 years ago

I think I know what is the problem, but it will take me a while to find out where the character encoding got messed up. The encoding of this page https://github.com/hetong007/temp_files/blob/master/index-GBK.html is not UTF-8, but it contains the spec <meta charset="utf-8">, which is wrong. Actually this page contains characters with different encodings: some are UTF-8 and some are GBK. It might be the problem of slidify, slidifyLibraries, whisker, or markdown.
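One way to spot this kind of mixed-encoding file is to check each line for UTF-8 validity. The following is only a sketch, assuming R 3.3.0 or later (for validUTF8()) and an iconv build that recognises "GBK"; the temporary file stands in for the generated HTML page.

```r
# Build a small file whose first line is UTF-8 and whose second line is GBK,
# imitating a page that mixes the two encodings.
word <- "\u6807\u7b7e"
utf8_line <- charToRaw(enc2utf8(word))
gbk_line  <- iconv(word, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

path <- tempfile(fileext = ".html")
writeBin(c(utf8_line, as.raw(0x0a), gbk_line, as.raw(0x0a)), path)

# Read the lines as raw text and test each one for well-formed UTF-8.
lines <- readLines(path, warn = FALSE)
is_utf8 <- validUTF8(lines)
bad <- which(!is_utf8)

print(is_utf8)
print(bad)  # line numbers that are not valid UTF-8
```

Lines flagged here would contradict a charset="utf-8" declaration in the page header.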

@hetong007 I rarely use Windows myself, but that does not mean I do not care about Windows users :)

ramnathv commented 10 years ago

@yihui, I understand why slidify fails on this file. The <meta charset="utf-8"> is from the slidifyLibraries template for the io2012 library, and can be fixed by modifying this line in the libraries folder.

The failure of knit2html is possibly explained either by the mixed encoding, or by the utf-8 encoding specified in the default template:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

I am thinking @hetong007 needs to convert the entire document to either GBK or UTF-8, and then modify the template if he is using GBK. Does that sound about right @yihui? Thanks for taking a look at this.
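Converting a whole file to a single encoding can be done in base R. This is a minimal sketch under stated assumptions: convert_to_utf8 is a hypothetical helper name (not part of slidify or knitr), the input is assumed to be entirely GBK, and the local iconv is assumed to recognise "GBK".

```r
# Hypothetical helper: re-encode a text file from `from` (default GBK) to UTF-8.
convert_to_utf8 <- function(infile, outfile, from = "GBK") {
  lines <- readLines(infile, warn = FALSE)          # bytes are kept as-is
  utf8  <- iconv(lines, from = from, to = "UTF-8")  # NA marks unconvertible lines
  if (anyNA(utf8)) {
    stop("some lines are not valid ", from, "; the file may mix encodings")
  }
  # useBytes = TRUE writes the UTF-8 bytes without re-encoding to the locale.
  writeLines(utf8, outfile, useBytes = TRUE)
  invisible(outfile)
}

# Usage: create a small GBK file, convert it, and read the result back.
src <- tempfile(fileext = ".Rmd")
dst <- tempfile(fileext = ".Rmd")
gbk <- iconv("\u6807\u7b7e", from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
writeBin(c(gbk, as.raw(0x0a)), src)

convert_to_utf8(src, dst)
result <- readLines(dst, encoding = "UTF-8", warn = FALSE)
print(validUTF8(result))
```

The stop() call is a deliberate choice: a file that mixes encodings should fail loudly rather than be half-converted.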

yihui commented 10 years ago

I'll take a look at @kohske's PR rstudio/markdown#49 and rstudio/markdown#50. The problem should be at least alleviated after the encoding problem is gone in the markdown package, although there are still other places that may have to be fixed.

ramnathv commented 10 years ago

Thanks @yihui. I will look forward to these fixes. I presume that these issues are non-existent with rmarkdown or is encoding handling still going to be tricky?

kohske commented 10 years ago

FYI, here is the fix of encoding for markdown, slidify, and knitrBootstrap. I hope someone else will also test this, and confirm it does not break any existing code.

The below is the test script and markdown files: http://kohske.github.io/sandbox/knit-encode.zip

kohske commented 10 years ago

I tested the UTF8 file including GBK characters (below) and slidify works perfectly on Windows!! https://github.com/hetong007/Douban_Folksonomy/blob/master/index.Rmd

Note that before running slidify, you need to change the locale's code page to 936.
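The locale that R is currently using can be inspected before running slidify. A small sketch in base R; the commented-out Sys.setlocale() call is an assumption about a typical Windows setup (the locale name "Chinese" may not be accepted on every system), so it is left disabled here.

```r
# Inspect the character-type locale of the current R session.
info <- l10n_info()
cat("LC_CTYPE:        ", Sys.getlocale("LC_CTYPE"), "\n")
cat("Multibyte locale:", info$MBCS, "\n")
cat("UTF-8 locale:    ", info[["UTF-8"]], "\n")

if (.Platform$OS.type == "windows") {
  # On Windows, l10n_info() also reports the active code page (936 for GBK).
  cat("Code page:       ", info$codepage, "\n")
  # Assumption: on a Windows system with Chinese language support, this would
  # switch the session to a code page 936 locale.
  # Sys.setlocale("LC_CTYPE", "Chinese")
}
```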

ramnathv commented 10 years ago

Thanks @kohske. This is a really significant contribution as it opens up things for a large group of users. I will run through the tests and merge this weekend. Can you add yourself as a contributor in the DESCRIPTION file?

hetong007 commented 10 years ago

@kohske Thanks, this solution works perfectly on my Windows XP!

Meanwhile, the framework of the generated slides is not the same as before, i.e. io2012 is not applied to the generated file. Is it caused by the dev version of slidify @ramnathv ?

ramnathv commented 10 years ago

Are you using RStudio? If yes, what version? If you can paste a screenshot of the output you get, that would be useful for me to figure out what might be going on.

kohske commented 10 years ago

@ramnathv Okay, thanks. Note that MBCS-compatible slidify requires MBCS-compatible markdown package.

hetong007 commented 10 years ago

@kohske After running the code install_github("kohske/knitrBootstrap@fix/encode", quick=TRUE), there's a warning saying package ‘’ is not available (for R version 3.0.2). The name of the 'missing' package is empty. Is it a tiny bug, or did I just miss something? Thank you.

kohske commented 10 years ago

@hetong007 This is due to the DESCRIPTION of knitrBootstrap: the entry "R (> 3.0.0)," has a trailing comma and should be "R (> 3.0.0)". Please just ignore the warning. Thanks for your test and report!!

hetong007 commented 10 years ago

@ramnathv I am using the newest RStudio, i.e. 0.98.692, under dev_mode(), and I am generating the HTML file with only the pics folder and the index.Rmd file from the original repository.

The output information is

d> slidify("Douban_Folksonomy-master/index.Rmd", encoding="UTF8")
processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code
output file: index.md
Copying files to libraries/frameworks/io2012...
Copying files to libraries/highlighters/highlight.js...
Copying files to libraries/widgets/bootstrap...
Warning messages:
1: In readLines(con, ...) : incomplete final line found on 'index.Rmd'
2: In readLines(con, ...) : incomplete final line found on 'index.Rmd'

And the first page looks like this:

[Screenshot: io2012 not working]

The second page looks like this:

[Screenshot: io2012 not working 2]

Compared with this original version, the difference is easy to see.

kohske commented 10 years ago

@hetong007 Obviously the libraries in the original repository are quite old. The results are the same as what the newer version generates under Mac OS X.

ramnathv commented 10 years ago

@kohske is right. I updated the default stylesheets for io2012, adding the bottle green background in the title slide and the blue color for slide titles. You can always modify it, if you prefer a different appearance of the slides.

hetong007 commented 10 years ago

@ramnathv @kohske Thanks for pointing that out. Then I would say Chinese users (maybe including Japanese and other users as well) will enjoy slidify in Windows! Thanks :)

ramnathv commented 10 years ago

Thanks to @kohske for so diligently plugging away on this. Encoding issues are not the most pleasant ones to be working on, but are so critical. I will try to merge this pull request this weekend, after ensuring that it doesn't break any other features of slidify. @kohske, please add yourself as a contributor in the DESCRIPTION!

kohske commented 10 years ago

@ramnathv I did it, thanks.

ramnathv commented 10 years ago

Thanks to @kohske, I just merged in some changes that provide for better encoding support. You can install it from the fix-encode branch.

library(devtools)
install_github("ramnathv/slidify@fix-encode")

Can you install it and test if it solves the encoding issues you had mentioned here?

hetong007 commented 10 years ago

This fixes everything on my system. But I am using Win 7 instead of Win XP now. I hope that doesn't matter.

I created two Rmd files in GB2312 and UTF8 respectively, and ran the following code:

library(devtools)
install_github("ramnathv/slidify@fix-encode")

# setwd(...)

require(slidify)

slidify('index.Rmd', encoding='CP936')
slidify('index-UTF8.Rmd', encoding='UTF8')

The result is great.

Thank you @ramnathv and @kohske

kohske commented 10 years ago

Thanks @ramnathv, everything works perfectly with Japanese_Japan.CP932 and UTF8 under Win7.

suensummit commented 9 years ago

Thanks for all your efforts! @hetong007 @ramnathv @kohske This patch works well with Traditional Chinese under Win8 (with encoding UTF8) as well, great job!

ramnathv commented 9 years ago

All credit should go to @kohske for painstakingly working on fixing encoding related issues.

yihui commented 9 years ago

Is the fix-encode branch ready to be merged, then?

ramnathv commented 9 years ago

Yes. I will be merging it this weekend, when I will be working on slidify.