trinker / qdap

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
http://cran.us.r-project.org/web/packages/qdap/index.html

Does qdap support analyses of scripts written in languages other than English? #211

Closed cognitivepsychology closed 9 years ago

cognitivepsychology commented 9 years ago

I'm a newbie to qdap, trying to perform a qualitative analysis on text data written in Korean (not English, obviously). When I ran functions that search for or count words, however, they didn't work and showed error messages like the following:

all_words(whole.nv.df$bigram, contains = "차")
Error in gsub(K[i, 1], K[i, 2], x, fixed = fixed, ...) :
  invalid multibyte string at '<9d><84>jko'

Does this mean that qdap doesn't support analysis of text written in languages other than English? Or is there another way to perform analyses on non-English text data using qdap? Please give me your advice. I have to complete an article manuscript in a short period of time, so I need your help desperately. Thank you in advance.

trinker commented 9 years ago

qdap is designed for English. That being said, many functions, including word counts, work on languages other than English. However, R is limited in its handling of encodings, in that special characters aren't recognized or throw errors. Since Korean text is made up of multibyte characters, many (maybe even all) of them are treated as special characters. You'd have to fix the encodings to make it work with qdap. I don't know Korean and can't give specifics, but I think qdap, and maybe even R, are not the right tool(s) for the job.
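A minimal sketch, not from the thread, of the kind of encoding fix meant here, assuming the Korean text was saved from Windows in the CP949/EUC-KR ("ANSI") codepage; the file name is made up:

# hypothetical file saved with the Korean Windows "ANSI" codepage (CP949)
txt      <- readLines("korean_sample.txt", warn = FALSE)
txt_utf8 <- iconv(txt, from = "CP949", to = "UTF-8")  # convert to UTF-8
Encoding(txt_utf8)  # non-ASCII strings should now be marked "UTF-8"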


cognitivepsychology commented 9 years ago

Thank you for your answer. I encoded my text data as ANSI, and R did a perfect job of trimming my data when performing other basic functions. Even the tm package seems to read it quite well. Should I encode my data in another form that qdap can read? If so, what kind of encoding should I choose?

trinker commented 9 years ago
  1. My strong advice is that qdap is not the appropriate tool for this task, particularly if you are under a deadline. It was designed for English text; the tm package, on the other hand, was designed for general language use. Without familiarity with both qdap and Korean, it is difficult to know whether the results are valid when the package is used on another language. I, for one, am unfamiliar with the structure of Korean, so I can't say whether qdap will give correct results. Granted, things like word counts should work on other languages, as I've used them on Spanish and French, provided you've cleaned everything properly. There is an entire vignette dedicated to cleaning...
  2. You haven't provided any data that would let me reproduce the problem. A minimal working example is expected any time you ask for help with an error. It is incumbent upon you to make the problem reproducible; otherwise I'm guessing at what is causing it. One of qdap's vignettes describes how to do this in detail: http://cran.r-project.org/web/packages/qdap/vignettes/cleaning_and_debugging.pdf. Please read and follow this guide so that questions can yield results.

If you provide a minimal working example I can likely help you overcome the error, but I am still unsure whether the results will be valid.

Here is an example of how gsub, a base R function used heavily in text mining, fails on a simple Korean string:

gsub("국", "<>", "한국어/조선말", fixed=TRUE)

[1] "<><><>/<><><>"

Translating into English may be a viable option but you'd certainly lose some of the structures in the translation.


cognitivepsychology commented 9 years ago

First, I am sorry for not providing any data with which you could reproduce my problem and suggest solutions. My data consists of 3 character variables named word1, word2, and bigram. Basically, they are tokenized and tagged Korean words and bigrams (the bigram in a row is the combination of word1 and word2). I'd like to compute the frequency of each element of word1, word2, and bigram, and end up with 3 columns giving the frequency of each element's occurrences. I tried to compute these frequencies using the tm package, but it had great difficulty with my data and threw an error message complaining about needing more memory or something. I don't think memory is the real cause of the problem, because my data is very light and simple. It is composed of three columns of simple word sets, as follows (the real data has approximately 25,000 rows like this in it, and you can download a small sample of my R data file at http://1drv.ms/1eFMfSV):

   word1            word2          bigram
1  차NNG을JKO       타VV           차NNG을JKO 타VV
2  차NNG을JKO       타VV           차NNG을JKO 타VV
3  도박NNG을JKO     하VV           도박NNG을JKO 하VV
4  신문NNG을JKO     보VV           신문NNG을JKO 보VV
5  노력NNG을JKO     하VV           노력NNG을JKO 하VV
6  신문NNG을JKO     보VV           신문NNG을JKO 보VV
7  졸음끼NNG을JKO   느끼VV         졸음끼NNG을JKO 느끼VV
8  시작NNG을JKO     하VV           시작NNG을JKO 하VV
9  자리NNG을JKO     차지NNG하XSV   자리NNG을JKO 차지NNG하XSV
10 척NNB을JKO       하VV           척NNB을JKO 하VV
11 차NNG을JKO       타VV           차NNG을JKO 타VV
12 차NNG을JKO       타VV           차NNG을JKO 타VV
13 운전NNG을JKO     하VV           운전NNG을JKO 하VV
14 신문NNG을JKO     보VV           신문NNG을JKO 보VV
15 글자NNG을JKO     보VV           글자NNG을JKO 보VV
16 월드컵NNP을JKO   앞두VV         월드컵NNP을JKO 앞두VV
17 문화NNG을JKO     개선NNG하XSV   문화NNG을JKO 개선NNG하XSV
18 비교NNG을JKO     하VV           비교NNG을JKO 하VV
19 시간NNG을JKO     지키VV         시간NNG을JKO 지키VV
20 학NNG을JKO       때VV           학NNG을JKO 때VV

I know it might be wiser to find other ways to analyze it, and I have actually already found other not-elegant-but-beggars-can't-be-choosers solutions (several pattern-matching base R functions and packages were needed, and they worked well in a Korean-language environment). However, your package seems to provide many very promising and beautiful functions, so I'd like to find a way to use it somehow if possible. Thank you very much for your time. I look forward to your advice.
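A minimal sketch of the frequency computation described above, using only base R and sidestepping qdap's English-specific processing; it assumes the sample.qdap data frame shared later in this thread, and the *_freq column names are invented:

# count how often each value occurs, then attach the counts as new columns
for (v in c("word1", "word2", "bigram")) {
  counts <- table(sample.qdap[[v]])
  sample.qdap[[paste0(v, "_freq")]] <-
    as.integer(counts[as.character(sample.qdap[[v]])])
}
head(sample.qdap)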

trinker commented 9 years ago

Can you use dput to share the data? I can't read this easily into R.


trinker commented 9 years ago

Put the data directly here using markdown markup. Your data is distorted there because of special characters; what I see on my end is garbled.
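A small sketch of what is being asked for (sample.qdap is the poster's object name): run dput() locally and paste its printed output into the comment between triple-backtick fences, so the Korean characters survive as plain text.

# the printed output can be copied verbatim into a fenced code block
dput(head(sample.qdap, 5))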

cognitivepsychology commented 9 years ago

I'm sorry for the trouble. I thought the Korean characters would be preserved if they were saved as a text file, but I was wrong. Please see a small sample of my data below:

 dput(sample.qdap)
structure(list(word1 = structure(c(13L, 13L, 3L, 8L, 2L, 8L, 
12L, 7L, 11L, 14L, 13L, 13L, 9L, 8L, 1L, 10L, 4L, 5L, 6L, 15L
), .Label = c("글자NNG을JKO", "노력NNG을JKO", "도박NNG을JKO", 
"문화NNG을JKO", "비교NNG을JKO", "시간NNG을JKO", "시작NNG을JKO", 
"신문NNG을JKO", "운전NNG을JKO", "월드컵NNP을JKO", "자리NNG을JKO", 
"졸음끼NNG을JKO", "차NNG을JKO", "척NNB을JKO", "학NNG을JKO"), class = "factor"), 
    word2 = structure(c(8L, 8L, 9L, 4L, 9L, 4L, 2L, 9L, 7L, 9L, 
    8L, 8L, 9L, 4L, 4L, 5L, 1L, 9L, 6L, 3L), .Label = c("개선NNG하XSV", 
    "느끼VV", "때VV", "보VV", "앞두VV", "지키VV", "차지NNG하XSV", 
    "타VV", "하VV"), class = "factor"), bigram = structure(c(13L, 
    13L, 3L, 8L, 2L, 8L, 12L, 7L, 11L, 14L, 13L, 13L, 9L, 8L, 
    1L, 10L, 4L, 5L, 6L, 15L), .Label = c("글자NNG을JKO 보VV.", 
    "노력NNG을JKO 하VV.", "도박NNG을JKO 하VV.", "문화NNG을JKO 개선NNG하XSV", 
    "비교NNG을JKO 하VV.", "시간NNG을JKO 지키VV.", "시작NNG을JKO 하VV.", 
    "신문NNG을JKO 보VV.", "운전NNG을JKO 하VV.", "월드컵NNP을JKO 앞두VV.", 
    "자리NNG을JKO 차지NNG하XSV", "졸음끼NNG을JKO 느끼VV.", "차NNG을JKO 타VV.", 
    "척NNB을JKO 하VV.", "학NNG을JKO 때VV."), class = "factor")), .Names = c("word1", 
"word2", "bigram"), class = "data.frame", row.names = c(NA, -20L
))