quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

Multi-word entries in dictionary appear to be ignored #188

Closed pablobarbera closed 7 years ago

pablobarbera commented 8 years ago

The following code is a simplified version of the example in quanteda::dictionary:

mycorpus <- subset(inaugCorpus, Year>1900)
mydict <- dictionary(list(country = "united states"))
sum(dfm(mycorpus, dictionary = mydict)[,"country"])

Another example:

mycorpus <- corpus("this should work")
mydict <- dictionary(list(example = "should work"))
sum(dfm(mycorpus, dictionary = mydict)[,"example"])

Trying with "_" as concatenator:

mycorpus <- corpus("this should work")
mydict <- dictionary(list(example = "should_work"), concatenator="_")
sum(dfm(mycorpus, dictionary = mydict)[,"example"])

In each case the multi-word entry appears to be ignored in the resulting counts. Am I missing something? This came up as I was trying to use quanteda with Lexicoder, which has multi-word entries in its dictionary. I'm running quanteda 0.9.6-9.


conjugateprior commented 8 years ago

If you've got Java 8 installed, which I guess you might if you're using Lexicoder, then you could use jca while this gets fixed.

kbenoit commented 8 years ago

@pablobarbera This feature is not live yet, but there are workarounds using phrasetotoken(). There is a lot of demand for this, though, so we need to add it ASAP.

@koheiw let's talk about how the newest selectFeatures mods could be used for this purpose, prior to dfm construction.

koheiw commented 8 years ago

Joining tokens before constructing the dfm is the way to do it. It is not automated, but you can easily extract the multi-word entries from the dictionary.

seqs <- list(c('should', 'work'), c('united', 'states'))
toks2 <- joinTokens(toks, seqs, "_")
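
For context, toks above would be a tokenizedTexts object (e.g. from tokenize()). Below is a minimal sketch, not quanteda code, of pulling the multi-word entries out of a dictionary and splitting them into sequences for joinTokens(); it assumes the dictionary is available as a plain named list of character vectors, here called dict_list:

vals  <- unlist(dict_list, use.names = FALSE)  # all dictionary values
multi <- vals[grepl("\\s", vals)]              # keep only the multi-word entries
seqs  <- strsplit(multi, "\\s+")               # split each entry into a token sequence
# seqs can then be passed to joinTokens(toks, seqs, "_") as above
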
kbenoit commented 8 years ago

@koheiw I was working today on a foundation to get this issue resolved, in the dev_multiWordDictionaries branch. Please see the changes I made to joinTokens().

  1. I will eventually fold this function into phrasetotoken(), with a signature for (tokenizedTexts, vector/list of phrases). The function I added for this (at the bottom of phrases.R) currently calls joinTokens().
  2. The examples in the roxygen2 header show some cases where it is not working. There is a bizarre outcome depending on which match comes first, probably because of the loop's sequential processing of the token sequences. Can you investigate this? (See the @examples.)
  3. We need a case_insensitive argument, which I added, but it does not work in your C++ code that performs the substitutions. Can you look at the code and think about how to make this work? It translates nicely to the stringi match functions but does not work in your code.
  4. Note that I moved the code into a new function, regexToFixed(), which converts the regex patterns to the fixed elements found in the set of types, so that fixed matching can be used. I took it out of joinTokens() so that other code could use it too. (A rough sketch of the idea follows this list.)
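
A rough illustrative sketch (not the actual regexToFixed() implementation) of converting regex patterns into the fixed types they match, with case-insensitivity handled by stringi:

regex_to_fixed_sketch <- function(patterns, types, case_insensitive = TRUE) {
    # keep only the types that match any of the regex patterns
    unique(unlist(lapply(patterns, function(p)
        types[stringi::stri_detect_regex(types, p, case_insensitive = case_insensitive)])))
}
# e.g. regex_to_fixed_sketch("^unit", c("United", "united", "unity", "state"))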

Please work directly on the dev_multiWordDictionaries branch.

koheiw commented 8 years ago

I fixed many of the problems. The key is how the code now in grid_sequence (regexToFixed) is used: joinTokens() has to generate all possible patterns for case-insensitive concatenation. There were some issues in the C++ as well, but I fixed them.
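
To illustrate the idea (a hypothetical sketch, not the grid_sequence code itself): for a case-insensitive match of a two-word pattern, collect the types matching each position and take all combinations.

types   <- c("United", "united", "States", "states", "kingdom")
matches <- lapply(c("united", "states"), function(p)
    types[stringi::stri_detect_fixed(types, p, case_insensitive = TRUE)])
combos  <- expand.grid(matches, stringsAsFactors = FALSE)
# combos enumerates every fixed sequence: "United States", "united States", ...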

conjugateprior commented 8 years ago

Is there a written statement of the semantics of multiword pattern matching someplace? I'd be happy to help on this branch, but my experience from jca is that a) this is trickier than it looks, and b) treating multiword matches as single tokens makes various marginal statistics subtly inconsistent.

Seneca75 commented 8 years ago

Hello! Any idea when this problem might be solved? I would like to use multiword dictionaries in text analyses. Joining tokens does not seem to be the right way here because of various issues (e.g. tampering with the word counts).

kbenoit commented 8 years ago

Not yet, but I'd say by mid-September!

Seneca75 commented 8 years ago

Great news! Thanks a lot!

kbenoit commented 7 years ago

I added new functionality to implement dictionaries with multi-word keys, via a new method applyDictionary.tokenizedTexts(). It is in master as of v0.9.8.8. See the examples in test_dictionary_multiword.R.
It comes from commit https://github.com/kbenoit/quanteda/commit/d76972a3adf95fe9bbe969756796f234d2e7d331 (but includes loads of updates from merging in master, since that branch had gotten quite stale).

It includes a double match, for both a sports and a country category, for the phrase "Manchester United States". @conjugateprior, is addressing this sort of issue what you meant by defining the semantics of multi-word matching?

Still need to implement pattern matches, as well as a method for the hashed tokenizedTexts object class, but these will not be difficult.

@koheiw Happy to compare speeds with the C++ versions - what I wrote today in R is pretty smokin' fast, and in fact much faster, even for single-word dictionary keys, than creating a dfm and then applying the dictionary.
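
A hedged sketch of the kind of case described above (the actual code in test_dictionary_multiword.R may differ, and argument defaults such as case handling are assumptions here):

toks <- tokenize("manchester united states")
dict <- dictionary(list(team = "manchester united", country = "united states"))
applyDictionary(toks, dict)
# expected: both "team" and "country" match, since "united" belongs to both phrases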

koheiw commented 7 years ago

It is great that @kbenoit can implement token joining in R.

My question has been how much faster or slower the C++ implementations are compared with R's vectorized operations, so I tested that on this occasion.

It seems that an unordered_set with string keys is much slower than R, but a bit faster than R with integer keys.

Unit: microseconds
                               expr      min        lq      mean    median        uq       max neval
    set_std(toks, c("a", "c", "d")) 1417.239 1500.1485 1579.4765 1534.8090 1573.5245  3682.633  1000
 set_std_num(toks_hash, c(1, 3, 4))  351.644  377.7445  569.2130  396.4585  413.8055 62986.509  1000
      set_r(toks, c("a", "c", "d"))  583.646  617.1745  868.3172  637.3545  670.2990  3576.935  1000
       set_r(toks_hash, c(1, 3, 4))  560.589  589.0605  881.4517  611.7505  638.6125 60450.674  1000

I hope this will help us decide how to proceed in quanteda development. The benchmark code is in test_unordered_set.cpp in the dev_hashing branch.

koheiw commented 7 years ago

applyDictionary.tokens() is working, but it behaves a bit differently from its tokenizedTexts version. The difference appeared in the test around line 150.

d3 = "It's Arsenal versus Manchester United, states the announcer."

applyDictionary.tokens(): [1] "team" "team" "Countries"

applyDictionary.tokenizedTexts(): [1] "Countries" "team" "team"

Since a tokens object is an ordered list of tokens, the first result is the right one.

kbenoit commented 7 years ago

You're talking here about PR #288, I think - please rename your function applyDictionary() in PR #288 to something like applyDictionary_cpp(), and add some tests to compare it to the R-based applyDictionary() that is already in master. These can be benchmarks, but there should also be testthat tests, similar to those that already exist comparing applyDictionary.tokenizedTexts() and applyDictionary.tokens().

It's ok to use setequal() or some variant to make the order irrelevant.
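
A hedged sketch of what such a test could look like: applyDictionary_cpp() is the proposed placeholder name, not an existing function, and the use of unlist() to extract the matched keys is an assumption about the return structure.

library(testthat)

test_that("C++ and R dictionary application produce the same matches", {
    toks <- tokenize("the united states faces manchester united today")
    dict <- dictionary(list(country = "united states", team = "manchester united"))
    keys_r   <- unlist(applyDictionary(toks, dict))       # existing R-based method
    keys_cpp <- unlist(applyDictionary_cpp(toks, dict))   # renamed PR #288 function
    # setequal() makes the order of the matched keys irrelevant
    expect_true(setequal(keys_r, keys_cpp))
})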

koheiw commented 7 years ago

Does applyDictionary.tokens() in master support regex and glob?

kbenoit commented 7 years ago

No, not yet, but I can get it to work using your function grid_sequence and then matching on the fixed values. This was easier in selectFeatures.tokens(), which I wrote yesterday, since it just required matching the selection features to the types, but it has to work differently (I think) for applyDictionary().

This is issue #292.
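
A speculative sketch (not quanteda code) of the glob part of that approach: convert glob patterns to regexes with glob2rx() and reduce them to the fixed types actually present in the tokens.

glob_to_fixed_sketch <- function(patterns, types) {
    rx <- utils::glob2rx(patterns)  # "law*" becomes "^law"
    unique(unlist(lapply(rx, function(p)
        types[stringi::stri_detect_regex(types, p, case_insensitive = TRUE)])))
}
# glob_to_fixed_sketch(c("law*", "constitution"), c("law", "laws", "lawful", "liberty"))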

kbenoit commented 7 years ago

@koheiw if you want to address #292 let me know and I will make some suggestions. Whatever you do, it should be on a new branch from master, only for this issue!

koheiw commented 7 years ago

I am on dev_multi_key_dict2, and it has master merged into it.

kbenoit commented 7 years ago

Yes, but if you want to fix something currently in master, you need to branch, fix, and issue a PR for that issue. The idea of a branch is to contain a specific fix, not to be the area where you work.

Until I can review PR #288 please leave dev_multi_key_dict2 alone.

koheiw commented 7 years ago

I am comparing different implementations of applyDictionary, and I found that the one for tokenizedTexts is strange for two reasons: (1) it does not keep the original order of the tokens, and (2) categories not found in the tokens do not appear in the dfm as all-zero columns.

koheiw commented 7 years ago

I want to branch off master, but I cannot, because my code is not in master yet.

koheiw commented 7 years ago

Also, I am not writing a fix for a particular issue. This is a different type of work from writing patches.

kbenoit commented 7 years ago

I will review your code soon, but working on spacyr now...

On the behaviour of applyDictionary.tokenizedTexts(), please file separate issues for (1) the ordering and (2) the categories problem. But note on (2) that applyDictionary.tokenizedTexts() does not produce a dfm; it produces a tokenizedTexts object consisting of dictionary keys.
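
That is, the dfm tabulation is a separate step (an illustrative call pattern, mirroring the later examples in this thread):

toks_keys <- applyDictionary(toks, dict)   # a tokenizedTexts of dictionary keys
mat <- dfm(toks_keys)                      # keys are only counted at this stage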

koheiw commented 7 years ago

This time lag is the issue. I know you are a busy man, but I am working on several issues/features at once, and the delay makes my branches obsolete and incompatible with each other.

koheiw commented 7 years ago

I know that applyDictionary.tokenizedTexts() works on tokens. I found that the 'tax' category is dropped in the test cases because the tokens or tokenizedTexts objects do not contain those words at all.

koheiw commented 7 years ago

I compared the speed of different implementations. The results are kind of interesting. Since the underlying scanning function is faster in C++, cpp is fastest when texts are long. But r is fastest when they are short, since r is fully vectorized. cppi, which uses a binary dfm for indexing, is never the fastest: it was an effective technique for reducing the cost of scanning character strings, but it no longer is.

In short, both r and cpp are very fast, taking only a fraction of a second. I do not think the difference between r and cpp is significant here. We have to test in more realistic settings, using a dictionary with hundreds of words.

> toks <- tokens(inaugCorpus)
> dict <- dictionary(list(country = c("united states", "united kingdom")))
> microbenchmark::microbenchmark(
+   r=applyDictionary(toks, dict, valuetype='fixed', verbose=FALSE),
+   cpp=applyDictionary2(toks, dict, valuetype='fixed', verbose=FALSE, indexing=FALSE),
+   cppi=applyDictionary2(toks, dict, valuetype='fixed', verbose=FALSE, indexing=TRUE)
+ )
Unit: milliseconds
 expr       min        lq     mean    median       uq       max neval
    r 10.790228 18.083893 22.97068 18.543844 19.10421 130.89164   100
  cpp  6.837882  7.564767 11.04115  8.311319 15.10642  18.45838   100
 cppi 15.418777 16.978413 23.83191 23.863639 24.61981 131.82470   100

> toks_short <- tokens(tokenize(inaugCorpus, what='sentence', simplify=TRUE))
> microbenchmark::microbenchmark(
+   r=applyDictionary(toks_short, dict, valuetype='fixed', verbose=FALSE),
+   cpp=applyDictionary2(toks_short, dict, valuetype='fixed', verbose=FALSE, indexing=FALSE),
+   cppi=applyDictionary2(toks_short, dict, valuetype='fixed', verbose=FALSE, indexing=TRUE)
+ )
Unit: milliseconds
 expr       min       lq      mean    median        uq      max neval
    r  18.71877  28.2335  33.73676  29.53169  31.17191 155.3387   100
  cpp 114.84927 130.4477 153.71015 143.20885 161.76700 339.8274   100
 cppi 152.29223 166.3599 187.41548 177.54386 191.05902 297.8989   100

koheiw commented 7 years ago

Here is the test with the LIWC dictionary (10,606 entries). The value type is 'fixed' in the r version but 'glob' in the C++ versions, which matters for the comparison. These are taking around 20 seconds. Might we need to parallelize the process?

> microbenchmark::microbenchmark(
+   r=applyDictionary(toks, dict_liwc, valuetype='fixed', verbose=FALSE),
+   cpp=applyDictionary2(toks, dict_liwc, valuetype='glob', verbose=FALSE, indexing=FALSE),
+   cppi=applyDictionary2(toks, dict_liwc, valuetype='glob', verbose=FALSE, indexing=TRUE),
+   times=1
+ )
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
    r 23.07132 23.07132 23.07132 23.07132 23.07132 23.07132     1
  cpp 20.67580 20.67580 20.67580 20.67580 20.67580 20.67580     1
 cppi 23.19090 23.19090 23.19090 23.19090 23.19090 23.19090     1

kbenoit commented 7 years ago

Very interesting, thanks. Please preserve these benchmarks into an appropriately named file or folder within benchmarks/.

Comments:

  1. In general, I prefer R versions to C++, as these are easier to modify and integrate with future changes to the package. I prefer to keep things as native as possible unless the speed differences are really great.
  2. These differences do not seem all that great.
  3. Are dictionary applications using quanteda likely to be that mission-critical in terms of speed? How many times will someone need to apply a dictionary? If it were a case of using this in a streaming application, it could be applied incrementally to new documents and the tokens concatenated (a rough sketch follows this list). But I see that as a rare application for quanteda users.
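
A very rough sketch of that incremental idea (new_texts is a hypothetical incoming batch; how the per-batch results are combined is left as a comment rather than asserted):

toks_new      <- tokens(corpus(new_texts))                            # tokenize only the new batch
toks_keys_new <- applyDictionary(toks_new, dict, valuetype = "fixed") # look up only the new documents
dfm_new       <- dfm(toks_keys_new)                                   # tabulate the new batch
# dfm_new could then be combined with previously computed results,
# e.g. by row-binding the key counts across batches
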
koheiw commented 7 years ago

I came up with a fast one-pass algorithm based on our ngram generator. Looking up 10,000 multi-key entries in STOUCorpus takes 0.1 seconds! Here 'old' is the above-mentioned C++ function; a rough sketch of the one-pass idea follows the benchmark below.

Unit: milliseconds
                                                expr        min         lq       mean     median         uq        max neval
  qatd_cpp_lookup_int_list(toks, toks_loc, dict, 99)   689.4275   689.4275   689.4275   689.4275   689.4275   689.4275     1
 qatd_cpp_lookup_int_list2(toks, toks_loc, dict, 99)   104.1731   104.1731   104.1731   104.1731   104.1731   104.1731     1
                           old(toks, toks_loc, dict) 23472.3723 23472.3723 23472.3723 23472.3723 23472.3723 23472.3723     1
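
A minimal R sketch of the one-pass idea, for illustration only (the actual qatd_cpp_lookup_int_list2 code is C++ and surely differs): hash the integer token sequences from the dictionary, then scan the tokens once, checking each position against every sequence length.

lookup_onepass <- function(toks_int, seqs_int, key_ids) {
    # toks_int: integer token IDs for one document
    # seqs_int: list of integer ID sequences from the dictionary
    # key_ids:  dictionary key index for each sequence
    tab <- new.env(hash = TRUE)
    for (i in seq_along(seqs_int))
        assign(paste(seqs_int[[i]], collapse = "-"), key_ids[i], envir = tab)
    lens <- unique(lengths(seqs_int))
    hits <- integer()
    for (j in seq_along(toks_int)) {            # single pass over the tokens
        for (n in lens) {                       # try each sequence length at position j
            if (j + n - 1 > length(toks_int)) next
            h <- paste(toks_int[j:(j + n - 1)], collapse = "-")
            if (exists(h, envir = tab, inherits = FALSE))
                hits <- c(hits, get(h, envir = tab))
        }
    }
    hits
}
# e.g. lookup_onepass(c(5L, 9L, 2L, 9L), list(c(9L, 2L)), 1L) returns 1
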
kbenoit commented 7 years ago

That sounds like a massive difference alright! But please put it into a new branch where I can clearly see what is new code versus what are artefacts from the branching.

koheiw commented 7 years ago

Once dev_multi_key_dict2 is merged, I will make dev_multi_key_dict3 and put all the files into it.

kbenoit commented 7 years ago

Should be fixed now, >= v0.9.8.9025:

toks <- tokens(data_corpus_inaugural[1:5])
dict <- dictionary(list(country = "united states",
                        HOR = c("House of Re*"),
                        law=c('law*', 'constitution'), 
                        freedom=c('free*', 'libert*')))
dfm(tokens_lookup(toks, dict))
## 5 x 4 sparse Matrix of class "dfmSparse"
##                  features
##                  features
## docs              country HOR law freedom
##   1789-Washington       2   2   1       6
##   1793-Washington       0   0   1       0
##   1797-Adams            3   0  10       6
##   1801-Jefferson        0   0   6      11
##   1805-Jefferson        1   0  13       6