Closed: pablobarbera closed this issue 7 years ago.
If you've got Java 8 installed, which I guess you might if you're using lexicoder, then you could use jca while this gets fixed.
@pablobarbera This feature is not live yet, but there are workarounds using "phrasetotoken". But lots of demand for this exists so we need to add it asap.
@koheiw let's talk about how the newest selectFeatures mods could be used for this purpose, prior to dfm construction.
Joining tokens before applying dfm is the way. It is not automated, but I am sure that you can extract multi-part words easily from the dictionary.
seqs <- list(c('should', 'work'), c('united', 'states'))   # multi-word sequences to join into single tokens
toks2 <- joinTokens(toks, seqs, "_")                        # toks: a tokenized texts object; "_" is the concatenator
@koheiw I was working today on a foundation to get this issue resolved, in the dev_multiWordDictionaries branch. Please see the changes I made to joinTokens().
I also added a phrasetotoken method with a signature for tokenizedTexts and a vector/list of phrases; the function I added for this (bottom of phrases.R) currently calls joinTokens(). There is a case_insensitive argument, which I added, but it does not work in your C++ code that performs the substitutions. Can you look at the code and think about how to make this work? It translates nicely to the stringi match functions but does not work in your code.
I also added regexToFixed(), which converts the regex patterns to the fixed elements found in the set of types, so that fixed matching can be used. I took it out of joinTokens() so that other code could use it too.
Please work directly on the dev_multiWordDictionaries branch.
I fixed many of the problems. The key is how the code now in grid_sequence (regexToFixed) is used: joinTokens() has to generate all possible patterns for case-insensitive concatenation. There were some issues in C++ as well, but I fixed them.
Is there a written statement of the semantics of multiword pattern matching someplace? I'd be happy to help on this branch, but my experience from jca is that a) this is trickier than it looks, and b) treating multiword matches as single tokens makes various marginal statistics subtly inconsistent.
Hello! Any idea when this problem might be solved? I would like to use multiword dictionaries in text analyses. Joining tokens does not seem to be the right way here because of various issues (e.g. tampering with the word counts).
Not yet, but I'd say by mid-September!
Great news! Thanks a lot!
I added new functionality to implement dictionaries with multi-word keys, via a new method applyDictionary.tokenizedTexts(). In master as of v0.9.8.8. See the examples in test_dictionary_multiword.R.
From commit https://github.com/kbenoit/quanteda/commit/d76972a3adf95fe9bbe969756796f234d2e7d331 (but includes loads of updates from merging in the master, since that branch had gotten quite stale).
It includes a double match for a sports and a country category for the phrase "Manchester United States" - @conjugateprior is this sort of issue what you meant by defining the semantics of multi-word matching?
Still need to implement pattern matches, as well as a method for the hashed tokenizedTexts object class, but these will not be difficult.
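For illustration, a minimal sketch of this kind of multi-word lookup, written against the applyDictionary() call signature used in the benchmarks later in this thread (the text and dictionary here are made up):
toks <- tokenize("It's Arsenal versus Manchester United, states the announcer.")
dict <- dictionary(list(team = c("arsenal", "manchester united"),
                        Countries = "united states"))
applyDictionary(toks, dict, valuetype = "fixed")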
@koheiw Happy to compare speeds with the C++ versions - what I wrote today in R is pretty smokin' fast and in fact, much faster even for single-word dictionary keys than creating a dfm and then applying the dictionary.
It is great that @ken can implement join tokens in R. My question has been: how much faster or slower are C++ implementations compared to R's vectorized operations? So I tested it on this occasion.
It seems that an unordered_set with string keys is much slower than R, but it is a bit faster with integer keys.
Unit: microseconds
expr min lq mean median uq max neval
set_std(toks, c("a", "c", "d")) 1417.239 1500.1485 1579.4765 1534.8090 1573.5245 3682.633 1000
set_std_num(toks_hash, c(1, 3, 4)) 351.644 377.7445 569.2130 396.4585 413.8055 62986.509 1000
set_r(toks, c("a", "c", "d")) 583.646 617.1745 868.3172 637.3545 670.2990 3576.935 1000
set_r(toks_hash, c(1, 3, 4)) 560.589 589.0605 881.4517 611.7505 638.6125 60450.674 1000
I hope this will help us decide how to proceed in quanteda development. The benchmark code is in test_unordered_set.cpp in the dev_hashing branch.
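For reference, a hedged guess at what the R side of such a comparison looks like - vectorized set membership with %in% over character versus integer-coded tokens (the toks object here is simulated, and set_r in the table above may be implemented differently):
library(microbenchmark)
toks <- sample(letters[1:5], 10000, replace = TRUE)   # simulated character tokens
toks_hash <- match(toks, letters[1:5])                # integer-coded ("hashed") tokens
microbenchmark(
    chr = toks %in% c("a", "c", "d"),
    int = toks_hash %in% c(1, 3, 4)
)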
applyDictionary.tokens() is working, but it behaves a bit differently from its tokenizedTexts version. The difference appeared in the test around line 150.
d3 = "It's Arsenal versus Manchester United, states the announcer."
applyDictionary.tokens(): [1] "team" "team" "Countries"
applyDictionary.tokenizedTexts(): [1] "Countries" "team" "team"
Since a tokens object is an ordered list of tokens, the first result is right.
You're talking here about PR #288 I think - please change the name of your function applyDictionary() in PR #288 to something like applyDictionary_cpp, and add some tests to compare it to the R-based applyDictionary() that is already in the master. These can be benchmarks but should also be testthat tests, similar to those that already exist between applyDictionary.tokenizedTexts() and applyDictionary.tokens(). It's ok to use setequal() or some variant to make the order irrelevant.
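A minimal sketch of such an order-insensitive testthat comparison, assuming the C++ version is renamed to applyDictionary_cpp as suggested (that name, and the toy toks and dict, are placeholders here):
library(testthat)
toks <- tokenize("The United States is a country.")
dict <- dictionary(list(country = "united states"))
test_that("R and C++ applyDictionary() find the same matches", {
    # applyDictionary_cpp() is the suggested (not yet existing) name for the C++ version
    expect_true(setequal(unlist(applyDictionary(toks, dict, valuetype = "fixed")),
                         unlist(applyDictionary_cpp(toks, dict, valuetype = "fixed"))))
})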
Does applyDictionary.tokens() in master support regex and glob?
No, not yet, but I can get it to work using your function grid_sequence and then matching on the fixed forms. This was easier in selectFeatures.tokens(), which I wrote yesterday, since it just required matching the selection features to the types, but it has to work differently (I think) for applyDictionary().
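As a rough sketch of the underlying idea - expand a glob (or regex) pattern into the fixed types it matches, then do fixed matching on those; grid_sequence itself is internal, so this uses base R and stringi equivalents:
library(stringi)
types <- c("law", "laws", "lawful", "constitution", "liberty")   # token types
pattern <- "law*"                               # a glob-style dictionary value
rx <- utils::glob2rx(pattern)                   # convert glob to an anchored regex
types[stri_detect_regex(types, rx, case_insensitive = TRUE)]
## "law" "laws" "lawful"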
This is issue #292.
@koheiw if you want to address #292 let me know and I will make some suggestions. Whatever you do, it should be on a new branch from master, only for this issue!
I am on dev_multi_key_dict2, and I have merged master into it.
Yes, but if you want to fix something currently in master, you need to branch, fix, and issue a PR for that issue. The idea of a branch is to contain a specific fix, not to be the area where you work. Until I can review PR #288, please leave dev_multi_key_dict2 alone.
I am comparing different implementations of applyDictionary, and found that the one for tokenizedTexts is strange for two reasons: (1) it does not keep the original order of the tokens, and (2) categories not found in the tokens do not appear in the dfm as all-zero features.
I want to branch off master, but I cannot because my code is not in master yet. Also, I am not writing a fix for a particular issue; this is a different type of work from writing patches.
I will review your code soon, but working on spacyr now...
On the behaviour of applyDictionary.tokenizedTexts(), please file issues for the (1) order issue and (separately) for the (2) categories issue. But note on (2) that applyDictionary.tokenizedTexts() does not produce a dfm; it produces a tokenizedTexts object consisting of dictionary keys.
This time lag is the issue. I know you are a busy man, but I am working on several issues/features at once, and the lag makes my branches obsolete and incompatible with each other.
I know that applyDictionary.tokenizedTexts() returns tokens, not a dfm. I found that the 'tax' category is dropped in the test cases because the tokens/tokenizedTexts objects do not contain those words at all.
I compared the speed of different implementations. The results are kind of interesting. Since the underlying scanning function is faster in C++, cpp is fastest when texts are long. But r is fastest when they are short, since r is fully vectorized. cppi, which uses a binary dfm for indexing, is never the fastest. It was an effective technique for reducing the cost of scanning character strings, but it is no longer so.
In short, both r and cpp are very fast, taking only a fraction of a second. I do not think the difference between r and cpp is significant here. We have to test in more realistic settings, using a dictionary with hundreds of words.
> toks <- tokens(inaugCorpus)
> dict <- dictionary(list(country = c("united states", "united kingdom")))
> microbenchmark::microbenchmark(
+ r=applyDictionary(toks, dict, valuetype='fixed', verbose=FALSE),
+ cpp=applyDictionary2(toks, dict, valuetype='fixed', verbose=FALSE, indexing=FALSE),
+ cppi=applyDictionary2(toks, dict, valuetype='fixed', verbose=FALSE, indexing=TRUE)
+ )
Unit: milliseconds
expr min lq mean median uq max neval
r 10.790228 18.083893 22.97068 18.543844 19.10421 130.89164 100
cpp 6.837882 7.564767 11.04115 8.311319 15.10642 18.45838 100
cppi 15.418777 16.978413 23.83191 23.863639 24.61981 131.82470 100
> toks_short <- tokens(tokenize(inaugCorpus, what='sentence', simplify=TRUE))
> microbenchmark::microbenchmark(
+ r=applyDictionary(toks_short, dict, valuetype='fixed', verbose=FALSE),
+ cpp=applyDictionary2(toks_short, dict, valuetype='fixed', verbose=FALSE, indexing=FALSE),
+ cppi=applyDictionary2(toks_short, dict, valuetype='fixed', verbose=FALSE, indexing=TRUE)
+ )
Unit: milliseconds
expr min lq mean median uq max neval
r 18.71877 28.2335 33.73676 29.53169 31.17191 155.3387 100
cpp 114.84927 130.4477 153.71015 143.20885 161.76700 339.8274 100
cppi 152.29223 166.3599 187.41548 177.54386 191.05902 297.8989 100
Here is the test with the LIWC dictionary (10,606 entries). The value type is 'fixed' in the r version but 'glob' in the cpp versions, which makes a difference. These are taking around 20 seconds. Might we need to parallelize the process? (A rough sketch of one option follows the benchmark below.)
> microbenchmark::microbenchmark(
+ r=applyDictionary(toks, dict_liwc, valuetype='fixed', verbose=FALSE),
+ cpp=applyDictionary2(toks, dict_liwc, valuetype='glob', verbose=FALSE, indexing=FALSE),
+ cppi=applyDictionary2(toks, dict_liwc, valuetype='glob', verbose=FALSE, indexing=TRUE),
+ times=1
+ )
Unit: seconds
expr min lq mean median uq max neval
r 23.07132 23.07132 23.07132 23.07132 23.07132 23.07132 1
cpp 20.67580 20.67580 20.67580 20.67580 20.67580 20.67580 1
cppi 23.19090 23.19090 23.19090 23.19090 23.19090 23.19090 1
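On the parallelization question, a rough sketch of one option - splitting the documents into chunks and looking up the dictionary in each chunk in parallel (this assumes the tokens object can be subset with [, that per-chunk results can be recombined afterwards, and it uses mclapply, which does not parallelize on Windows):
library(parallel)
chunks <- splitIndices(length(toks), ncl = 4)   # split document indices into 4 chunks
res <- mclapply(chunks, function(i)
    applyDictionary(toks[i], dict_liwc, valuetype = "fixed", verbose = FALSE),
    mc.cores = 4)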
Very interesting, thanks. Please preserve these benchmarks in an appropriately named file or folder within benchmarks/.
I came up with a fast one-pass algorithm based on our ngram generator. Looking up 10,000 multi-key entries in SOTUCorpus takes 0.1 second! Here, old is the above-mentioned C++ function. (A rough R-level sketch of the idea follows the benchmark.)
Unit: milliseconds
                                                expr        min         lq       mean     median         uq        max neval
 qatd_cpp_lookup_int_list(toks, toks_loc, dict, 99)    689.4275   689.4275   689.4275   689.4275   689.4275   689.4275     1
qatd_cpp_lookup_int_list2(toks, toks_loc, dict, 99)    104.1731   104.1731   104.1731   104.1731   104.1731   104.1731     1
                          old(toks, toks_loc, dict)  23472.3723 23472.3723 23472.3723 23472.3723 23472.3723 23472.3723     1
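For readers following along, a rough R-level sketch of the one-pass idea (not the actual C++ implementation): at each token position, form ngrams up to the length of the longest dictionary key and look each one up in a hashed set of keys.
lookup_onepass <- function(toks, keys) {
    key_set <- new.env(hash = TRUE)                 # hashed set of dictionary keys
    for (k in keys) assign(k, TRUE, envir = key_set)
    maxlen <- max(lengths(strsplit(keys, " ")))     # longest key, in tokens
    hits <- character()
    for (i in seq_along(toks)) {                    # a single pass over positions
        for (n in seq_len(min(maxlen, length(toks) - i + 1))) {
            ng <- paste(toks[i:(i + n - 1)], collapse = " ")
            if (exists(ng, envir = key_set, inherits = FALSE))
                hits <- c(hits, ng)
        }
    }
    hits
}
lookup_onepass(c("we", "the", "people", "of", "the", "united", "states"),
               c("united states", "people"))
## "people" "united states"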
That sounds like a massive difference alright! But please put it into a new branch where I can clearly see what is new code versus artefacts from the branching.
Once dev_multi_key_dict2 is merged, I will make dev_multi_key_dict3 and put all the files into it.
Should be fixed now, >= v0.9.8.9025:
toks <- tokens(data_corpus_inaugural[1:5])
dict <- dictionary(list(country = "united states",
                        HOR = c("House of Re*"),
                        law = c("law*", "constitution"),
                        freedom = c("free*", "libert*")))
dfm(tokens_lookup(toks, dict))
## 5 x 4 sparse Matrix of class "dfmSparse"
##                  features
## docs              country congress law freedom
##   1789-Washington       2        2   1       6
##   1793-Washington       0        0   1       0
##   1797-Adams            3        0  10       6
##   1801-Jefferson        0        0   6      11
##   1805-Jefferson        1        0  13       6
The following code is a simplified version of the example in quanteda::dictionary
Another example:
Trying with "_" as concatenator:
Am I missing something? This came up as I was trying to use quanteda with Lexicoder, which has multi-word entries in the dictionary. I'm running quanteda 0.9.6-9
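The original example code did not survive here, but a minimal sketch of the kind of call at issue - dictionary values containing spaces, applied while building a dfm from single-word tokens, assuming the dfm(x, dictionary = ...) interface of that era - would look something like this:
dict <- dictionary(list(country = c("united states", "united kingdom")))
mydfm <- dfm(inaugCorpus, dictionary = dict)
# multi-word values such as "united states" are never matched here,
# because each text has already been split into single-word tokens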