kasperwelbers / corpustools

An R corpus class for tokenized texts
Could not find function "dtm_compare" #1

Closed AlexIls closed 5 years ago

AlexIls commented 6 years ago

Hey, thanks for writing the package -- it looks very promising! However, I think you might have forgotten to export the dtm_compare function, as I get an error when calling it.

Error messages:

corpustools::dtm_compare()
Error: 'dtm_compare' is not an exported object from 'namespace:corpustools'

dtm_compare()
Error in dtm_compare() : could not find function "dtm_compare"

Steps to reproduce:


# Reproducible example
library(corpustools)
tc1 <- create_tcorpus(c("Don't make me run."), doc_column = 'id', split_sentences = TRUE)
tc2 <- create_tcorpus(c("I'm full of chocolate."), doc_column = 'id', split_sentences = TRUE)

# Create DTM
dtm1 <- tc1$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)
dtm2 <- tc2$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)

# Compare DTMs
compare <- dtm_compare(dtm1, dtm2)
Error in dtm_compare(dtm1, dtm2) : could not find function "dtm_compare"

kasperwelbers commented 6 years ago

Hey, thanks for raising this issue.

Currently dtm_compare() is indeed not exported, but is called by the tCorpus compare_corpus method (tc$compare_corpus). I'm currently doing some redesigning to make this a function. I hadn't actually considered that people might want to use dtm_compare directly, but it indeed makes sense, so I'll export it in the next update.

Note that tc$preprocess actually doesn't return a DTM. Our reason for developing corpustools is to stick as much as possible to a tokenlist format that remembers the positions of tokens and allows NLP output (POS tags, dependency relations, etc.) to be contained. The preprocess method adds a column with the specified name (in this case 'feature') to the tokenlist. You can see this by running tc$tokens, which accesses the tokenlist data.table.
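
To make the difference between a tokenlist and a DTM concrete, here is a minimal base R sketch. It does not use corpustools and is not how the package is implemented: the data frame, its column names and the lowercasing step are made-up stand-ins for the tokenlist and for preprocessing. It only shows the idea that preprocessing adds a column while token positions are kept, and that a document-term matrix can be derived from that column afterwards.

```r
# Hypothetical tokenlist: one row per token, keeping document and position.
# (Illustrative data only; in corpustools the real tokenlist is tc$tokens.)
tokens <- data.frame(
  doc_id   = c("a", "a", "a", "b", "b"),
  token_id = c(1, 2, 3, 1, 2),
  token    = c("Running", "runs", "fast", "Chocolate", "runs"),
  stringsAsFactors = FALSE
)

# Stand-in for preprocessing: add a 'feature' column instead of returning a matrix.
# Lowercasing is used here as the simplest possible preprocessing step.
tokens$feature <- tolower(tokens$token)

# A document-term matrix is a separate, derived view: counts of features per document.
dtm <- table(doc = tokens$doc_id, term = tokens$feature)

print(tokens)
print(dtm)
```

The tokenlist still has one row per token with its position, which a DTM alone cannot give back.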

You can then use this column in a corpus comparison. For this, please view the documentation for ?compare_corpus or ?compare_subset (compares a subset of the corpus to the rest of the corpus).

I really need to write a vignette. The best way to see how corpustools works is currently to run ?tCorpus for the documentation hub page. For reference, you can create a dtm with tc$dtm or tc$dfm (for a quanteda document feature matrix).