statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

`cnlp_annotate()` generates runtime error with coreNLP backend #84

Closed: joshpersi closed this issue 1 month ago

joshpersi commented 4 months ago

Hello,

I'm very excited to use the cleanNLP package with the coreNLP backend. I'm a first-time user of both, and I get a runtime error when I call `cnlp_annotate()` with the coreNLP backend. Other backends do not produce this error.

Here is the script I am using to test cleanNLP. Is there anything obvious I'm doing wrong?

# Load required packages
library(reticulate)
library(cleanNLP)

# Install Miniconda, if required
# install_miniconda(force = TRUE)

# Check that the Miniconda path is set appropriately. My path is:
# C:/Users/persij/AppData/Local/r-miniconda
miniconda_path()

# Install the stanfordnlp package, which is required for cnlp_download_corenlp(),
# and the cleannlp package, which is required for cnlp_init_corenlp(). Set
# pip = TRUE since these packages aren't available through conda.
conda_install(packages = c("stanfordnlp", "cleannlp"), pip = TRUE)

# Download the coreNLP model files
cnlp_download_corenlp(lang = "en")
# Produces output like the following: 

# Using the default treebank "en_ewt" for language "en".
# Would you like to download the models for: en_ewt now? (Y/n)
# 
# Default download directory: C:\Users\persij\stanfordnlp_resources
# Hit enter to continue or type an alternate directory.
# 
# Downloading models for: en_ewt
# Download location: C:\Users\persij\stanfordnlp_resources\en_ewt_models.zip
# 100%|██████████| 235M/235M [01:58<00:00, 1.98MB/s] 
# 
# Download complete.  Models saved to: C:\Users\persij\stanfordnlp_resources\en_ewt_models.zip
# Extracting models file for: en_ewt
# Cleaning up...Done.

# Initialize the coreNLP backend. Produces no output:
cnlp_init_corenlp()

# Fails here, generating the following output:
annotation <- cnlp_annotate(input = c(
  "Here is the first text. It is short.",
  "Here's the second. It is short too!",
  "The third text is the shortest."
))
# Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
#   RuntimeError: masked_fill_ only supports boolean masks, but got mask with dtype unsigned char
# Run `reticulate::py_last_error()` for details.

Here is the output from reticulate::py_last_error():

> reticulate::py_last_error()

── Python Exception Message ───────────────────────────────────────────────────────────────────────────────────
Traceback (most recent call last):
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\cleannlp\corenlp.py", line 50, in parseDocument
    doc = self.nlp(text)
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\stanfordnlp\pipeline\core.py", line 176, in __call__
    self.process(doc)
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\stanfordnlp\pipeline\core.py", line 170, in process
    self.processors[processor_name].process(doc)
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\stanfordnlp\pipeline\depparse_processor.py", line 30, in process
    preds += self.trainer.predict(b)
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\stanfordnlp\models\depparse\trainer.py", line 72, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, lemma, head, deprel, word_orig_idx, sentlens, wordlens)
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\persij\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\stanfordnlp\models\depparse\model.py", line 157, in forward
    unlabeled_scores.masked_fill_(diag, -float('inf'))
RuntimeError: masked_fill_ only supports boolean masks, but got mask with dtype unsigned char

── R Traceback ────────────────────────────────────────────────────────────────────────────────────────────────
    ▆
 1. └─cleanNLP::cnlp_annotate(...)
 2.   └─cleanNLP:::annotate_with_corenlp(input, verbose)
 3.     └─volatiles$corenlp$obj$parseDocument(x, doc_id)
 4.       └─reticulate:::py_call_impl(callable, call_args$unnamed, call_args$named)

And here is the session info:

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] English_United States.1252

time zone: America/Vancouver
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] cleanNLP_3.0.7    reticulate_1.35.0

loaded via a namespace (and not attached):
 [1] compiler_4.3.2 Matrix_1.6-5   cli_3.6.2      tools_4.3.2    yaml_2.3.8     Rcpp_1.0.12    stringi_1.8.3 
 [8] grid_4.3.2     jsonlite_1.8.8 rlang_1.1.3    renv_1.0.5     png_0.1-8      lattice_0.22-5
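
The session info above only covers the R side. Since the traceback ends inside torch, the Python package versions are probably relevant too (my guess is that the installed torch is newer than what stanfordnlp expects, but I have not confirmed that). Below is a minimal sketch for listing them, assuming it is run with the same Python interpreter that reticulate uses, for example via `reticulate::py_run_file()`. `installed_versions` is a helper name made up here for illustration; it is not part of cleanNLP or stanfordnlp.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed in the current environment map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The three Python packages that appear in the traceback above.
for name, version in installed_versions(["torch", "stanfordnlp", "cleannlp"]).items():
    print(f"{name}: {version or 'not installed'}")
```

This only reads package metadata, so it does not import torch or load any models.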