r-spark / sparknlp


Setting parameters to local loaded model #9

Closed ghost closed 3 years ago

ghost commented 4 years ago

Hi there,

Thanks very much for this package. It is very useful for text mining in a large corpus.

I loaded a local language detector model and I would like to set coalesceSentences to FALSE (https://nlp.johnsnowlabs.com/docs/en/annotators#languagedetectordl-language-detection-and-identiffication).

I tried to apply a direct modification in the object but it didn't work.

language_detector <- ml_load(sc, "language_detector_model")

nlp_set_input_cols(language_detector, "sentence")

nlp_set_output_col(language_detector, "language")

language_detector$param_map$coalesceSentences <- "FALSE"

Thanks in advance.

dkincaid commented 4 years ago

Thanks for creating the issue. That is an unimplemented feature that I missed. Let me see about getting something implemented. I should be able to get it in here in a day or two. If you need a fix right away, this should be a workaround:

language_detector <- sparklyr:::ml_set_param(language_detector, "coalesceSentences", FALSE)
dkincaid commented 4 years ago

Well that was easier than I thought it was going to be. It should be fixed now. There is a new function named nlp_set_param() that you can use. It would look like this:

language_detector <- nlp_set_param(language_detector, "coalesce_sentences", FALSE)

Please let me know if you have a chance to try it. Just reinstall from the master branch.

ghost commented 4 years ago

Hi @dkincaid,

I installed from the master branch and it worked as expected. There is a language detection result for each sentence, as you can see below:

# Source: spark<?> [?? x 5]
   links                               feature       text                                                                                                                                           finished_langua… finished_senten…
   <chr>                               <chr>         <chr>                                                                                                                                          <list<character> <list<character>
 1 /vufind/Record/SCAR_e2e7020723e94e… resumo_portu… Several researches demonstrate the importance of music in people's lives. However, in the case of people with hearing loss, this argument is …             [38]             [38]
 2 /vufind/Record/UFS-2_d19bd36f23285… resumo_portu… O presente estudo tem por objetivo analisar a influência da internet na produção acadêmica de estudantes do curso de Gestão de Turismo do Ins…              [9]              [9]
 3 /vufind/Record/UFS-2_4cc21adc20565… resumo_portu… A presente pesquisa objetiva a caracterização do campus da Universidade Federal de Sergipe Prof. José Aloísio de Campos , a partir da percepç…              [9]              [9]

Just to be sure about it, I tested nlp_set_param(language_detector, "coalesce_sentences", TRUE) and it also worked as expected:

# Source: spark<?> [?? x 5]
   links                               feature       text                                                                                                                                           finished_langua… finished_senten…
   <chr>                               <chr>         <chr>                                                                                                                                          <list<character> <list<character>
 1 /vufind/Record/SCAR_e2e7020723e94e… resumo_portu… Several researches demonstrate the importance of music in people's lives. However, in the case of people with hearing loss, this argument is …              [1]             [38]
 2 /vufind/Record/UFS-2_d19bd36f23285… resumo_portu… O presente estudo tem por objetivo analisar a influência da internet na produção acadêmica de estudantes do curso de Gestão de Turismo do Ins…              [1]              [9]
 3 /vufind/Record/UFS-2_4cc21adc20565… resumo_portu… A presente pesquisa objetiva a caracterização do campus da Universidade Federal de Sergipe Prof. José Aloísio de Campos , a partir da percepç…              [1]              [9]

One more time, thank you very much for developing this package!

ghost commented 4 years ago

An error is occurring when I try to set the threshold with the command: language_detector <- nlp_set_param(language_detector, "threshold", 0.8)

Error: java.lang.Exception: No matched method found for class com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL.setThreshold

The error does not occur with other parameters ("lazy_annotator" or "threshold_label")

dkincaid commented 4 years ago

Dang. This is a problem I've been trying to solve from very early on. The issue here is that the setThreshold() method in the Scala code takes a Float instead of a Double. When sparklyr gets a parameter like 0.8 it maps it to a Double, so it can't find a method named setThreshold that takes a Double type argument.
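
For reference, this is roughly what the mismatch looks like at the sparklyr invoke layer. The sketch below is illustrative only: it assumes sparklyr's low-level invoke()/invoke_new() machinery, an active connection sc, and the loaded language_detector model, and whether the reflection matcher will accept a boxed java.lang.Float for the primitive float parameter is not guaranteed.

```r
library(sparklyr)

# Direct invocation fails: R's 0.8 is marshalled to a java.lang.Double,
# but LanguageDetectorDL.setThreshold() takes a Float, so reflection
# finds no matching method:
# invoke(spark_jobj(language_detector), "setThreshold", 0.8)
# => java.lang.Exception: No matched method found ... setThreshold

# One possible (hypothetical) workaround: box the value as a
# java.lang.Float explicitly before handing it to the method. This only
# works if sparklyr's method matcher unboxes the Float for the call.
flt <- invoke_new(sc, "java.lang.Float", 0.8)
invoke(spark_jobj(language_detector), "setThreshold", flt)
```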

I'll take another try at solving this in some generic way. That is really my preferred solution, but so far it has eluded me. As a last resort I would have to create a specific nlp_language_detector_threshold function, and I just don't like that inconsistency for setting properties (some properties would use the generic nlp_set_param while others would need an object-specific method).

Give me a few days to find a better solution.

ghost commented 4 years ago

Thanks again @dkincaid. I agree that a general function would be preferable.

There is no rush for these solutions. I appreciate any attention you can put on this.

In spite of these small adjustments, the package is working very well!

dkincaid commented 3 years ago

It took a good bit of refactoring a few things, but I found a pretty nice solution. It's all been committed and pushed to master. So a reinstall of the new version (0.0.0.9015) fixes this bug.

Thank you @rodfileto for finding and reporting it!

ghost commented 3 years ago

Thank you @dkincaid for the implementation inside the function nlp_language_detector_dl_pretrained.

Nonetheless, there is a problem when I load the pretrained model. This had already happened before, which is why I loaded a local model and tried to customize it.

The error is the following:

language_detector <- nlp_language_detector_dl_pretrained(sc, input_cols = "sentence", output_col = "language", threshold = 0.8)

Error: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
dkincaid commented 3 years ago

Would you mind trying the latest release when you get a chance? I made a fix to that class for the threshold parameter. I wasn't getting the same error you were though, so I'm a little confused.

ghost commented 3 years ago

Of course I do not mind. I tried to install and the following error occurred:

> remotes::install_github("r-spark/sparknlp")
Downloading GitHub repo r-spark/sparknlp@HEAD
✓  checking for file ‘/tmp/RtmpMhJH7L/remotesfa151986508/r-spark-sparknlp-9962fe5/DESCRIPTION’
─  preparing ‘sparknlp’:
✓  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘sparknlp_0.0.0.9017.tar.gz’
   Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L,  :
     storing paths of more than 100 bytes is not portable:
     ‘sparknlp/examples/tutorials/certification_trainings/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.Rmd’
   Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L,  :
     storing paths of more than 100 bytes is not portable:
     ‘sparknlp/examples/tutorials/certification_trainings/5.1_Text_classification_examples_in_SparkML_SparkNLP.Rmd’
   Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L,  :
     storing paths of more than 100 bytes is not portable:
     ‘sparknlp/tests/testthat/data/sentiment.parquet/.part-00000-f52ab1ca-1b8e-4b36-b52e-6041abb05345-c000.snappy.parquet.crc’
   Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L,  :
     storing paths of more than 100 bytes is not portable:
     ‘sparknlp/tests/testthat/data/sentiment.parquet/part-00000-f52ab1ca-1b8e-4b36-b52e-6041abb05345-c000.snappy.parquet’

Installing package into ‘/home/rfileto/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
* installing *source* package ‘sparknlp’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
Warning: S3 methods ‘nlp_deep_sentence_detector.ml_pipeline’, ‘nlp_deep_sentence_detector.spark_connection’, ‘nlp_deep_sentence_detector.tbl_spark’ were declared in NAMESPACE but not found
Error: package or namespace load failed for ‘sparknlp’ in namespaceExport(ns, exports):
 undefined exports: nlp_deep_sentence_detector
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/rfileto/R/x86_64-pc-linux-gnu-library/3.6/sparknlp’
Error: Failed to install 'sparknlp' from GitHub:
  (converted from warning) installation of package ‘/tmp/RtmpMhJH7L/filefa1596d68f6/sparknlp_0.0.0.9017.tar.gz’ had non-zero exit status
dkincaid commented 3 years ago

Oh no! That's what I get for trying to rush something late on a Friday. Totally my fault for missing a step in deploying the package. Dumb mistake on my part. Sorry about that. I just pushed the fix for it. Thanks again for all the help!

ghost commented 3 years ago

No worries @dkincaid. Thank you for the attention, especially considering it is a Friday night :)

I tested it and it is working now. The error java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z only occurs with Spark version 2.3.3, which I was using before.

But there is a small typo in the function argument coalesce_sentences: it is written coelesce_sentences.

I believe that with this final adjustment the function will be working properly.

Thank you very much for being so thoughtful in this issue.

dkincaid commented 3 years ago

Goodness, good eyes on that typo. It should be fixed now too. Hopefully it's all working for you now. I really appreciate your patience.