Thanks for creating the issue. That is an unimplemented feature that I missed. Let me see about getting something implemented; I should be able to get it in here in a day or two. If you need a fix right away, this should be a workaround:

```r
language_detector <- sparklyr:::ml_set_param(language_detector, "coalesceSentences", FALSE)
```
Well, that was easier than I thought it was going to be. It should be fixed now. There is a new function named `nlp_set_param()` that you can use. It would look like this:

```r
language_detector <- nlp_set_param(language_detector, "coalesce_sentences", FALSE)
```
Please let me know if you have a chance to try it. Just reinstall from the master branch.
Hi @dkincaid,
I installed from the master branch and it worked as expected. There is a language detection for each sentence, as you can see below:
```
# Source: spark<?> [?? x 5]
  links                               feature       text                                                                                                                                          finished_langua… finished_senten…
  <chr>                               <chr>         <chr>                                                                                                                                         <list<character> <list<character>
1 /vufind/Record/SCAR_e2e7020723e94e… resumo_portu… Several researches demonstrate the importance of music in people's lives. However, in the case of people with hearing loss, this argument is … [38]             [38]
2 /vufind/Record/UFS-2_d19bd36f23285… resumo_portu… O presente estudo tem por objetivo analisar a influência da internet na produção acadêmica de estudantes do curso de Gestão de Turismo do Ins… [9]              [9]
3 /vufind/Record/UFS-2_4cc21adc20565… resumo_portu… A presente pesquisa objetiva a caracterização do campus da Universidade Federal de Sergipe Prof. José Aloísio de Campos , a partir da percepç… [9]              [9]
```
Just to be sure, I also tested `nlp_set_param(language_detector, "coalesce_sentences", TRUE)` and it worked as expected:
```
# Source: spark<?> [?? x 5]
  links                               feature       text                                                                                                                                          finished_langua… finished_senten…
  <chr>                               <chr>         <chr>                                                                                                                                         <list<character> <list<character>
1 /vufind/Record/SCAR_e2e7020723e94e… resumo_portu… Several researches demonstrate the importance of music in people's lives. However, in the case of people with hearing loss, this argument is … [1]              [38]
2 /vufind/Record/UFS-2_d19bd36f23285… resumo_portu… O presente estudo tem por objetivo analisar a influência da internet na produção acadêmica de estudantes do curso de Gestão de Turismo do Ins… [1]              [9]
3 /vufind/Record/UFS-2_4cc21adc20565… resumo_portu… A presente pesquisa objetiva a caracterização do campus da Universidade Federal de Sergipe Prof. José Aloísio de Campos , a partir da percepç… [1]              [9]
```
Once again, thank you very much for developing this package!
An error occurs when I try to set the threshold with the command:

```r
language_detector <- nlp_set_param(language_detector, "threshold", 0.8)
```

```
Error: java.lang.Exception: No matched method found for class com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL.setThreshold
```

The error does not occur with other parameters (`"lazy_annotator"` or `"threshold_label"`).
Dang. This is a problem I've been trying to solve from very early on. The issue here is that the `setThreshold()` method in the Scala code takes a `Float` instead of a `Double`. When `sparklyr` gets a parameter like 0.8, it maps it to a `Double`, so it can't find a `setThreshold` method that takes a `Double` argument.
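The mismatch is easy to reproduce with plain JVM reflection, which is essentially what the parameter-setting machinery does under the hood. The `Detector` class below is a hypothetical stand-in for the annotator, not the real Spark NLP code; a Scala `Float` compiles to a JVM primitive `float`, so the lookup for a `double`-typed setter fails:

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for the annotator: the setter takes a primitive
// float, mirroring LanguageDetectorDL.setThreshold(Float) in Scala.
class Detector {
    float threshold;
    void setThreshold(float t) { this.threshold = t; }
}

public class ReflectionDemo {
    public static void main(String[] args) throws Exception {
        // sparklyr maps an R numeric such as 0.8 to a JVM double and then
        // searches for a setter whose parameter type matches exactly.
        try {
            Detector.class.getDeclaredMethod("setThreshold", double.class);
            System.out.println("found setThreshold(double)");
        } catch (NoSuchMethodException e) {
            System.out.println("no setThreshold(double)"); // this branch runs
        }
        // A lookup against the actual float signature succeeds.
        Method m = Detector.class.getDeclaredMethod("setThreshold", float.class);
        System.out.println("found " + m.getName() + "(float)");
    }
}
```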
I'll take another try at seeing if I can solve this in some generic way. That is really my preferred solution, but so far it has eluded me. As a last resort I would have to create a specific `nlp_language_detector_threshold` function, and I just don't like that inconsistency for setting properties (some properties would use the generic `nlp_set_param`, but others would need an object-specific method).
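One possible shape for such a generic solution, sketched in plain Java rather than the package's actual code: search the target's methods by name instead of by exact signature, and coerce the argument when the declared parameter is `float`. All names here (`Annotator`, `setParam`) are illustrative, not real Spark NLP or sparklyr APIs:

```java
import java.lang.reflect.Method;

// Hypothetical annotator whose setter takes a primitive float.
class Annotator {
    float threshold;
    void setThreshold(float t) { this.threshold = t; }
}

public class GenericSetter {
    // Find a one-argument setter by name; widen/narrow the double argument
    // to whatever primitive type the method actually declares.
    static void setParam(Object target, String setter, double value) throws Exception {
        for (Method m : target.getClass().getDeclaredMethods()) {
            if (!m.getName().equals(setter) || m.getParameterCount() != 1) continue;
            Class<?> p = m.getParameterTypes()[0];
            if (p == double.class) { m.invoke(target, value); return; }
            if (p == float.class)  { m.invoke(target, (float) value); return; }
        }
        throw new NoSuchMethodException(setter);
    }

    public static void main(String[] args) throws Exception {
        Annotator a = new Annotator();
        setParam(a, "setThreshold", 0.8); // no exact double-typed match needed
        System.out.println(a.threshold);
    }
}
```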
Give me a few days to find a better solution.
Thanks again @dkincaid. I agree that a general function would be preferable.
There is no rush on these fixes; I appreciate any attention you can give them. Despite these small adjustments, the package is working very well!
It took a good bit of refactoring a few things, but I found a pretty nice solution. It's all been committed and pushed to master. So a reinstall of the new version (0.0.0.9015) fixes this bug.
Thank you @rodfileto for finding and reporting it!
Thank you @dkincaid for the implementation inside the function `nlp_language_detector_dl_pretrained`.

Nonetheless, there is a problem when I load the pretrained model. This had already happened before, which is why I had loaded a local model and was trying to customize it.
The error is the following:
```r
language_detector <- nlp_language_detector_dl_pretrained(sc, input_cols = "sentence", output_col = "language", threshold = 0.8)
```

```
Error: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
```
Would you mind trying the latest release when you get a chance? I made a fix to that class for the threshold parameter. I wasn't getting the same error you were, though, so I'm a little confused.
Of course I don't mind. I tried to install it, and the following error occurred:
```
> remotes::install_github("r-spark/sparknlp")
Downloading GitHub repo r-spark/sparknlp@HEAD
✓ checking for file ‘/tmp/RtmpMhJH7L/remotesfa151986508/r-spark-sparknlp-9962fe5/DESCRIPTION’
─ preparing ‘sparknlp’:
✓ checking DESCRIPTION meta-information
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘sparknlp_0.0.0.9017.tar.gz’
Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L, :
  storing paths of more than 100 bytes is not portable:
  ‘sparknlp/examples/tutorials/certification_trainings/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.Rmd’
Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L, :
  storing paths of more than 100 bytes is not portable:
  ‘sparknlp/examples/tutorials/certification_trainings/5.1_Text_classification_examples_in_SparkML_SparkNLP.Rmd’
Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L, :
  storing paths of more than 100 bytes is not portable:
  ‘sparknlp/tests/testthat/data/sentiment.parquet/.part-00000-f52ab1ca-1b8e-4b36-b52e-6041abb05345-c000.snappy.parquet.crc’
Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L, :
  storing paths of more than 100 bytes is not portable:
  ‘sparknlp/tests/testthat/data/sentiment.parquet/part-00000-f52ab1ca-1b8e-4b36-b52e-6041abb05345-c000.snappy.parquet’
Installing package into ‘/home/rfileto/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
* installing *source* package ‘sparknlp’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
Warning: S3 methods ‘nlp_deep_sentence_detector.ml_pipeline’, ‘nlp_deep_sentence_detector.spark_connection’, ‘nlp_deep_sentence_detector.tbl_spark’ were declared in NAMESPACE but not found
Error: package or namespace load failed for ‘sparknlp’ in namespaceExport(ns, exports):
  undefined exports: nlp_deep_sentence_detector
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/rfileto/R/x86_64-pc-linux-gnu-library/3.6/sparknlp’
Error: Failed to install 'sparknlp' from GitHub:
  (converted from warning) installation of package ‘/tmp/RtmpMhJH7L/filefa1596d68f6/sparknlp_0.0.0.9017.tar.gz’ had non-zero exit status
```
Oh no! That's what I get for trying to rush something late on a Friday. Totally my fault for missing a step in deploying the package. Dumb mistake on my part. Sorry about that. I just pushed the fix for it. Thanks again for all the help!
No worries @dkincaid. Thank you for the attention, especially considering it is a Friday night :)
I tested it and now it is working. The error `java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z` only occurs with Spark version 2.3.3, which I was using before.

But there is a small typo in the function argument `coalesce_sentences`: it is currently spelled `coelesce_sentences`.
I believe that with this final adjustment the function will work properly.
Thank you very much for being so thoughtful in this issue.
Goodness, good eyes on that typo. It should be fixed now too. Hopefully it's all working for you now. I really appreciate the patience.
Hi there,
Thanks very much for this package. It is very useful for text mining in a big corpus.
I loaded a local model of language detector and I would like to set coalescence as FALSE (https://nlp.johnsnowlabs.com/docs/en/annotators#languagedetectordl-language-detection-and-identiffication).
I tried to modify the object directly, but it didn't work.
Thanks in advance.