Open fishfree opened 3 weeks ago
Thank you for your comment.
We have already considered using spaCy and decided to continue with Udpipe.
This is because spaCy is not native in R but requires a Python installation, which often leads to numerous errors and requires a lot of work on the part of the user. It took me a whole day to get a properly functioning Python environment on my Mac to be able to use spacyr.
To improve Udpipe's performance, we plan to train updated models for the most commonly used languages. This will be done in the coming months.
@massimoaria Thank you! UDPipe runs much faster than spaCy because it is written in C++, so the best option is probably to train UDPipe on more corpora to reach a higher F1. After some exploring, I think that besides spacyr we may also need the spacy-conll package, which parses text into CoNLL-U format. However, the CJK language models in spaCy do not output some CoNLL-U fields, i.e. feats / lemma / misc. I suspect that the lack of these fields may cause problems for downstream analyses such as clustering. I also suspect that CJK text, which has no spaces as word separators, may cause problems for some downstream analysis tasks.
I tried using spaCy to parse CJK languages and can attach the files for reference: global.zip (please change the extension to .R).
The modified lines in Server.R are as follows:
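To make the missing-field concern concrete, here is a minimal base-R sketch of how a spaCy-derived token table could be padded so that it has the same columns a UDPipe result would. The helper name `fill_conllu_fields` and the toy data are hypothetical, and the choice to fall back to the surface token when no lemma is available is only an assumption about what downstream steps would tolerate.

```r
# Hypothetical helper (not part of spacyr or udpipe): make sure a token table
# has the CoNLL-U style columns that later steps expect.
fill_conllu_fields <- function(tokens,
                               required = c("lemma", "upos", "feats", "misc")) {
  for (col in required) {
    if (!col %in% names(tokens)) {
      # Absent field: add it as an all-NA character column
      tokens[[col]] <- rep(NA_character_, nrow(tokens))
    }
  }
  # Assumption: when no lemma is available, use the surface form instead
  missing_lemma <- is.na(tokens$lemma) | tokens$lemma == ""
  tokens$lemma[missing_lemma] <- tokens$token[missing_lemma]
  tokens
}

# Toy input with no lemma / feats / misc columns
toy <- data.frame(
  doc_id = c("doc1", "doc1"),
  token  = c("Tokyo", "go"),
  upos   = c("PROPN", "VERB"),
  stringsAsFactors = FALSE
)
padded <- fill_conllu_fields(toy)
```

This only keeps the column layout consistent; it does not recover the morphological information that the CJK models leave out.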
posTagging <- eventReactive({
  input$tokPosRun
},{
  values$language <- sub("-.*","",input$language_model)
  # Select processing model based on language
  if (input$language_model %in% c("chinese", "japanese", "korean")) {
    # Initializing the spaCy model
    initialize_spacy_model(input$language_model, input$model_size)
    filtered_text <- values$txt %>% filter(doc_selected)
    doc_ids <- filtered_text$doc_id
    values$dfTag <- process_text_with_spacy(filtered_text$text)
    # Add `doc_ids` as a column in `values$dfTag`
    if (!"doc_id" %in% colnames(values$dfTag)) {
      values$dfTag <- cbind(doc_id = doc_ids, values$dfTag)
    }
  } else {
    ## download and load model language
    udmodel_lang <- loadLanguageModel(language = input$language_model)
    ## set cores for parallel computing
    ncores <- max(1,parallel::detectCores()-1)
    ## set cores for windows machines
    if (Sys.info()[["sysname"]]=="Windows") {
      cl <- makeCluster(ncores)
      registerDoParallel(cl)
    }
    #Lemmatization and POS Tagging
    values$dfTag <- udpipe(object=udmodel_lang, x = values$txt %>%
                             filter(doc_selected),
                           parallel.cores=ncores)
  }
  # Merge metadata from the original txt object
  values$dfTag <- values$dfTag %>%
    left_join(values$txt %>% select(-text, -text_original), by = "doc_id") %>%
    filter(!is.na(upos)) %>%
    posSel(., c("ADJ","NOUN","PROPN", "VERB"))
  values$dfTag <- highlight(values$dfTag)
  values$dfTag$docSelected <- TRUE
  values$menu <- 1
}
)
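One thing to watch in the spaCy branch: the later `left_join(..., by = "doc_id")` only works if the parsed table carries the original document IDs. If the parser returns one row per token with its own IDs, binding one ID per document with `cbind` would not line up. Below is a rough base-R sketch of remapping the IDs, assuming the parser labels documents `text1`, `text2`, ... in input order (my understanding is that spacyr does this for unnamed character vectors, but please verify). The function name `restore_doc_ids` is hypothetical.

```r
# Hypothetical helper: map parser-generated IDs ("text1", "text2", ...)
# back to the original document IDs, one row per token.
restore_doc_ids <- function(tokens, doc_ids) {
  # Assumption: the i-th input text was labelled "text<i>" by the parser
  lookup <- setNames(as.character(doc_ids),
                     paste0("text", seq_along(doc_ids)))
  restored <- unname(lookup[as.character(tokens$doc_id)])
  if (anyNA(restored)) {
    stop("Parser doc_id values do not follow the assumed text<N> pattern")
  }
  tokens$doc_id <- restored
  tokens
}

# Toy token table: the second document produced no tokens at all
toy_tokens <- data.frame(
  doc_id = c("text1", "text1", "text3"),
  token  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
remapped <- restore_doc_ids(toy_tokens, c("paper_A", "paper_B", "paper_C"))
```

Using a lookup keyed on position rather than on the IDs that actually appear keeps the mapping correct even when a document yields no tokens.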
## Tokenization & PoS Tagging ----
output$optionsTokenization <- renderUI({
  list(
    selectInput(
      inputId = 'language_model', label="Select language", choices = names(lang_map), selected = "english",
      multiple=FALSE,
      width = "100%"
    ),
    selectInput("model_size", "Select Model Size", choices = c("small" = "sm", "medium" = "md", "large" = "lg"), selected = "sm")
  )
})
I found that the UD treebank models perform poorly for some languages, especially CJK languages. spaCy supports many languages and performs much better than the UD treebank models.