massimoaria / tall

Text Analysis for aLL
https://www.tall-app.com

Using spacyr for language processing instead of the current UD treebank #87

Open fishfree opened 3 weeks ago

fishfree commented 3 weeks ago

I found that the UD treebank models perform poorly for some languages, especially CJK languages. spaCy supports many languages and performs much better than the UD treebank models.

massimoaria commented 1 week ago

Thank you for your comment.

We have already considered using spaCy and decided to continue with Udpipe.

This is because spaCy is not native to R but requires a Python installation, which often leads to numerous errors and a lot of setup work for the user. It took me a whole day to get a properly functioning Python environment on my Mac just to be able to use spacyr.
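For context, even the minimal spacyr workflow depends on a working Python environment with spaCy installed; a rough sketch of the calls involved (the model name here is just an example):

```r
library(spacyr)

# One-time setup: installs a Python environment with spaCy.
# This is the step that tends to fail or require manual fixing.
# spacy_install()

# Start the background Python process with a downloaded language model
spacy_initialize(model = "en_core_web_sm")

# Tokenize, lemmatize, and POS-tag; returns a token-level data.frame
# broadly comparable to udpipe output
parsed <- spacy_parse("TALL is a Shiny app for text analysis.",
                      lemma = TRUE, pos = TRUE)
head(parsed)

# Shut down the background Python process when done
spacy_finalize()
```

Every session pays the cost of initializing (and later finalizing) the external Python process, which is the portability burden mentioned above.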

To improve Udpipe's performance, we plan to train updated models for the most commonly used languages. This will be done in the coming months.
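For what it's worth, udpipe already exposes model training from CoNLL-U treebanks directly in R, so retraining stays fully native; a rough sketch (file paths are placeholders, and real training needs suitably large treebanks and tuned hyperparameters):

```r
library(udpipe)

# Train a new model from CoNLL-U annotated data (paths are placeholders)
m <- udpipe_train(
  file = "chinese-updated.udpipe",            # output model file
  files_conllu_training = "zh-train.conllu",  # training treebank
  files_conllu_holdout  = "zh-dev.conllu",    # held-out data used during training
  annotation_tokenizer = "default",
  annotation_tagger    = "default",
  annotation_parser    = "default"
)

# Reload the trained model and check accuracy on a held-out treebank
udmodel <- udpipe_load_model("chinese-updated.udpipe")
metrics <- udpipe_accuracy(udmodel, "zh-test.conllu")
```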

fishfree commented 1 week ago

@massimoaria Thank you! Udpipe runs much faster than spaCy, since the former is written in C++. So the best option would be to train UDPipe on more corpora for a higher F1 score. After some exploration, I think that besides spacyr we may also need the spacy-conll package, which can parse texts into CoNLL-U format. However, the CJK language models in spaCy do not output some CoNLL-U fields, namely feats, lemma, and misc. I suspect the lack of these fields may break downstream analyses such as clustering. I also suspect that CJK languages, which do not use spaces as word separators, may cause problems for some downstream analysis tasks.

fishfree commented 1 week ago

I tried using spaCy to parse CJK languages. I am attaching the files FYI: global.zip (please change the extension to .R).

The modified lines in Server.R are as follows:

  posTagging <- eventReactive({
    input$tokPosRun
  }, {
    values$language <- sub("-.*", "", input$language_model)

    # Select the processing model based on the language
    if (input$language_model %in% c("chinese", "japanese", "korean")) {
      # Initialize the spaCy model (helper defined in global.R)
      initialize_spacy_model(input$language_model, input$model_size)
      filtered_text <- values$txt %>% filter(doc_selected)
      doc_ids <- filtered_text$doc_id

      values$dfTag <- process_text_with_spacy(filtered_text$text)
      # Add `doc_id` as a column if spaCy did not return one
      if (!"doc_id" %in% colnames(values$dfTag)) {
        values$dfTag <- cbind(doc_id = doc_ids, values$dfTag)
      }
    } else {
      ## Download and load the udpipe language model
      udmodel_lang <- loadLanguageModel(language = input$language_model)

      ## Set cores for parallel computing
      ncores <- max(1, parallel::detectCores() - 1)

      ## Register a cluster on Windows machines
      if (Sys.info()[["sysname"]] == "Windows") {
        cl <- makeCluster(ncores)
        registerDoParallel(cl)
      }

      ## Lemmatization and POS tagging
      values$dfTag <- udpipe(object = udmodel_lang,
                             x = values$txt %>% filter(doc_selected),
                             parallel.cores = ncores)

      ## Release the cluster on Windows to avoid leaking workers
      if (Sys.info()[["sysname"]] == "Windows") stopCluster(cl)
    }
    # Merge metadata from the original txt object
    values$dfTag <- values$dfTag %>%
      left_join(values$txt %>% select(-text, -text_original), by = "doc_id") %>%
      filter(!is.na(upos)) %>%
      posSel(., c("ADJ", "NOUN", "PROPN", "VERB"))
    values$dfTag <- highlight(values$dfTag)
    values$dfTag$docSelected <- TRUE
    values$menu <- 1
  })
  ## Tokenization & PoS Tagging ----

  output$optionsTokenization <- renderUI({
    list(
      selectInput(
        inputId = "language_model", label = "Select language",
        choices = names(lang_map), selected = "english",
        multiple = FALSE,
        width = "100%"
      ),
      selectInput("model_size", "Select Model Size",
                  choices = c("small" = "sm", "medium" = "md", "large" = "lg"),
                  selected = "sm")
    )
  })