schwilklab / taxon-name-utils

Code and data for plant name synonym expansion and name matching
MIT License

Strange behavior? #10

Open wcornwell opened 6 years ago

wcornwell commented 6 years ago

I've just found one strange behavior. I wrote a small R script that wraps your Python script:

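# Keep only the first `number_to_trim_to` words of each name (vectorized over x).
shorten_word_string <- function(x, number_to_trim_to = 2) {
  words <- strsplit(x, split = "\\s+")
  vapply(words, function(w) paste(head(w, number_to_trim_to), collapse = " "),
         character(1))
}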

# Count the alphabetic words in each element of str1.
get_number_words <- function(str1) {
  sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
}

get_lookup_table <- function(names_of_interest) {
  require(dplyr)
  require(readr)
  names_of_interest_binom <-
    unlist(lapply(names_of_interest, shorten_word_string))

  nw <- get_number_words(names_of_interest) 
  names_of_interest[nw>4]<-shorten_word_string(names_of_interest[nw>4],4)

  write.table(
    names_of_interest,
    col.names = FALSE,
    quote=FALSE,
    row.names=FALSE,
    "data_raw/taxon-name-utils/orig_names.txt")

  system(
    'python data_raw/taxon-name-utils/scripts/synonymize.py -b -a expand data_raw/taxon-name-utils/orig_names.txt > data_raw/taxon-name-utils/expanded_names.txt'
  )
  system(
    'python data_raw/taxon-name-utils/scripts/synonymize.py -b -a merge -c data_raw/taxon-name-utils/orig_names.txt data_raw/taxon-name-utils/expanded_names.txt > data_raw/taxon-name-utils/merged-names.txt'
  )
  syn <-
    read.csv("data_raw/taxon-name-utils/expanded_names.txt",
             col.names = "syn",stringsAsFactors = FALSE)
  old_name <-
    read.csv("data_raw/taxon-name-utils/merged-names.txt", col.names = "old_name",stringsAsFactors = FALSE)
  good_names <-
    read.csv("data_raw/taxon-name-utils/data/theplantlist1.1/names_unique.csv",stringsAsFactors = FALSE)
  good_names$binom <- paste(good_names$genus, good_names$species)
  df <- tibble(old_binom = old_name$old_name, new_binom = syn$syn)
  df_out <- filter(df, new_binom %in% good_names$binom)
  return(df_out)
}

It seems to work fine. For example:

> get_lookup_table(c("Pouteria australis","hello world","Pinus contorta"))
# A tibble: 2 x 2
  old_binom          new_binom             
  <chr>              <chr>                 
1 Pinus contorta     Pinus contorta        
2 Pouteria australis Planchonella australis

This is correct: Planchonella australis is the new name for Pouteria australis.

The exception is when the original names contain both the synonym and the correct name:

> get_lookup_table(c("Pouteria australis","hello world","Pinus contorta","Planchonella australis"))
# A tibble: 2 x 2
  old_binom              new_binom             
  <chr>                  <chr>                 
1 Pinus contorta         Pinus contorta        
2 Planchonella australis Planchonella australis

Here it loses the Pouteria australis -> Planchonella australis information.

Any idea what's going on?

dschwilk commented 6 years ago

I'll take a look. I'm a bit confused by the extra code here. How are you indicating the canonical names? It shouldn't lose any names on that list, and if it does, something is wrong.

dschwilk commented 6 years ago

OK, I see what you are showing now. It was easier to look at the raw expanded and merged lists. Yes, the expanded names function will not return expanded synonyms that are already in the canonical list. I can look in more detail at why. In general, this is what I would want, but I will take a look at synonymize.py.

I am working from imperfect memory, but I believe that behavior may be there to avoid endless loops in the sister synonym search. Maybe you can convince me why this is the wrong behavior and suggest how to fix it in synonymize.py.
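
For readers following along, the reported behavior can be reproduced with a toy model. This is only a sketch under stated assumptions: `SYNONYMS`, `ACCEPTED`, `expand`, and `lookup` are hypothetical names invented for illustration, and the skip rule is a guess at the behavior described above, not the actual logic in synonymize.py.

```python
# Toy model only: SYNONYMS and ACCEPTED are hypothetical stand-ins for the
# Plant List data, and expand() is NOT the real synonymize.py logic.
SYNONYMS = {"Pouteria australis": ["Planchonella australis"]}
ACCEPTED = {"Pinus contorta", "Planchonella australis"}


def expand(names):
    """Return (expanded_name, original_name) pairs.

    Synonyms that already appear in the input list are skipped, which is
    the assumed behavior under discussion.
    """
    present = set(names)
    pairs = []
    for name in names:
        pairs.append((name, name))
        for syn in SYNONYMS.get(name, []):
            if syn not in present:  # the skip that hides the mapping
                pairs.append((syn, name))
    return pairs


def lookup(names):
    """Keep only pairs whose expanded name is accepted, as the R wrapper does."""
    return sorted((orig, new) for new, orig in expand(names) if new in ACCEPTED)


print(lookup(["Pouteria australis", "hello world", "Pinus contorta"]))
# [('Pinus contorta', 'Pinus contorta'), ('Pouteria australis', 'Planchonella australis')]

print(lookup(["Pouteria australis", "hello world", "Pinus contorta",
              "Planchonella australis"]))
# [('Pinus contorta', 'Pinus contorta'), ('Planchonella australis', 'Planchonella australis')]
```

In the second call, the accepted name is already in the input, so the synonym is never emitted for Pouteria australis and that row drops out after filtering.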

wcornwell commented 6 years ago

Thanks for having a look.

I can see why it was done this way, but from a user's perspective it's hard to notice this happening with long lists.

If it's easier, one alternative to changing the script is to change the README instructions to:

  1. filter to names that are not on the good list
  2. expand
  3. merge

That would also solve it from a user's perspective.
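
The three steps above could be sketched roughly as follows. This is a minimal illustration, not code from the repository: `toy_expand` is a hypothetical stand-in for the expand step (assumed to skip synonyms already present in its input), and `SYNONYMS`, `ACCEPTED`, and `filter_then_expand` are made-up names.

```python
# Hypothetical stand-ins for the Plant List data; not taken from the repository.
SYNONYMS = {"Pouteria australis": ["Planchonella australis"]}
ACCEPTED = {"Pinus contorta", "Planchonella australis"}


def toy_expand(names):
    """Stand-in for the expand step: returns (expanded_name, original_name) pairs."""
    present = set(names)
    pairs = [(n, n) for n in names]
    pairs += [(s, n) for n in names for s in SYNONYMS.get(n, []) if s not in present]
    return pairs


def filter_then_expand(names, accepted, expand_fn):
    """Build (old_name, new_name) rows using the filter-first workflow."""
    # Step 1: set aside names that are already on the good list.
    already_good = [n for n in names if n in accepted]
    to_expand = [n for n in names if n not in accepted]
    # Steps 2 and 3: expand only the remaining names, then merge the results.
    table = [(n, n) for n in already_good]
    table += [(orig, new) for new, orig in expand_fn(to_expand) if new in accepted]
    return sorted(table)


names = ["Pouteria australis", "hello world", "Pinus contorta",
         "Planchonella australis"]
for old, new in filter_then_expand(names, ACCEPTED, toy_expand):
    print(old, "->", new)
```

Because the accepted names never reach the expand step, the synonym mapping for Pouteria australis survives alongside the identity rows.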

dschwilk commented 6 years ago

I won't have time to look at this more until after spring break. But perhaps what this needs is a rethink of data in/out and how this is used, perhaps producing the full lookup table in a single step. The current weird design is a historical artefact of my initial uses.