ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

Add ability to recursively filter output from get_* functions #386

Open sckott opened 9 years ago

sckott commented 9 years ago

E.g.,

get_tsn("Poa")

A lot of results are given, so the user filters with a regex

# some output printed, prompt given
# user types:
ann
# which filters to strings having "ann"

or by row number(s)

# some output printed, prompt given
# user types:
1:5
# which filters to rows 1 to 5

And this could go on recursively until the user exits or ends up with only one result, at which point the id itself is returned
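For illustration, a single filtering step could look roughly like the sketch below. This is not taxize code: narrow_once() and the candidates table (including its ids) are made up for the example. It assumes that input shaped like a number or a from:to range means rows, and that anything else is a regex.

```r
# Hypothetical helper, not part of taxize: apply one filtering step to a
# data.frame of candidates. `input` is whatever the user typed at the prompt.
narrow_once <- function(df, input, column = "target") {
  if (grepl("^[0-9]+(:[0-9]+)?$", input)) {
    # a row number ("3") or a range ("1:5")
    bounds <- as.integer(strsplit(input, ":", fixed = TRUE)[[1]])
    rows <- seq(bounds[1], bounds[length(bounds)])
    rows <- rows[rows >= 1 & rows <= nrow(df)]  # drop out-of-range rows
    out <- df[rows, , drop = FALSE]
  } else {
    # anything else is treated as a regular expression
    out <- df[grepl(input, df[[column]]), , drop = FALSE]
  }
  rownames(out) <- NULL  # renumber so the next prompt starts at row 1
  out
}

# Illustrative candidates; the ids are invented for this example
candidates <- data.frame(
  tsn = c(101, 102, 103, 104),
  target = c("Poa annua", "Poa annua var. reptans",
             "Poa pratensis", "Poa trivialis"),
  stringsAsFactors = FALSE
)

step1 <- narrow_once(candidates, "ann")  # regex: keeps the "annua" rows
step2 <- narrow_once(step1, "1")         # row number: keeps the first of those
```

Repeating narrow_once() on its own output is the "recursive" part; once a single row is left, its id can be returned.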

thoughts @EDiLD @zachary-foster

zachary-foster commented 9 years ago

@sckott, I just noticed you asked for thoughts on this. I tried running get_tsn("Poa") and get_tsn('Poa', ask=TRUE, rows = NA), but just got back a single result. Did something change in the last month? I also tried get_tsn('Satyrium'), another ambiguous taxon name, and only got back a single result.

zachary-foster commented 9 years ago

Oh yeah, I forgot to share thoughts. I think it's a good idea if it does not take too much work to implement. Is it common for there to be that many homonyms for a taxon name? Or perhaps get_tsn("Poa") used to return the taxon ids for all of the species in that genus rather than the genus itself?

sckott commented 9 years ago

@zachary-foster yes, there have been some changes

First, get_tsn() now returns accepted names by default; see the accepted parameter

For the case of Poa annua using ITIS data, the API call http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=poa%20annua results in just one name that is accepted, while all others are not accepted, so only one is returned.

Second, we now check for a direct match using grep(). If the regex returns exactly one match, we return just that one; if there is more than one match, we return all of them and the user is given the prompt, etc.
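As a rough sketch of that direct-match check (not the actual taxize code): direct_match() and the example table are hypothetical, and the anchored, case-insensitive pattern is an assumption about what "direct match" means here.

```r
# Hypothetical helper, not the taxize implementation.
# Returns the single exactly-matching row if there is one; otherwise returns
# the full table so the caller can prompt the user.
direct_match <- function(df, name, column = "target") {
  # Assumption: "direct" means the whole string matches, ignoring case.
  # `name` is used as a regex, so metacharacters such as "." are not escaped.
  hits <- grep(paste0("^", name, "$"), df[[column]], ignore.case = TRUE)
  if (length(hits) == 1) df[hits, , drop = FALSE] else df
}

# Illustrative table; the ids are invented for this example
results <- data.frame(
  tsn = c(1, 2, 3),
  target = c("Poa", "Poa annua", "Poa pratensis"),
  stringsAsFactors = FALSE
)

one  <- direct_match(results, "poa")    # one exact hit, so one row comes back
many <- direct_match(results, "Poa a")  # no exact hit, so all rows come back
```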

Does that make sense?

sckott commented 9 years ago

@zachary-foster for your second comment:

Hard to say how common multiple names are. It also depends on how the data sources structure their queries on the server side: some may do a fuzzier search, and some more of a direct match search. I don't think I've tried implementing this yet, so I'm not sure how hard it would be, but worth a try?

zachary-foster commented 9 years ago

@sckott Ok, I understand now. Thanks for the explanation.

I think it's worth a try. I don't know if you meant "recursively" literally, but a while (nrow(tsn_df) > 1) {...} loop around the current user prompt code seems like it would work. In the case of get_tsn, maybe something like (untested code):

if (ask) {
  names(tsn_df)[grep(searchtype, names(tsn_df))] <- "target"
  tsn_df <- tsn_df[order(tsn_df$target), ]
  rownames(tsn_df) <- seq_len(nrow(tsn_df))
  # keep prompting until the table is narrowed down to a single row
  while (nrow(tsn_df) > 1) {
    message("\n\n")
    print(tsn_df)
    message("\nMore than one TSN found for taxon '", 
            x, "'!\n\n            Enter rownumber of taxon (other inputs will return 'NA'):\n")
    take <- scan(n = 1, quiet = TRUE, what = "raw")
    if (length(take) == 0) {
      take <- "notake"
      att <- "nothing chosen"
    }
    if (take %in% seq_len(nrow(tsn_df))) {
      # a row number: take that row and stop prompting
      take <- as.numeric(take)
      message("Input accepted, took taxon '", as.character(tsn_df$target[take]), 
              "'.\n")
      tsn <- tsn_df$tsn[take]
      att <- "found"
      break
    } else if (any(grepl(take, tsn_df$target))) {
      # a regex: keep the matching rows, renumber them, and prompt again
      tsn_df <- tsn_df[grepl(take, tsn_df$target), ]
      rownames(tsn_df) <- seq_len(nrow(tsn_df))
      tsn <- tsn_df$tsn
      att <- "found"
    } else {
      # no usable input: give up, otherwise the loop would never end
      tsn <- NA
      mssg(verbose, "\nReturned 'NA'!\n\n")
      att <- "not found"
      break
    }
  }
} else {
  tsn <- NA
  att <- "NA due to ask=FALSE"
}

If you are worried about the possibility of infinite loops caused by while, maybe a for (i in seq_len(max_prompts)) with an if (nrow(tsn_df) == 1) break.
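A minimal sketch of that bounded version (untested against taxize itself): prompt_until_one(), scripted(), and the example table are all hypothetical, and read_input stands in for the scan() call so the example can run non-interactively.

```r
# Hypothetical sketch of a capped prompt loop; not taxize code.
# `read_input` is a function returning the next thing the user "typed".
prompt_until_one <- function(df, read_input, max_prompts = 5,
                             column = "target") {
  for (i in seq_len(max_prompts)) {
    if (nrow(df) <= 1) break          # narrowed down far enough
    pattern <- read_input()
    keep <- grepl(pattern, df[[column]])
    if (!any(keep)) break             # nothing matched: stop, keep what we have
    df <- df[keep, , drop = FALSE]
  }
  df
}

# Stand-in for interactive input: hands out pre-written answers one at a time
scripted <- function(answers) {
  i <- 0
  function() {
    i <<- i + 1
    answers[[i]]
  }
}

# Illustrative table; the ids are invented for this example
tsn_df <- data.frame(
  tsn = c(101, 102, 103),
  target = c("Poa annua", "Poa annua var. reptans", "Poa pratensis"),
  stringsAsFactors = FALSE
)

full   <- prompt_until_one(tsn_df, scripted(c("Poa", "ann", "var")))
capped <- prompt_until_one(tsn_df, scripted(c("Poa", "ann", "var")),
                           max_prompts = 2)
```

With the default cap the three answers narrow the table to one row; with max_prompts = 2 the loop stops early and two rows are left, so the caller would still need to decide what to return in that case.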

sckott commented 9 years ago

@zachary-foster Right, while loop seems appropriate