ropensci / taxlist

Handling taxonomic lists
https://docs.ropensci.org/taxlist/

Linking 'taxlist' and 'taxa' #2

Closed kamapu closed 5 years ago

kamapu commented 5 years ago

Dear @zachary-foster, I'll need your support in this matter. I started with a first attempt to export an object of class taxlist to one in taxa (I'm not sure yet which one, but I'm working through your vignette from the beginning). Here is a script creating the example object and handling it:

library(taxlist)
library(taxa)

# A subset of the example in the package
plants <- subset(Easplist, TaxonName %in% c("Crabbea","Monechma","Pentodon",
                "Commiphora"), keep_parents=TRUE, keep_children=TRUE)

# A table with the accepted names (taxa does not support synonyms)
plant_table <- accepted_name(plants)
plant_table$Parent <- with(plants@taxonRelations,
        Parent[match(plant_table$TaxonConceptID, TaxonConceptID)])

# Ladies and gentlemen: taxa
plant_taxon <- taxon(
        name=taxon_name(plant_table$TaxonName),
        rank=taxon_rank(paste(plant_table$Level)),
        id=taxon_id(plant_table$TaxonConceptID),
        authority=plant_table$AuthorName)

And then I got a brain jam. My next step would be to use the function hierarchy() to establish parent-child relationships, but the example in the documentation seems to be much simpler than this case. In plant_table I included the information on parent taxa for genus and lower ranks. Can I use it in one command to establish the hierarchy in the object plant_taxon? How?

kamapu commented 5 years ago

I continued with the vignette and am, in my opinion, one step further. This may not be the most elegant way to do it, but I can construct parental lines as follows:

plant_hierarchy <- list()
plant_hierarchy[[1]] <- plant_table$TaxonConceptID
i <- 2
repeat{
    plant_hierarchy[[i]] <- plant_table$Parent[match(plant_hierarchy[[i - 1]],
                    plant_table$TaxonConceptID)]
    i <- i + 1
    if(!all(is.na(plant_hierarchy[[i - 1]]))) next else break
}
plant_hierarchy <- do.call(cbind, plant_hierarchy)[,-length(plant_hierarchy)]

I can even go further and produce a list with those hierarchies (discarding the top ranks):

plant_hierarchy <- split(plant_hierarchy, 1:nrow(plant_hierarchy))
plant_hierarchy <- lapply(plant_hierarchy, function(x) x[!is.na(x)])
plant_hierarchy <- plant_hierarchy[sapply(plant_hierarchy, function(x) length(x) > 1)]
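
The same parental lines can also be built one taxon at a time, which avoids the intermediate matrix. The following is only a minimal sketch in base R: the table toy is an invented stand-in for plant_table, the helper lineage() is hypothetical, and it assumes that the Parent column contains no cycles.

```r
# Toy stand-in for plant_table; IDs and rows are invented for illustration
toy <- data.frame(
  TaxonConceptID = c(1, 2, 3, 4),
  TaxonName = c("Acanthaceae", "Crabbea", "Crabbea velutina", "Burseraceae"),
  Parent = c(NA, 1, 2, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow the Parent column upwards from one taxon
# until a taxon without parent (NA) is reached
lineage <- function(id, tab) {
  path <- id
  repeat {
    parent <- tab$Parent[match(path[length(path)], tab$TaxonConceptID)]
    if (is.na(parent)) break
    path <- c(path, parent)
  }
  path
}

lineages <- lapply(toy$TaxonConceptID, lineage, tab = toy)
names(lineages) <- toy$TaxonName

# Keep only taxa that have at least one parent, as in the code above
lineages <- lineages[lengths(lineages) > 1]
```

Each element of lineages is then a vector of taxon identifiers running from the taxon itself up to its top-level parent.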

Now my question is: can I use those vectors (which are the taxon identifiers) as indexes in plant_taxon to construct the hierarchies? How?

Perhaps I should also include @sckott in this discussion.

zachary-foster commented 5 years ago

Hi @kamapu, I will look at this more soon, but I think we will be able to do this using parse_tax_data from taxa.

zachary-foster commented 5 years ago

Hi @kamapu, sorry for the delay. I meant parse_edge_list in the above comment. I made a rough function to convert from taxlist to taxmap. It seems to capture most of the information, but might need some work. Hopefully it works on other taxlist objects too. It needs the dev version of taxa on the eval branch:

# devtools::install_github("ropensci/taxa", ref = "eval")

library(taxlist)
library(taxa)

parse_taxlist <- function(input) {
  # Use the edge list to start building the taxmap object
  obj <- taxa:::parse_edge_list(input@taxonRelations,
                                taxon_id = "TaxonConceptID",
                                supertaxon_id = "Parent",
                                taxon_name = "TaxonConceptID",
                                taxon_rank = "Level")
  names(obj$data) <- c("relations")

  # Set taxon names and authorities, taking the first name listed per concept
  concept_data <- input@taxonNames[!duplicated(input@taxonNames$TaxonConceptID), ]
  obj$set_taxon_names(concept_data$TaxonName[match(obj$taxon_ids(), concept_data$TaxonConceptID)])
  obj$set_taxon_auths(concept_data$AuthorName[match(obj$taxon_ids(), concept_data$TaxonConceptID)])

  # Add the traits table to the object
  obj$data$traits <- input@taxonTraits
  names(obj$data$traits)[1] <- "taxon_id"

  # Add the views table to the object
  obj$data$views <- input@taxonViews

  # Add the remaining names of each concept as a synonyms table
  obj$data$synonyms <- input@taxonNames[duplicated(input@taxonNames$TaxonConceptID), c("TaxonConceptID", "TaxonName", "AuthorName")]
  names(obj$data$synonyms) <- c("taxon_id", "synonym", "synonym_authority")

  return(obj)
}

obj <- parse_taxlist(Easplist)
obj
> obj
<Taxmap>
  3887 taxa: 1. Abelmoschus esculentus, 2. Abutilon indicum, 3. Abutilon mauritianum ... 56134. Canellaceae, 56135. Penaeaceae, 56136. Marantaceae
  3887 edges: 54753->1, 54754->2, 54754->3, 54755->4, 54755->5, 54755->6, 54755->7 ... NA->56130, NA->56131, NA->56132, NA->56133, NA->56134, NA->56135, NA->56136
  4 data sets:
    relations:
      # A data.frame: 3887 x 8 (first 3 rows shown)
        taxon_id TaxonConceptID AcceptedName Basionym Parent   Level ViewID  uri
      1        1              1            1       NA  54753 species      1 <NA>
      2        2              2            2       NA  54754 species      1 <NA>
      3        3              3            3       NA  54754 species      1 <NA>
    traits:
      # A data.frame: 311 x 2 (first 3 rows shown)
        taxon_id       lf_behn_2018
      3        7       phanerophyte
      4        9       phanerophyte
      7       18 facultative_annual
    views:
      # A data.frame: 3 x 3
        ViewID                                 secundum view_bibtexkey
      1      1            African Plant Database (2012)  CJBGSANBI2012
      2      2 Taxonomic Name Resolution Service (2018)       TNRS2018
      3      3                    The Plant List (2013)        TPL2013
    synonyms:
      # A data.frame: 1509 x 3 (first 3 rows shown)
         taxon_id               synonym       synonym_authority
      2         1   Hibiscus esculentus                      L.
      5         3        Pavonia patens        (Andrews) Chiov.
      23       20 Spilanthes mauritiana (A. Rich. ex Pers.) DC.
  0 functions:

kamapu commented 5 years ago

@zachary-foster This will get funny (your logic is quite different from mine, but I would rather trust your way). Just a few comments:

  1. Could you commit the function in a fork, so we can also document its evolution?
  2. I think assigning accepted names and synonyms through the function duplicated() can be risky: the accepted name of a taxon may also appear after its synonyms in the slot taxonNames.
  3. I was wrong, you seem also to support synonyms in taxa.
  4. The final function should also be set as an S4 method (I have the impression that vegtable objects may also be converted into taxmap ones).

I could take care of 2) and 4) once the function is on GitHub.
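
The risk raised in comment 2) can be illustrated with two toy tables. This is a sketch under assumptions, not the fix that ended up in either package: all values are invented, and it assumes that AcceptedName in taxonRelations holds the usage identifier of the accepted name and that taxonNames carries a matching column, here called TaxonUsageID.

```r
# Toy tables mimicking the slots taxonNames and taxonRelations.
# Values are invented; the column TaxonUsageID is an assumption.
taxon_names <- data.frame(
  TaxonUsageID = c(10, 11, 12),
  TaxonConceptID = c(1, 1, 2),
  TaxonName = c("Hibiscus esculentus", "Abelmoschus esculentus",
                "Abutilon indicum"),
  stringsAsFactors = FALSE
)
taxon_relations <- data.frame(
  TaxonConceptID = c(1, 2),
  AcceptedName = c(11, 12)
)

# Order-dependent: the first name listed per concept is taken as accepted,
# which picks the synonym for concept 1 because it is listed first
by_order <- taxon_names$TaxonName[!duplicated(taxon_names$TaxonConceptID)]

# Order-independent: look up the usage IDs stored in AcceptedName
is_accepted <- taxon_names$TaxonUsageID %in% taxon_relations$AcceptedName
by_id <- taxon_names$TaxonName[is_accepted]
synonyms <- taxon_names[!is_accepted, ]
```

Here by_order starts with the synonym Hibiscus esculentus, while by_id starts with Abelmoschus esculentus regardless of the row order.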

zachary-foster commented 5 years ago

Could you commit the function in a fork, so we can also document its evolution?

It's on the "eval" branch of taxa. Do you mean on a fork of taxlist? I am fine with it being in either package. taxa does not have parsers for specific formats (since it's a general-use foundational package), so it might make more sense to have it in taxlist.

I was wrong, you seem also to support synonyms in taxa.

In a way, but not explicitly. taxa supports arbitrary data assigned to taxa. The synonym table is just an arbitrary table with rows assigned to taxa (all the other tables are too). So, the taxmap object does not "know" it contains synonyms, if that makes sense.

I could take care of 2) and 4) once the function is on GitHub.

Sounds good. Let me know if you need help with manipulating taxmap objects and such. You can name the tables and columns whatever you want, but each table needs a "taxon_id" column if it contains per-taxon data, otherwise the filtering functions will not work with that table.
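
As a rough illustration of the taxon_id requirement, the sketch below uses plain data frames in a named list as a stand-in for obj$data. The table contents are invented and no taxa functions are called.

```r
# Hypothetical tables standing in for the entries of obj$data
tables <- list(
  traits = data.frame(TaxonConceptID = c(7, 9),
                      lf_behn_2018 = c("phanerophyte", "phanerophyte"),
                      stringsAsFactors = FALSE),
  views = data.frame(ViewID = 1:3)
)

# Rename the per-taxon key so that it is called taxon_id
key <- names(tables$traits) == "TaxonConceptID"
names(tables$traits)[key] <- "taxon_id"

# Tables without a taxon_id column hold data that is not classified by taxon
has_taxon_id <- vapply(tables, function(x) "taxon_id" %in% names(x),
                       logical(1))
```

In this toy case has_taxon_id is TRUE for traits and FALSE for views, which matches a views table that is keyed by ViewID rather than by taxon.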

kamapu commented 5 years ago

OK. If you don't mind @zachary-foster I will just copy the code and adapt it to a function in taxlist, and will try to write the reverse function, though I guess it won't be possible to do transformations from taxlist to taxmap and back without losing information.

zachary-foster commented 5 years ago

OK. If you don't mind @zachary-foster I will just copy the code and adapt it to a function in taxlist

That is fine with me

it won't be possible to do transformations from taxlist to taxmap and back without losing information.

Perhaps not with the function I wrote, but it should be possible.

kamapu commented 5 years ago

Hello @zachary-foster, I just came back from another trip and started working on the functions, and now I have them. I adapted your code into one function that converts objects from taxlist to Taxmap and wrote a second function for the conversion back. You will then need to install taxlist from my branch:

library(devtools)
install_github("kamapu/taxlist", ref="miguel")
install_github("ropensci/taxa", ref="eval")

I included an R-image with three examples:

Test5 <- taxmap2taxlist(ex_taxmap, traits="info", reindex=TRUE)
summary(Test5)

Note that the latter works with the example from package taxa.

kamapu commented 5 years ago

Some comments on my last message:

  1. The example will only work with the branch eval of taxa. Are you planning to merge it with master?
  2. If I apply some functions to these Taxmap objects (e.g. print_tree()), I get a warning about the table views, which is missing the column taxon_id.
  3. I had the impression that the function print_tree() is not working properly at the moment. I only get names of families and NAs in the display, where I was expecting a tree.
  4. I realized that the size of Taxmap objects is much smaller than the taxlist ones, though the content is almost the same. Is this inherent to R6 or am I doing something wrong?

zachary-foster commented 5 years ago

Hi @kamapu

Cool, I hope you had a nice trip!

I included an R-image with three examples

Is the R image in the package or was it supposed to be attached to this issue? I don't see it.

The example will only work with the branch eval of taxa. Are you planning to merge it with master?

Yes, it will be merged eventually. I am hoping to get a chance to work on it soon.

If I apply some functions to these Taxmap objects (e.g. print_tree()), I get a warning about the table views, which is missing the column taxon_id.

That's OK, I think I will remove that warning eventually. It happens when you filter taxa but some of the data is not classified by taxa, so it can't be filtered.

I had the impression that the function print_tree() is not working properly at the moment. I only get names of families and NAs in the display, where I was expecting a tree.

It should be working. I have seen that error once before but have never been able to reproduce it. If I can reproduce it with the code you supplied, perhaps I can finally fix it!

I realized that the size of Taxmap objects is much smaller than the taxlist ones, though the content is almost the same. Is this inherent to R6 or am I doing something wrong?

You mean how much RAM it takes? I am not sure. It's probably a difference in how S4 and R6 store data. R6 is pretty lightweight as I understand it. I don't have much experience with S4. I will look into it when I test out the code.

kamapu commented 5 years ago

Hi @zachary-foster, here is a quick reply.

Is the R image in the package or was it supposed to be attached to this issue? I don't see it.

The image will be installed with taxlist from the miguel branch.

library(devtools)
install_github("kamapu/taxlist", ref="miguel")

Then you just need the following command to get the objects in your session:

library(taxlist)
load(file.path(path.package("taxlist"), "taxlist_examples/examples.Rda"))

Note, this image is not in master.

You mean how much RAM it takes? [...] R6 is pretty lightweight as I understand it [...]

Yes, I meant the allocated RAM. If you compare the outputs of object.size() once the objects are converted, you will wonder how big the differences are... unfair!

Regarding the warning message, I assume it is OK, even though some users get scared by warnings (they confuse them with error messages) and it can cause some trouble when checking examples.

Just to remind you, the export/import functions were requested by the editor to proceed with the submission of taxlist to ROpenSci. I assume, we may have to get it working on the respective master branches, before I request to continue with the process.

zachary-foster commented 5 years ago

Thanks for the info on finding the test data! I did not see it in your first message until now.

These functions seem to work fine for me, at least with the test data.

I cannot replicate the print_tree NA bug. They all look normal for me.

Regarding memory use, I recommend using pryr::object_size, since R6 objects are environments and object.size does not count environments. taxlist objects are actually much smaller than taxmap objects, since R6 objects store all of the functions that operate on them with the object and S4 objects don't. I would guess that both will have about the same RAM requirements for very large objects.

> object_size(Test1$data)
19.5 kB
> object_size(data1)
25.2 kB
> object_size(Test1)
1.62 MB
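
The point about environments can be checked in base R without any R6 object. The following sketch uses a plain environment holding one large vector; the names env and values are only illustrative.

```r
# An environment holding one large numeric vector (about 8 MB of doubles)
env <- new.env()
env$values <- numeric(1e6)

# object.size() reports the environment itself, not what it contains
size_env <- as.numeric(object.size(env))
size_values <- as.numeric(object.size(env$values))

# TRUE: the container appears far smaller than its own content
size_env < size_values

# With pryr installed, the content of the environment is followed as well:
# pryr::object_size(env)
```

This is why object.size() makes an environment-based object look much smaller than an S4 object holding comparable data.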

I assume, we may have to get it working on the respective master branches, before I request to continue with the process.

OK, I will try to get eval pushed to master soon.

kamapu commented 5 years ago

@zachary-foster any news in this regard?

zachary-foster commented 5 years ago

I am currently rewriting the fundamental classes of taxa as S3 vectors using the vctrs package so they act more like base R, so many of the changes in eval will unfortunately probably not be used. I will try to figure out which part of eval you need to get the conversion functions to work and merge that part.

zachary-foster commented 5 years ago

Hi @kamapu, it should work with master now.

kamapu commented 5 years ago

Great! I tested it and got no errors! Then, I will proceed to ROpenSci...

kamapu commented 4 years ago

@zachary-foster Is a release of the current version of taxa planned soon? I ask because the conversion from taxlist objects and back only works with the latest GitHub version but not with the CRAN one, so a submission to CRAN won't work at this stage.

zachary-foster commented 4 years ago

I had not planned one, but I could probably release an updated version to CRAN if you need it. I am about 70% done rewriting taxa to make the more basic classes behave like base R vectors (so they can be used as columns in tables and such), so I have not worked on the master branch for a while.

kamapu commented 4 years ago

I'll need a new version of taxa, at least concerning the compatibility between it and taxlist.

zachary-foster commented 4 years ago

New version of taxa is on CRAN now. Sorry for the delay!