phenoscape / rphenoscape

R package to make phenotypic traits from the Phenoscape Knowledgebase available from within R.
https://rphenoscape.phenoscape.org/

Evaluate suitability of NeXML data extracting functions #172

Closed johnbradley closed 3 years ago

johnbradley commented 3 years ago

Determine if the NeXML data extracting functions would be better as additions to RNeXML.

There are several data extracting functions that process NeXML objects :

Determine if the functions processing a list of NeXML objects (pk_get_study*) can be dropped in favor of exporting the internal functions they call.

johnbradley commented 3 years ago

For now we plan on leaving these functions in rphenoscape.

Here is what I am thinking for renaming these functions:

Thoughts @hlapp ?

johnbradley commented 3 years ago

Looking at the possibility of merging the ontotrace and study functions that wrap RNeXML::get_characters into one function. From a high level, the differences are as follows.

pk_get_ontotrace_xml examples

We create a nexml object from ontotrace NeXML data:

> nex <- pk_get_ontotrace_xml(taxon = c("Ictalurus", "Ameiurus"), entity = "fin spine")

With pk_get_ontotrace(), for example, the values are 1, NA, or "1 and 0":

> mat <- head(pk_get_ontotrace(nex))
> mat[,c(1,4,5)]
                    taxa anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
1      Ameiurus brunneus                                        1                                               1
2         Ameiurus catus                                        1                                               1
3         Ameiurus melas                                       NA                                         1 and 0
4       Ameiurus natalis                                       NA                                               1
5     Ameiurus nebulosus                                        1                                               1
6 Ameiurus platycephalus                                        1                                               1

With pk_get_study_by_one(), for example, the values are "present", NA, or "":

> mat <- head(pk_get_study_by_one(nex))
> mat[,c(1,3,4)]
                    taxa anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
1      Ameiurus brunneus                                  present                                         present
2         Ameiurus catus                                  present                                         present
3         Ameiurus melas                                       NA                                                
4       Ameiurus natalis                                       NA                                         present
5     Ameiurus nebulosus                                  present                                         present
6 Ameiurus platycephalus                                  present                                         present

To me it seems like some meaning has been lost here with "1 and 0" becoming "".

pk_get_study_xml examples

We create a nexml object from study NeXML data:

> (slist <- pk_get_study_list(taxon = "Ictalurus australis", entity = "fin"))
> (nex_list <- pk_get_study_xml(slist$id))
> nex2 <- nex_list[[1]]

With pk_get_ontotrace(), for example, we end up with only one number per column, but from a larger range:

> mat <- head(pk_get_ontotrace(nex2))
> mat[,c(1,4,5)]
                    taxa Anal-fin rays, species mean count Anterior dentations of pectoral spine
1      Ameiurus brunneus                                 1                                     3
2         Ameiurus catus                                 2                                     2
3         Ameiurus melas                                 2                                     1
4       Ameiurus natalis                                 3                                     1
5     Ameiurus nebulosus                                 2                                     2
6 Ameiurus platycephalus                                 2                                     3

With pk_get_study_by_one(), for example, we end up with labels that make it easier to understand the results:

> mat <- head(pk_get_study_by_one(nex2))
Map symbols to labels...
> mat[,c(1,4,5)]
                    taxa Anterior dentations of pectoral spine Anterior distal serrae of pectoral spine
1      Ameiurus brunneus                                 large               <3 moderately sharp serrae
2         Ameiurus catus                              moderate              3-6 moderately sharp serrae
3         Ameiurus melas                                 small             absent or scarcely developed
4       Ameiurus natalis                                 small              3-6 moderately sharp serrae
5     Ameiurus nebulosus                              moderate               <3 moderately sharp serrae
6 Ameiurus platycephalus                                 large               <3 moderately sharp serrae

johnbradley commented 3 years ago

How about creating a function that supports both cases like so?

get_char_matrix <- function(nex, otus_id = TRUE, translate_symbols=FALSE) {...}

The otus_id parameter is just passed to RNeXML::get_characters. When translate_symbols is TRUE we apply the logic that translates numbers to labels.
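
To make the translation step concrete, here is a minimal base R sketch of what translate_symbols = TRUE could do. It is not the rphenoscape implementation: the helper names (symbols_to_labels, get_char_matrix_sketch) and the `states` lookup table with its character/symbol/label columns are hypothetical, and the sketch works on a plain data frame rather than a nexml object, so the call to RNeXML::get_characters is left out. The symbols and labels are taken from the study example above.

```r
# Hypothetical helper: replace state symbols with labels, one character column at a time.
# `states` is an assumed lookup table (columns: character, symbol, label),
# not a structure provided by RNeXML or rphenoscape.
symbols_to_labels <- function(mat, states, taxa_col = "taxa") {
  out <- mat
  for (ch in setdiff(names(mat), taxa_col)) {
    lookup <- states[states$character == ch, ]
    # match() yields NA for symbols without a label, so NA cells stay NA
    idx <- match(as.character(mat[[ch]]), as.character(lookup$symbol))
    out[[ch]] <- lookup$label[idx]
  }
  out
}

# Hypothetical stand-in for the proposed function, taking a data frame instead of nexml.
get_char_matrix_sketch <- function(mat, states, translate_symbols = FALSE) {
  if (translate_symbols) symbols_to_labels(mat, states) else mat
}

# Symbol matrix as in the pk_get_ontotrace(nex2) output above
mat <- data.frame(
  taxa = c("Ameiurus brunneus", "Ameiurus catus", "Ameiurus melas"),
  `Anterior dentations of pectoral spine` = c(3, 2, 1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumed symbol-to-label table for that character
states <- data.frame(
  character = "Anterior dentations of pectoral spine",
  symbol = c(1, 2, 3),
  label = c("small", "moderate", "large"),
  stringsAsFactors = FALSE
)

labelled <- get_char_matrix_sketch(mat, states, translate_symbols = TRUE)
print(labelled)
```

With translate_symbols = FALSE the matrix is returned unchanged, which keeps the symbol form available for analysis functions.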

hlapp commented 3 years ago

I think that's a good way to think about it. 0, 1, 2, etc. are symbols that would be used in a traditional character state matrix format that most analysis programs (and R packages for comparative analysis) will expect for categorical character states. Note that technically, there is no such thing as a "range" of such numbers: there are no implied numeric semantics to 1 vs. 0 or 3 other than that they signal distinct character states, much like nucleotide bases in genetic data.

Instead of translate_symbols, I'd suggest something like states_as_symbols, which should probably default to TRUE (because most analysis functions will expect symbols, not labels).

hlapp commented 3 years ago

To me it seems like some meaning has been lost here with "1 and 0" becoming "".

The problem there is that the state for this taxon is polymorphic. I suspect the code can't handle that when using labels (it would have to say "present and absent", for example). This sounds more like a bug.
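
A rough base R sketch of how a polymorphic cell could be labelled without losing information. This is only an illustration, not the actual fix: the helper name label_state is hypothetical, the " and " separator is taken from the ontotrace output above, and the mapping of 0 to "absent" is an assumption (only 1 = "present" is visible in the example output).

```r
# Hypothetical helper: translate one cell, which may hold several symbols
# joined by `sep` (a polymorphic state), into the matching labels.
label_state <- function(cell, labels, sep = " and ") {
  if (is.na(cell)) return(NA_character_)             # keep missing data missing
  symbols <- strsplit(cell, sep, fixed = TRUE)[[1]]  # "1 and 0" -> c("1", "0")
  paste(labels[symbols], collapse = sep)             # look up each symbol, rejoin
}

# Assumed symbol-to-label mapping for a presence/absence character
labels <- c("0" = "absent", "1" = "present")

# Cells as they appear in the ontotrace matrix above
cells <- c("1", NA, "1 and 0")
result <- vapply(cells, label_state, character(1), labels = labels, USE.NAMES = FALSE)
print(result)
```

Here the polymorphic cell comes out as "present and absent" rather than an empty string.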

hlapp commented 3 years ago

Instead of translate_symbols, I'd suggest something like states_as_symbols, which should probably default to TRUE (because most analysis functions will expect symbols, not labels).

Of course, states_as_labels with default FALSE would be equally suitable. Or perhaps even better, because if one sets states_as_symbols to FALSE (i.e., away from the default), it's not obvious how states would be presented instead. Whereas with states_as_labels, if one sets it to TRUE (i.e., away from the default), the name of the parameter suggests clearly what would happen.

johnbradley commented 3 years ago

Fixed by #228