traitecoevo / austraits

R package for accessing the AusTraits Plant database
https://traitecoevo.github.io/austraits/
MIT License
19 stars 1 forks source link

Remove switches in major functions #91

Closed fontikar closed 1 week ago

fontikar commented 2 weeks ago

Currently, the R package serves many different versions of the database, each with their own idiosyncrasies ✨ but this means that a lot of our functions have to behave according to the version of the loaded version.

  1. The following functions need a tear down of their switch() functions
  1. We need to include a deprecation message to notify the user that because of their version of the database is not supported by R package and to check out the documentation on backwards compatibility. Deprecation printing will be based off of check_compatability() from #93

    Some resources for deprecating functions:

  2. Currently the tests are written to test subsets of each version of the database. These will also need to be torn down accordingly and only one test will be retained for each major function.

  3. The summarise_* functions will be deprecated entirely, and won't be used with any database built by traits.build.1.0

fontikar commented 2 weeks ago

@ehwenk @dfalster

What do you think about this format for the helpfiles? I think it might be a big confusing to the user: https://lifecycle.r-lib.org/articles/stages.html But I like there is a visual badge to draw eye to details so users can learn that the function is no longer supported.

Screenshot 2024-09-05 at 1 22 33 PM

fontikar commented 2 weeks ago

We've decided this format for the messaging that prompts when using package functions with an old version of AusTraits

@dfalster proposed that we needed to offer options to either update database or package @ehwenk suggested that it needed to explicitly mention AusTraits I offered suggested wording, icons via cli

> extract_dataset(at_three, "Falser_2003")
✖ "extract_dataset" no longer supports AusTraits version 3.0.2
ℹ You can either update to a newer version of the data using `load_austraits()` OR
ℹ Install an older version of the package
ℹ See https://github.com/traitecoevo/austraits for details.
fontikar commented 1 week ago

I have hit an issue with deprecating the switch in trait_pivot_wider. trait_pivot_wider uses the $traits table, but check_compatibility checks on the$metadata table in the austraits object.

OPTION 1: Should we update the trait_pivot_ functions to now work with austraits? OR

OPTION 2: Should we add a new check_compatibility_traits function for delineating versions using only the traits table?

I'm not too familiar with the differences in col names across database versions but below is a reprex showing the col names that vary between the versions.

library(austraits)
#> Loading required package: RefManageR
#> Thanks for showing interest in `austraits`! Please consider citing this package - citation('austraits')

at_three <- load_austraits(version = "3.0.2", path = "ignore/data/austraits/")
#> Downloading AusTraits to 'ignore/data/austraits/'
#> Loading data from 'ignore/data/austraits//austraits-3.0.2.rds'
at_four <- load_austraits(version = "4.2.0", path = "ignore/data/austraits/")
#> Downloading AusTraits to 'ignore/data/austraits/'
#> Loading data from 'ignore/data/austraits//austraits-4.2.0.rds'
at_six <- load_austraits(version = "6.0.0", path = "ignore/data/austraits/")
#> Downloading AusTraits to 'ignore/data/austraits/'
#> Loading data from 'ignore/data/austraits//austraits-6.0.0.rds'

setdiff(names(at_six),names(at_three))
#> [1] "locations" "schema"    "metadata"
setdiff(names(at_six),names(at_four))
#> character(0)

Using col names I can delineate between version 6 and 3, but there are no differences on col names between version 6 and 4.

Any suggestions @ehwenk?

Path of least resistance is to replace the $traits argument with austraits object.

Thoughts @dfalster

fontikar commented 1 week ago

Following on discussions with @ehwenk, option 1 tends to confuses users when to use austraits$traits or streamlining usage by replacing input argument to take austraits object may be beneficial. This solution creates an output that is not the same as the input which may be confusing to user? (incoming list of 13 length, outgoing, 1 widened tibble). @fontikar suggest maybe tacking out the widened table to the austraits object i.e. austraits$wide_traits ?

fontikar commented 1 week ago

@ehwenk and I discussed whether we want to "close the loop" with pivotting wider? Do we want users to be able to revert back to longer after going wide? Should this format change be unidirectional?? How is this different to combined_tables?

Below is the behaviour of current trait_pivot functions

library(austraits)
#> Loading required package: RefManageR
#> Thanks for showing interest in `austraits`! Please consider citing this package - citation('austraits')

lite_three_wide <- austraits:::austraits_3.0.2_lite$traits |> 
  summarise_trait_means() |> 
  trait_pivot_wider()

lite_four_wide <- austraits:::austraits_4.2.0_lite$traits |> trait_pivot_wider()

lite_five_wide <- austraits:::austraits_5.0.0_lite$traits |> trait_pivot_wider()

# Go back to long format
back_to_long_three <- lite_three_wide |> trait_pivot_longer()
nrow(austraits:::austraits_3.0.2_lite$traits) == nrow(back_to_long_three)
#> [1] FALSE

back_to_long_four <- lite_four_wide |> trait_pivot_longer()
nrow(austraits:::austraits_4.2.0_lite$traits) == nrow(back_to_long_four)
#> [1] TRUE

back_to_long_five <- lite_five_wide |> trait_pivot_longer()
nrow(austraits:::austraits_5.0.0_lite$traits) == nrow(back_to_long_five)
#> [1] FALSE

nrow(austraits:::austraits_5.0.0_lite$traits)
#> [1] 116023
nrow(back_to_long_five)
#> [1] 149372

Created on 2024-09-10 with reprex v2.1.0

Thoughts team? @dfalster @ehwenk

fontikar commented 1 week ago

OPTION 1: Should we update the traitpivot functions to now work with austraits?

Currently plot_locations work with both austraits and traits table using:

  if( is.null(dim(aus_traits)) ){
  traits <- aus_traits$traits
  } else{
    traits <- aus_traits
  }

But I think this is also technical debt that needs to be removed and we should work with austraits list object?? Right?

ehwenk commented 1 week ago

Following on discussions with @ehwenk, option 1 tends to confuses users when to use austraits$traits or streamlining usage by replacing input argument to take austraits object may be beneficial. This solution creates an output that is not the same as the input which may be confusing to user? (incoming list of 13 length, outgoing, 1 widened tibble). @fontikar suggest maybe tacking out the widened table to the austraits object i.e. austraits$wide_traits ?

WIth regards to pivoting wider... I don't see any practical way to fully close the loop and re-pivot longer. trait_pivot_wider explicitly drops 4 columns (Line 49: meta_data_cols <- c("unit", "replicates", "measurement_remarks", "basis_of_value"), because these are often different for each trait measurement. One could come up with some complicated way to pack this information, but it would be a very busy column that would need to include the trait name before each set of information. I think that we're offering trait_pivot_wider as a way to see data with traits across different columns, but there shouldn't be the expectation that this is cyclical - and more than "extract". We also don't provide an "unextract" function. The question is then if there is any value in trait_pivot_longer

ehwenk commented 1 week ago

I've been thinking about plot locations, and I can see good reasons to retain the ability to use a single dataframe or a traits.build data object - there will be people who will use join.* functions to create a wider table that they might assign to a different name - or they might use their own filter functions within the traits table. Many people might know about the relational structure, but then simply save their manipulated dataframe to be a tibble and it shouldn't be that they can't continue manipulating/using functions that don't actually depend on the relational structure.

But I do have to think about how to implement "hooks" once we don't have database$metadata

ehwenk commented 1 week ago

For plot_locations there is nothing that inherently requires a traits.build formatted data object - in fact, it assumes that latitude/longitude have already been joined to a single table. So this function really doesn't need to use check_compatibility. It just needs to confirm that the table has columns latitude (deg) and longitude (deg) - and whatever has been specified as containing the trait data. So it should be fine for people to use this with completely different dataframes and the message should not be about old/new versions about AusTraits so much as suggesting to people they merge on coordinates if they haven't yet.

fontikar commented 1 week ago

@ehwenk

It just needs to confirm that the table has columns latitude (deg) and longitude (deg) - and whatever has been specified as containing the trait data. So it should be fine for people to use this with completely different dataframes and the message should not be about old/new versions about AusTraits so much as suggesting to people they merge on coordinates if they haven't yet.

I think the only difference between version <=3.0.2 and newer is this line

dplyr::select(location_name, `latitude (deg)`, `longitude (deg)`, !!feature)

In the old version location_name is replaced with site_name.

I think we can generalise the function using any_of

dplyr::select(any_of(location_name, `latitude (deg)`, `longitude (deg)`, !!feature))

That will remove the switch, and I will add a check for latitude and longitude

This does feel like an entirely different issue though so probably another branch and pursued after the switch deprecation.

fontikar commented 1 week ago

We also don't provide an "unextract" function. The question is then if there is any value in trait_pivot_longer

Just discussed with @dfalster - we all agree to deprecate. The value for users to revert their widened data back to long is not well known...if there is enough interest to revive it then we will revisit this later!

fontikar commented 1 week ago

I can see good reasons to retain the ability to use a single dataframe or a traits.build data object - there will be people who will use join.* functions to create a wider table that they might assign to a different name

We will add the flexibility of $traits or traits.build object for traits_pivot_wider as it has been done for plot_locations.

The compatibility check for traits.build >= 1.0 for $traits will be based on the column names. At the moment the function checks for treatment_context_id in the names of the $traits table. Looks like there are a few I could use:

setdiff(names(at_six$traits), names(austraits_4.2.0_lite$traits))
[1] "repeat_measurements_id" "temporal_context_id"    "plot_context_id"        "treatment_context_id"  
[5] "method_context_id"   

@ehwenk does these sound reasonable for checking for new traits.buildobjects?

I could take a look at the traits.build::db_trait_pivot_wider and hybridise the two?

This may be a seperate issue