richelbilderbeek / pirouette

R package that estimates the error BEAST2 makes from a given phylogeny
GNU General Public License v3.0
3 stars 2 forks source link

Add functionality to implement multiple error functions #371

Closed thijsjanzen closed 4 years ago

thijsjanzen commented 4 years ago

Currently, pirouette supports the nLTT as error function. In an ideal world, the user could supply an arbitrary error function, or multiple error functions, and use these.

I imagine for instance the user supplying a list containing the used error functions, and pirouette then applying each function to the obtained posterior.

Obviously, the default should still be nLTT, to retain (backwards) compatibility

richelbilderbeek commented 4 years ago

I think it is fine to replace the error_function, so I've taken the first steps. Until I've hit this:

from

The figure shows that the output of pir_run is a data frame with as much rows as there are experiments. There are a lot of tests and functions (e.g. pir_plot) that depend on this output, so it will be quite some work.

We can add another column called 'error_fun' per error function, meaning that pir_run returns a data frame with as much rows as error comparisons are done.

richelbilderbeek commented 4 years ago

A test I've added, to show how I would enjoy the interface to be:

test_that("allow to add more errors", {

  skip("Issue 371, Issue $371")
  # Use 'error_funs' instead of 'error_fun', to allow for more error functions
  expect_silent(
    create_error_measure_params(
      error_funs = get_nltt_error_fun()
    )
  )
  expect_silent(
    create_error_measure_params(
      error_funs = list(get_nltt_error_fun(), get_nltt_error_fun())
    )
  )
})
thijsjanzen commented 4 years ago

I think it is fine to replace the error_function, so I've taken the first steps. Until I've hit this:

from

The figure shows that the output of pir_run is a data frame with as much rows as there are experiments. There are a lot of tests and functions (e.g. pir_plot) that depend on this output, so it will be quite some work.

We can add another column called 'error_fun' per error function, meaning that pir_run returns a data frame with as much rows as error comparisons are done.

  • @thijsjanzen: I think it can be done, but it will take some work. I assume you volunteer?
  • @Giappo: what do you think about all this?

I'm happy to give it a shot. As for the output frame, in the tidy format, this could be a table with the following columns: experiment, tree, inference model, inference_model_weight, site_model, clock_model, tree_prior, error_fun, error_val

Tracking down the functions depending on the non-tidy setup will be an interesting task.

richelbilderbeek commented 4 years ago

Cool, if @Giappo agrees, then I'd say you go hit it :+1:

I agree having tidy data is better, but I suggest to make this another Issue. For now, I hope you use the wide form, also the refactoring for all dozens of packages depending on pirouette will be too complex. Can you agree on that?

thijsjanzen commented 4 years ago

Cool, if @Giappo agrees, then I'd say you go hit it 👍

I agree having tidy data is better, but I suggest to make this another Issue. For now, I hope you use the wide form, also the refactoring for all dozens of packages depending on pirouette will be too complex. Can you agree on that?

I see your point, but I have no idea how one would analyze a dataframe in the wide format, given multiple error functions, unless you parse the column names, which is very ugly.

But I'll give it a try next week.

richelbilderbeek commented 4 years ago

Nah, we convert the data frame to long format in pir_plot, see here. It's three lines of code. I guess one can easily modify that :+1:

richelbilderbeek commented 4 years ago

Hi @thijsjanzen, just to write down my thoughts explicitly:

To keep the current tests passing, I suggest the following return value of pir_run:

Current

tree inference_model ... error_1 ...
... ... ... ... ...

Suggested

tree inference_model error_fun_name ... error_1 ...
... ... ... ... ... ...

As said before, converting to tidy data is easy, especially for l33t like us :sunglasses:

thijsjanzen commented 4 years ago

I don't really understand how you would make that tidy. I would imagine something like:

[tree, inference_model, fun_1_error_1, fun_1_error_2, fun_1_error_3, fun_2_error_1, fun_2_error_2, fun_2_error_3 ....]

then:

tidy_data <- un_tidy_data %>%
        gather(key = "error_fun", value = "error", -c(tree, inference_model) ) %>%
        mutate(  # some awful string parsing function that transforms fun_1_error_1 to the name of fun_1 )
richelbilderbeek commented 4 years ago

Sure, that will also work. I predict my version will be easier to implement, but I do suggest you try the implementation you feel most comfortable with.

Good luck :+1:

richelbilderbeek commented 4 years ago

Will take over for now.

richelbilderbeek commented 4 years ago

@thijsjanzen: I am sorry for the noise, but I don't need this for now.

I will unassign myself. First to need this Issue done gets it :+1:

richelbilderbeek commented 4 years ago

Hi @thijsjanzen, is this Issue still relevant? If no, I will close it :+1:

richelbilderbeek commented 4 years ago

26 days of no response, sounds like a 'no' :+1: