grambank / rgrambank

R package to access and analyse Grambank's CLDF data
Apache License 2.0
3 stars 1 forks source link

Binarise #12

Closed HedvigS closed 1 year ago

HedvigS commented 1 year ago

Function that binarises grambank data correctly from a cldf values table

johenglisch commented 1 year ago

Some remarks on the test code:

  1. Put your test files into tests/testthat.
  2. Name them something like tests/testthat/test-binarise.R.
  3. Structure the test code in these little blocks:

    test_that("multistate features are being binarised", {
      data_with_multistate_features <- ...
      outcome <- binarise(data_with_multistate_features) %>% ...
      expected <- ...
      expect_equal(outcome, expected)
    })
    
    test_that("binarised features are not being overwritten", {
      data_with_binarised_features <- ...
      outcome <- binarise(data_with_binarised_features) %>% ...
      expected <- ...
      expect_equal(outcome, expected)
    })

(like this, basically)

If you do that you can just hit a key combination in RStudio (Cmd+Shift+T, apparently) and it will automatically find all your tests and run them. You'll get a nice summary in the Build tab:

ℹ Testing rgrambank
[...]
✔ | F W S  OK | Context
✔ |         2 | db

══ Results ═════════════════════════════════════════════
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ]

Advantages:

  1. You can get into a really nice workflow like ‘Change code – run tests – fix code – run tests – add test – run tests – fix code – run tests’ etc.

  2. Github can be set up to run tests automatically every time someone pushes to master or submits a PR, which is reeeally nice (we do that for our python programs and I assume it's probably possible with R as well).

Also, it's good practice to have your tests be as isolated as possible from the ‘messy real world’, i.e. not relying on a working internet connection or a remote file not being deleted:

https://github.com/grambank/rgrambank/blob/7a43c78ede5b36d42a65ff0f89dba34649b76e08/tests/test_binarise.R#L10

It might be better to read example data from a local file in the git repo or to manually create the data in the test program itself, like what you did with fake_raw_binary_data below (that's usually what I prefer):

https://github.com/grambank/rgrambank/blob/7a43c78ede5b36d42a65ff0f89dba34649b76e08/tests/test_binarise.R#L12-L22

That will make it so your tests only fail (or stop failling) because of changes to your code, not because of outside factors like working on a train ride.

HedvigS commented 1 year ago

Some remarks on the test code:

  1. Put your test files into tests/testthat.
  2. Name them something like tests/testthat/test-binarise.R.
  3. Structure the test code in these little blocks:

    test_that("multistate features are being binarised", {
     data_with_multistate_features <- ...
     outcome <- binarise(data_with_multistate_features) %>% ...
     expected <- ...
     expect_equal(outcome, expected)
    })
    
    test_that("binarised features are not being overwritten", {
     data_with_binarised_features <- ...
     outcome <- binarise(data_with_binarised_features) %>% ...
     expected <- ...
     expect_equal(outcome, expected)
    })

(like this, basically)

If you do that you can just hit a key combination in RStudio (Cmd+Shift+T, apparently) and it will automatically find all your tests and run them. You'll get a nice summary in the Build tab:

ℹ Testing rgrambank
[...]
✔ | F W S  OK | Context
✔ |         2 | db

══ Results ═════════════════════════════════════════════
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ]

Advantages:

  1. You can get into a really nice workflow like ‘Change code – run tests – fix code – run tests – add test – run tests – fix code – run tests’ etc.
  2. Github can be set up to run tests automatically every time someone pushes to master or submits a PR, which is reeeally nice (we do that for our python programs and I assume it's probably possible with R as well).

Also, it's good practice to have your tests be as isolated as possible from the ‘messy real world’, i.e. not relying on a working internet connection or a remote file not being deleted:

https://github.com/grambank/rgrambank/blob/7a43c78ede5b36d42a65ff0f89dba34649b76e08/tests/test_binarise.R#L10

It might be better to read example data from a local file in the git repo or to manually create the data in the test program itself, like what you did with fake_raw_binary_data below (that's usually what I prefer):

https://github.com/grambank/rgrambank/blob/7a43c78ede5b36d42a65ff0f89dba34649b76e08/tests/test_binarise.R#L12-L22

That will make it so your tests only fail (or stop failling) because of changes to your code, not because of outside factors like working on a train ride.

Sure, I can do that if we want that. I just wrote these scripts quickly for myself without testthat to see that the code worked the way I expected.

@SimonGreenhill do you want all functions to have a formalised testthat workflow as Johannes suggests?

HedvigS commented 1 year ago

@johenglisch does the binarise-function work the way you expect it too based on our chat the other day?

SimonGreenhill commented 1 year ago

yes please! it'll make life better in the long run

HedvigS commented 1 year ago

yes please! it'll make life better in the long run

Alright, I'll try and find time later this week to do this.

HedvigS commented 1 year ago

I'm working in this branch now. i can't turn this PR into a draft, but if I could I would and turn it back when I'm done :)

HedvigS commented 1 year ago

alright @johenglisch and @SimonGreenhill I've made a stab at a testthat for binarise.

johenglisch commented 1 year ago

The tests look good. Just a tiny hint: You don't need to put the library(...) calls inside the individual test cases.

Like, as a general rule it's very good that the tests don't share any variables etc. among each other. We like every test to fail/succeed on its own, independent of other test code. But there are things that are considered ‘fair game’ for tests to share (esp. library imports and the occasional constant):

library(...)
library(...)

test_that("...", {
  ...
})

test_that("...", {
  ...
})

test_that("...", {
  ...
})

It just makes writing/changing tests a bit less tedious for you.

HedvigS commented 1 year ago

The tests look good. Just a tiny hint: You don't need to put the library(...) calls inside the individual test cases.

Like, as a general rule it's very good that the tests don't share any variables etc. among each other. We like every test to fail/succeed on its own, independent of other test code. But there are things that are considered ‘fair game’ for tests to share (esp. library imports and the occasional constant):

library(...)
library(...)

test_that("...", {
  ...
})

test_that("...", {
  ...
})

test_that("...", {
  ...
})

It just makes writing/changing tests a bit less tedious for you.

Thanks! I had those in because they were necessary when I was trial and erroring at one stage, but of course happy to remove now.

HedvigS commented 1 year ago

@johenglisch does the binarise-function work the way you expect it too based on our chat the other day?

@johenglisch besides the tests, does the function work the way you expect?

johenglisch commented 1 year ago

does the function work the way you expect?

These tests here look exactly like your notes on the whiteboard, so I'd say it's working:

https://github.com/grambank/rgrambank/blob/653f16b9127aa475255260133fe9ac532db50fb0/tests/testthat/test-binarise.R#L3-L9

HedvigS commented 1 year ago

@SimonGreenhill I'm gonna go ahead and re-write binarise now as per your email

HedvigS commented 1 year ago

i removed the need for stringr

HedvigS commented 1 year ago

@SimonGreenhill I'm done with tests/testthat/test-binarise.R and R/binarise.R.

Turning to language_level_df now

HedvigS commented 1 year ago

@SimonGreenhill I suspect something unfortunate has happened, because I made changes to language_level_df to take only long tables this week, but I think what got merged into here doesn't reflect that. i'm gonna do some git detective work and double check things.

HedvigS commented 1 year ago

@SimonGreenhill I suspect something unfortunate has happened, because I made changes to language_level_df to take only long tables this week, but I think what got merged into here doesn't reflect that. i'm gonna do some git detective work and double check things.

Yep, I can confirm the language_level_df script in this branch is older than in the branch in #3 . I'll go ahead and do the specific pull ins.

SimonGreenhill commented 1 year ago

Don't think I changed anything vital in that file beyond adding comments so you should be ok to just overwrite

HedvigS commented 1 year ago

@SimonGreenhill I have moved content over from #3 so that the function language_level_df again operates with long tables rather than wide. I think I've addressed all of your feedback on that function now and I made some other smaller improvements.

HedvigS commented 1 year ago

@SimonGreenhill binarise and language_level_df are ready for review again.