species and determinand reference tables

RobFryer commented 1 year ago

If a species is in the data but not in the species reference table, the code throws an error (which includes the list of the relevant species): see line 2219 in import_functions: ok <- ctsm_get_info(info$species, data$species, "assess") (The check is put here, rather than earlier, because many legacy species in the data have already been weeded out)

Would be good to create an element in the control list that could turn the default error into an optional warning that prints a similar message identifying the relevant species, but which then deletes the data with those species, and continues.

ctsm_get_info includes a call to ctsm_check_reference_table which does the checking. Rather than change ctsm_get_info, would make the code clearer to add an additional call to ctsm_check_reference_table beforehand, followed by code that deletes the relevant data. Could amend code so that it can output id (the vector of missing species) if required (i.e. if a warning has been specified rather than an error). Default should be an error so that ctms_get_info still works.

There is a similar problem for determinands, but only, I think, for external data.
This is in line 2039 of import functions: data$pargroup <- ctsm_get_info(info$determinand, data$determinand, "pargroup") Again, this could be preceded by a specific call to ctsm_check_reference_table as for species

Similarly, if a determinand is in the data but not in the determinand reference table line 2147 in import_functions: wk <- ctsm_check_determinands(info, data, determinands, determinands.control)

Would be good to create elements in the control list to turn this into a warning (default: error, optional: warning) and then allow the code to continue.

If there is a warning, all data with species that are not in the reference table should be deleted (to avoid failure later on) and all data with determinands that are not in the reference table should also be deleted.

morungos commented 1 year ago

Hmmm, checking the code, right now all this does is warn, although strictly not even a warning (using cat). So, the current status appears to be: generating a message and then dropping the data associated with the species, and then allowing the code to continue. If there is an error, it isn't here. Maybe ctsm_get_info is what stops the process, because (after looking at the code, see next) it will fail by default. But I don't really know.

Next, ctsm_get_info appears to be broken. na_action should one of c("fail", "input_ok", "output_ok", "ok"), but actually defaults to that vector! Strictly, it should probably default to a reasonable value, and then match.arg should do a validation, which it currently doesn't do, because choices aren't passed.

I think, basically, I need to start ramping up a "proper" test setup with actual files.

morungos commented 1 year ago

I've made decent progress adding in a testing system, so we can now run a data load. This also can check that our validation of uncertainty units works. We need a fair bit more still, but as a basic check that can be run, this is fine.

RobFryer commented 1 year ago

I'm as sure as I can be that ctsm_get_info is ok.

na.action can only take one value because argument serveral.ok = FALSE (default) in the call to match.arg

Also, na.action will equal "fail" by default, because if nothing is formally passed into the argument, match.arg takes the first value from the list of choices. This is from the match.arg help file: Since default argument matching will set arg to choices, this is allowed as an exception to the ‘length one unless several.ok is TRUE’ rule, and returns the first element.

About to look at your other comments

RobFryer commented 1 year ago

ok <- ctsm_get_info(info$species, data$species, "assess")

When this is called, ctsm_get_info will call ctsm_check_reference_table which will check whether all the elements of data$species are in the row.names of info$species and throw an error (with message) if they are not. So it will fail properly.

It's certainly not the clearest bit of code I've ever written!

morungos commented 1 year ago

Heck, you're right. I hadn't expected match.arg to be anything like that crazy. Probably the most unusual default behaviour I've seen in a very long time. I've never linked magic happening by default, so I for one would very much an explicit default.

RobFryer commented 1 year ago

Ha! match.arg is indeed a magical function. I'm afraid I've used it a lot, because it is so useful. But if you would like to replace it with an explicit default (whilst checking that it can only take one of a specified set of values), then please go ahead.

morungos commented 1 year ago

Yes, this is way too much magic to be clear to anyone less well-versed in R than you. However, you've used that pattern very consistently, and it is documented, I suppose, so I shall consider it something I have learned. I wouldn't to change one call, and changing all of them is not wise, so we'll leave for now.

morungos commented 1 year ago

Picking up on this now, and it's annoying. The issue is that ctsm_get_info does a whole lot of work before it calls ctsm_check_reference_table, and it doesn't necessarily even do that. So, what gets passed to ctsm_check_reference_table would have to replicate quite a bit of input cleanup from ctsm_get_info. I am sure there is a better way.

The way I'd handle this would be an error hook of some kind, defaulting to NULL, which will be called in the event of a failed check. That error hook can remove data that should be removed. We still want to amend ctsm_check_reference_table so that it can warn instead of erroring, but... that way we can inject the call into ctsm_get_info, so that it takes place in context, rather than needing to pre-compute all the cleanups. The alternative is to factor the cleanups into private functions, which probably should happen anyway, but is still duplicating work.

However, first of all, let's assemble a small unit test, so that we can replicate the stop condition and think about the cleanup process.

RobFryer commented 1 year ago

The simple way to assemble a small unit test is to remove a relevant species (or a relevant determinand) from the reference table. I think it should only fail on determinand using external data. Species will fail for both ICES data and external data.

An inelegant way of doing this would be to have an explicit call to check_reference_table before the relevant crunch points. And, as you say, allow check_reference_table to proceed with a warning. .

Note that doing any filtering of data in ctsm_get_info will cause a lot of changes as it currently returns a vector, rather than an augmented data frame. Would rather keep it separate.

morungos commented 1 year ago

The unit test part is the easy part -- I have that working.

As I said, there are architectural challenges with duplicating logic from ctsm_get_info. It does very weird things like access parse trees to derive default values for info_type -- it's about 60 lines of code which cannot be simply refactored. So adding an additional explicit call to ctsm_check_reference_table from outside ctsm_get_info is not trivial, far from it. I know that's what you'd prefer, but it's painful -- as well as inelegant.

However, as you said, we cannot really change the return value from ctsm_get_info, either.

Ultimately, the issue is architectural. We should not hook into ctsm_get_info, but also, ctsm_check_reference_table is too late in the process. We really need a new function ctsm_verify_info which behaves somewhat like ctsm_get_info, except it does not return any data, what it does is either (a) throw an error if requested, or (b) warn and redact the input, so that the new redacted input cam be safely passed to ctsm_get_info. This will involve a fair bit of duplication, but calling ctsm_check_reference_table is probably not useful, since we need to know the invalid values so we can redact them. And most of the complexity comes from the fact we need to handle both split and nonsplit input values.

And really, that means we should unit test ctsm_verify_info, which is itself a new thing, as we've not tested internals much so far. A test for ctsm_get_info should be a good start.

RobFryer commented 1 year ago

Totally agree with what you are saying.

However, one 'problem' with doing a verify_info call too early in the process (at least for ICES data) is that e.g. you would get a list of which species are not in the reference table, but you wouldn't know which ones you need to add to the reference table. Many of the species are legacy species, only sampled way back in time, and you only need to add those species that are currently monitored. Every year I have to add a few more (typically less than five, which contracted parties have started monitoring) and the way the code currently works, this is communicated really clearly because the checking is done after all the legacy data have been removed.

The real crunch with the current situation comes when using external data, and it is to do with the population of the pargroup column (which we don't require external users to provide). At this stage of the process, external data sets can have lots of deteminands that are of no interest at all and which again get weeded out before there are any subsequent calls to ctsm_get_info. So we could mitigate the issue considerably by adding the pargroup column much later in the process. I'll have a look and implement if possible.

This doesn't mean that a verify_info function wouldn't be useful / important. But dealing with pargroup later would certainly dramatically reduce the scale of the issue.

morungos commented 1 year ago

Things get added to the reference tables?!! How deep does this rabbit hole go? Or are we adding rabbits now?

Made some work on this, hopefully more today, these comments are pretty useful guidance. Thanks.

RobFryer commented 1 year ago

Sorry, didn't make myself clear. Nothing gets added to the reference tables.

With external data, users don't provide the pargroup column with the data (it is picked up from the reference table).
With ICES data, pargroup is provided in the data extraction.

So, for external data, we add a column to the data that contains the pargroup information - this comes from the reference table. But of course the pargroup information isn't there when the determinand is missing from the reference table. As (for external data) we do this (currently) very early in the process, it can fall over quite easily. If we do it later, then much of the problem will fade away.

osparcomm / HARSAT

species and determinand reference tables #353