Proposal: Make the `CENS` column optional

billdenney commented 1 year ago

As described here, https://nlmixr2.github.io/rxode2/articles/rxode2-datasets.html#censlimit-columns, the CENS column doesn't necessarily give different information than having the DV and LIMIT columns:

Restating that documentation:

if LIMIT doesn't exist then CENS is implicitly zero
if LIMIT exists and CENS is 0, then it is ignored
if LIMIT < DV, then CENS is 1
if LIMIT > DV, then CENS is -1

It seems like we could simplify the data management requirements by removing the need for the CENS column by just making the CENS value implicit based on the value of the LIMIT column. The only modifications to the above would be:

if LIMIT is NA, then CENS is zero
if LIMIT == DV, then CENS is zero

For backward compatibility, we could use the user-provided CENS column, if it is provided. But otherwise, we could automatically make it based on these rules.

What do you think?

mattfidler commented 1 year ago

I think that the understanding is not quite correct, LIMIT is used to define the M2 and the M4 methods. The above would break that distinction.

mattfidler commented 1 year ago

I suppose if cens is missing you could impute 0 but that would make everything M2 method.

mattfidler commented 1 year ago

If you use what you are using you are imputing only a M4 method. The LIMIT is not the LLOQ or ULOQ it is actually the extra limit for M2 or M4. The LLOQ/ULOQ is actually captured in the DV value.

billdenney commented 1 year ago

But, in practice, aren't LIMIT and DV used to specify that, "The value is between LIMIT and DV?" And, the order of LIMIT and DV are arbitrary-- meaning that values of LIMIT = 1 and DV = 2 could effectively be the same as LIMIT = 2 and DV = 1?

I think that there is confusion about what is missing and the imputation: I was meaning that if CENS were missing because the column didn't exist, we could impute it accurately based just on the value of the LIMIT column:

LIMIT = NA means CENS = 0
LIMIT > DV means CENS = -1
LIMIT < DV means CENS = 1

The difference between M2 and M3 are just the difference between setting LIMIT to 0 or -Inf.

mattfidler commented 1 year ago

But, in practice, aren't LIMIT and DV used to specify that, "The value is between LIMIT and DV?"

It also contains information about left or right censoring which changes the likelihood.

mattfidler commented 1 year ago

So swapping LIMIT and DV changes the likelihood

billdenney commented 1 year ago

Isn't the likelihood value always the value of the integral from the lower of the two values to the upper of the two values? Is there any scenario where you would integrate in the opposite direction?

mattfidler commented 1 year ago

It is like left and right hypotheses tests. The integrals are different, and the likelihoods are different. So no, it is not.

If the model is fitting poorly for the censored time point, I suppose that you could integrate in the opposite direction.

mattfidler commented 1 year ago

That being said, you probably would not want to integrate in the opposite direction

mattfidler commented 1 year ago

Having CENS be optional could still work regardless. However, I do worry that we will make implications that the modeler may not have wanted. (Like M3 vs M4 vs M2). I have already have had emails about these sorts of cases and know that what you are proposing everyone doesn't follow.

I do wish people would post them here instead...

This is the reason I list the types of censoring methods used in the nlmixr2 output object because people are calling things they are not expecting to.

mattfidler commented 1 year ago

If changed it would be in rxode2parse where the underlying event table translation occurs.

billdenney commented 1 year ago

I had hoped this would make things simpler. If it will add confusion or cause other issues, we can close the issue. (I'll leave that to your judgement.)

mattfidler commented 1 year ago

I'm actually not sure. I think I have thought about this before too, but at the same time I about 3-4 people providing me datasets that don't match the definition I thought it should.

That makes me pause and wonder if this would do something I don't expect.

Perhaps we should ask others.

mattfidler commented 1 year ago

It also would only make it simpler for M4 censoring I guess.

mattfidler commented 1 year ago

And require a rewrite of geom_cens() too.

mattfidler commented 1 year ago

It also tells me the documentation is wrong and probably needs to be updated:

if LIMIT doesn't exist then CENS is implicitly zero
if LIMIT exists and CENS is 0, then it uses M2 censoring (this one is inconsistent because the M2 censoring depends on if the ipred is below or above the LIMIT
if LIMIT < DV, and CENS is 1, and it uses M4 censoring; It assumes that the censored observation is between LIMIT and DV
if LIMIT > DV, and CENS is -1, and it uses M4; It assumes that the censored observation is between LIMIT and DV

In theory if you are using M4 for missing observations you should use M2 for the other observations to be consistent, so you should have a limit for your censored and uncensored values.

mattfidler commented 1 year ago

Since I think it is more consistent to mix M2 and M4, I actually think that what is proposed pushes people to use M4 only and I'm unsure from a theoretical perspective I completely agree with that method.

If you are going to say that the likelihood is bound by zero then it should be bound by zero by all observations and missing observations should share this bound (in my opinion) and should use M2 and M4.

I could be wrong here, but maybe it is the right approach?

If you are using M3 you don't use the limit column at all.

mattfidler commented 1 year ago

As discussed today, probably not, though the documentation probably needs to be improved a bit.

nlmixr2 / rxode2parse

Proposal: Make the `CENS` column optional #41