rmnldwg / lymph

Python package for statistical modelling of lymphatic metastatic spread in head & neck cancer.
https://lymph-model.readthedocs.io
MIT License
5 stars 4 forks source link

encoding of diagnose/involvement patterns does not work for trinary #59

Closed rmnldwg closed 6 months ago

rmnldwg commented 7 months ago

The function matrix.compute_encoding() can currently only compute the correct data encoding for the binary model case.

YoelPH commented 7 months ago

The new encoding looks nice! I just found some issues that are not considered yet (I think):

The code correctly checks the base to assign the correct element_map. But, the base is not provided when compute_encoding is called in unilateral and I guess neither in bilateral.

In the risk function (e.g. unilateral line 926) we would need to add the base assignment for example like this:

        marginalize_over_states = matrix.compute_encoding(
            lnls=self.graph.lnls.keys(),
            pattern=involvement,
            base = len(self.graph.allowed_states)
        )
        return marginalize_over_states @ posterior_state_dist 

But then the issue comes up that self.graph.allowed_states is also not functional yet. Here the issue is in line 539 of graph. We access the value wrongly. A fix would be:

next(iter(model.lnls.values()), None)

Now the risk function is functional :) but The encoding does not properly consider how to encode trinary states yet. The diagnoses for Trinary diagnose states are still provided as only True and False. Hence we can fix this either with:

rmnldwg commented 7 months ago

The code correctly checks the base to assign the correct element_map. But, the base is not provided when compute_encoding is called in unilateral and I guess neither in bilateral.

True! I'll fix that similar to what you suggested.

But then the issue comes up that self.graph.allowed_states is also not functional yet.

Huiuiui, Ok, I fixed that too 🙈

The diagnoses for Trinary diagnose states are still provided as only True and False.

Yes, that's on purpose! The output of the compute_encoding() function is the answer to "which states match the given pattern of involvement?". And that must always be binary, regardless of whether the model itself is binary or trinary. Because in the end it is used to marginalize over a set of probabilities.

YoelPH commented 7 months ago

Ahh, yes. Now that I think again about it, we are considering these cases with Sensitivity and Specificity later on.

rmnldwg commented 7 months ago

No, I think you're right! There's still a mistake in the generate_data_encoding() function: This one does not take into account that reports from pathology and clinical stuff should yield different encodings (they should, right?).

But it should be fixable without touching the actual encoding.

YoelPH commented 7 months ago

No, I think you're right! There's still a mistake in the generate_data_encoding() function: This one does not take into account that reports from pathology and clinical stuff should yield different encodings (they should, right?).

But it should be fixable without touching the actual encoding.

Yeah, this was my guess. But I am not quite sure yet what needs to be changed for it to run properly. Right now we also have a dimensional problem that occures when you try to run the code... So I am thinking about what changes need to be done

rmnldwg commented 7 months ago

Ok, let's break it down:

  1. Diagnoses are always binary
  2. We have a confusion matrix for every diagnosic modality that maps binary diagnoses to trinary states which has shape $2 \times 3$
  3. Therefore, the observation matrix has shape $2^{L \cdot M} \times 3^L$ if we have $L$ lymph node levels and $M$ modalities.
  4. The observation matrix essentially translates a binary data encoding for each modality into the probabilities for different trinary states, based on the sensitivity/specificity of all considered modalities.
  5. That means we do not need to change the data encoding (but maybe a .T is still missing here or there 😅) and everything should be correctly implemented.

Did I miss anything? Does that sound right? 🤔

YoelPH commented 7 months ago

I do agree. With what you wrote. The only issue I encountered is that the encoding does not have the proper length. The one-hot encoded vector should be of length $2^{L\cdot M}$ to be multiplied with the observation matrix. Right now the resulting one-hot encoded vectors are of length $3^{L\cdot M}$ which is why I assumed there to be a problem with the encoding.

rmnldwg commented 7 months ago

Yes, I think I have verschlimmbessert the situation at some point with a wrong fix. I found another issue that might have been the culprit of the dimension mismatch: A trinary confusion matrix could be only of shape $2 \times 2$ if it was computed at the wrong time, leading to some errors. I fixed that, too now.

rmnldwg commented 7 months ago

Try again now: I have added a risk test for the trinary model and changed the base argument to the matrix.compute_encoding() function inside the comp_diagnose_encoding() method to binary (i.e., 2) always (because diagnoses are always binary). It runs smoothly and yields plausible results (meaning it is between 0 and 1).

YoelPH commented 7 months ago

I just ran some tests and it seems like it works the way it is intended :) Thanks for the fix!