LLM-based entity simplification for BRONCO Data

kunalr97 commented 1 month ago

Hi,

I wanted to extend my candidate generation approach on the BRONCO dataset (as shown in the example https://github.com/hpi-dhc/xmen/blob/main/examples/01_BRONCO_German.ipynb) with the LLM-based entity simplification described in your extended paper.

I came across the repo here, but it shows an example only for the SympTEMIST task: https://github.com/hpi-dhc/symptemist/blob/main/1_LLM_Simplification.ipynb

It would be interesting to see how this approach performs on the BRONCO dataset compared to the previous baseline.

Thanks in advance.

Best, Kunal

phlobo commented 1 month ago

Thank you for your question!

Basically, the notebook at https://github.com/hpi-dhc/symptemist/blob/main/1_LLM_Simplification.ipynb inserts an additional step after candidate generation.

So you should be able to easily adapt the SympTEMIST notebook by:

- replacing the code for data loading and candidate generation with the BRONCO-specific pieces
- renaming table_file, the file for caching LLM calls, to something like bronco[...].pkl
- adapting the fixed_few_shot_examples argument of the GPTSimplifier. You can probably start with an empty list to have no examples at all. This might actually be the biggest lever you have for improving performance; we didn't really optimize this for SympTEMIST, and it worked very well out of the box
- skipping the part Determine Optimal Cutoff for now and just working with the default cutoff (0.85), which works well in most cases
- doing the cross-encoder re-ranking as in the original BRONCO notebook
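
As a rough illustration of the control flow only (this is not the notebook's code and does not use the xMEN or GPTSimplifier APIs), the extra step could be sketched in plain Python as below. Everything here is an assumption made for the example: the helper names cached_simplify and candidates_with_simplification, the generate_fn and simplify_fn callables standing in for the candidate generator and the LLM call, and in particular the reading of the cutoff as a threshold on the top candidate score below which a mention gets simplified.

```python
import pickle
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, float]  # (concept ID, similarity score)


def cached_simplify(mention: str,
                    simplify_fn: Callable[[str], str],
                    cache_file: Path) -> str:
    """Return the simplified mention, calling simplify_fn only on a cache miss."""
    cache: Dict[str, str] = {}
    if cache_file.exists():
        with cache_file.open("rb") as f:
            cache = pickle.load(f)
    if mention not in cache:
        # Hypothetical stand-in for the (expensive) LLM call.
        cache[mention] = simplify_fn(mention)
        with cache_file.open("wb") as f:
            pickle.dump(cache, f)
    return cache[mention]


def candidates_with_simplification(mention: str,
                                   generate_fn: Callable[[str], List[Candidate]],
                                   simplify_fn: Callable[[str], str],
                                   cache_file: Path,
                                   cutoff: float = 0.85) -> List[Candidate]:
    """Generate candidates; if the best score is below the cutoff (assumed
    semantics), also generate candidates for the simplified mention and merge."""
    candidates = list(generate_fn(mention))
    top_score = max((score for _, score in candidates), default=0.0)
    if top_score >= cutoff:
        # Confident enough: keep the original candidates, best first.
        return sorted(candidates, key=lambda c: c[1], reverse=True)

    simplified = cached_simplify(mention, simplify_fn, cache_file)
    best: Dict[str, float] = {}
    for concept_id, score in candidates + list(generate_fn(simplified)):
        # Keep the highest score seen for each concept.
        best[concept_id] = max(score, best.get(concept_id, float("-inf")))
    return sorted(best.items(), key=lambda c: c[1], reverse=True)
```

The merged candidate list would then go into the cross-encoder re-ranking step unchanged. Keeping the cache in a separate file per dataset avoids mixing SympTEMIST and BRONCO simplifications.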

I would like to point out that SympTEMIST benefited a lot from this approach, as it has many very long mention spans that are hard to link. Mentions in BRONCO are much shorter on average, so you might have to think about ways in which rephrasing would benefit candidate generation performance (and adapt the few-shot examples accordingly). I assume there is quite a lot of potential for treatments in BRONCO (rephrasing mentions to make them more similar to terms in OPS), but maybe less so for diagnoses and medications, where candidate generation recall is already quite high.

kunalr97 commented 1 month ago

Thank you so much for your reply and insights! I will try it out ASAP.

hpi-dhc / xmen

LLM-based entity simplification for BRONCO Data #36