innovationcore / LocalLLMStructured


The baseline results are too weak #1

Open JoakimEdin opened 1 year ago

JoakimEdin commented 1 year ago

I enjoyed your paper a lot! However, I have one big criticism: I think that the implemented BERT models are too weak. Your results are worse than those I've seen in the literature. Either your dataset is more challenging than MIMIC-III and MIMIC-IV, or you have not implemented the BERT models properly. I have some suggestions for how to get stronger baseline models and make the comparison with the LLMs more fair:

  1. Use a stronger decoder. Simply applying a linear layer on top of the average of the output word representations or the CLS token representation results in poor performance in multi-label classification (https://arxiv.org/abs/2103.06511). Try using PLM-ICD (https://aclanthology.org/2022.clinicalnlp-1.2/). It uses label-wise attention on the word representations and is currently the SOTA model on MIMIC-III and MIMIC-IV for automated medical coding. You can find the code for the model here: https://github.com/JoakimEdin/medical-coding-reproducibility/blob/main/src/models/plm_icd.py (see the decoder sketch after this list).
  2. Setting the decision boundary threshold to 0.5 results in poor performance. You have to tune the decision boundary on the validation set, as we demonstrated in our paper: https://dl.acm.org/doi/10.1145/3539618.3591918 (see the threshold-tuning sketch after this list).
  3. Truncate the text to the same length when validating BERT and the LLMs. Truncating them to different lengths is not a fair comparison.
  4. Perform hyperparameter tuning on the models. Play with learning rate, dropout, and weight decay.
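
Regarding point 1, here is roughly what I mean by a label-wise attention decoder: one attention query per label, pooled over the encoder's token representations, with a per-label output layer. This is only a minimal PyTorch sketch to illustrate the structure, not the actual PLM-ICD implementation (the repository linked above has that); `hidden_size` and `num_labels` are placeholders.

```python
# Minimal sketch of a label-wise attention decoder (illustration only, not PLM-ICD itself).
import torch
import torch.nn as nn


class LabelWiseAttentionHead(nn.Module):
    """One attention query per label, pooled over the encoder's token representations."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        # Per-label attention queries and per-label output weights/biases.
        self.attention = nn.Linear(hidden_size, num_labels, bias=False)
        self.output = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        scores = self.attention(token_states)                      # (batch, seq_len, num_labels)
        scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)                     # attention over tokens, per label
        label_repr = torch.einsum("bsl,bsh->blh", weights, token_states)  # (batch, num_labels, hidden)
        # Per-label dot product with that label's output weights, plus its bias.
        logits = (label_repr * self.output.weight).sum(dim=-1) + self.output.bias
        return logits                                              # (batch, num_labels)
```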

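Regarding point 2, tuning a single global threshold can be as simple as the sketch below. The names `val_probs`, `val_labels`, and `test_probs` are placeholders for your validation/test sigmoid outputs and binary gold-label matrices, and micro-F1 is just one possible selection metric.

```python
# Sketch: pick the decision threshold on the validation set instead of fixing it at 0.5.
import numpy as np
from sklearn.metrics import f1_score


def tune_threshold(val_probs: np.ndarray, val_labels: np.ndarray) -> float:
    """Return the global threshold that maximizes micro-F1 on the validation set."""
    best_threshold, best_f1 = 0.5, 0.0
    for threshold in np.arange(0.05, 0.95, 0.01):
        preds = (val_probs >= threshold).astype(int)
        score = f1_score(val_labels, preds, average="micro", zero_division=0)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold


# Apply the tuned threshold to the test-set probabilities:
# test_preds = (test_probs >= tune_threshold(val_probs, val_labels)).astype(int)
```
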
I hope these comments are not interpreted as self-promotion! I am genuinely interested in how LLMs perform on automated medical coding compared to other models! Also, cool to see a comparison on a new dataset!

JoakimEdin commented 1 year ago

Also, would it be possible to get more data statistics on the datasets you are using? Some examples of statistics I would find interesting: the number of words per document, the number of codes per case, and the overall distribution of codes.

codybum commented 1 year ago

@JoakimEdin Thank you very much for your helpful information. We had grand plans to test various input types, tuned parameters, and models for this paper, but ran out of time.

The focus of the paper was more on methods and process instruction than on a fair comparison, but your point is taken. We have several projects in this space and will review your suggestions and see how we might improve future work.

We cut out a good bit of data to make the page limit, including stats on the dataset. @amullen34, could you post the stats requested in the previous post?

I would also like to see where we could collaborate on medical NLP.

Thanks again. Cody

amullen34 commented 1 year ago

Sure, here are those stats.

I don't have the exact number of words per document, but I can say that the average number of characters per document was 636. The average number of codes per case was 1.14. The majority of cases only had a single code attributed to them. Here are some overall stats on the distribution of codes (in the largest dataset):

Thank you, Aaron

JoakimEdin commented 1 year ago

Thank you very much for your response and the data statistics!

It is understandable that there wasn't time to do everything that was planned.

I'm always up for collaboration! You can contact me at je@corti.ai.

Looking forward to hearing from you!