innovationcore / LocalLLMStructured


The baseline results are too weak #1

Open JoakimEdin opened 1 year ago

JoakimEdin commented 1 year ago

I enjoyed your paper a lot! However, I have one big criticism: I think that the implemented BERT models are too weak. Your results are worse than those I've seen in the literature. Either your dataset is more challenging than MIMIC-III and MIMIC-IV, or you have not implemented the BERT models properly. I have some suggestions for how to get stronger baseline models and make the comparison with the LLMs more fair:

  1. Use a stronger decoder. Simply applying a linear layer on top of the average of the output word representations or the CLS token representation results in poor performance in multi-label classification (https://arxiv.org/abs/2103.06511). Try using PLM-ICD (https://aclanthology.org/2022.clinicalnlp-1.2/). It uses label-wise attention on the word representations and is currently the SOTA model on MIMIC-III and MIMIC-IV for automated medical coding. You can find the code for the model here: https://github.com/JoakimEdin/medical-coding-reproducibility/blob/main/src/models/plm_icd.py (see the decoder sketch after this list).
  2. Setting the decision boundary threshold to 0.5 results in poor performance. You have to tune the decision boundary on the validation set, as we demonstrated in our paper: https://dl.acm.org/doi/10.1145/3539618.3591918 (see the threshold-tuning sketch after this list).
  3. Truncate the text to the same length when validating BERT and the LLMs. Truncating them to different lengths is not a fair comparison.
  4. Perform hyperparameter tuning on the models. Play with learning rate, dropout, and weight decay.
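
Regarding point 1, here is roughly what I mean by a label-wise attention decoder: one attention query per label, pooled over the encoder's token representations, with a per-label output layer. This is only a minimal PyTorch sketch to illustrate the structure, not the actual PLM-ICD implementation (the repository linked above has that); `hidden_size` and `num_labels` are placeholders.

```python
# Minimal sketch of a label-wise attention decoder (illustration only, not PLM-ICD itself).
import torch
import torch.nn as nn


class LabelWiseAttentionHead(nn.Module):
    """One attention query per label, pooled over the encoder's token representations."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        # Per-label attention queries and per-label output weights/biases.
        self.attention = nn.Linear(hidden_size, num_labels, bias=False)
        self.output = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        scores = self.attention(token_states)                      # (batch, seq_len, num_labels)
        scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)                     # attention over tokens, per label
        label_repr = torch.einsum("bsl,bsh->blh", weights, token_states)  # (batch, num_labels, hidden)
        # Per-label dot product with that label's output weights, plus its bias.
        logits = (label_repr * self.output.weight).sum(dim=-1) + self.output.bias
        return logits                                              # (batch, num_labels)
```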

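Regarding point 2, tuning a single global threshold can be as simple as the sketch below. The names `val_probs`, `val_labels`, and `test_probs` are placeholders for your validation/test sigmoid outputs and binary gold-label matrices, and micro-F1 is just one possible selection metric.

```python
# Sketch: pick the decision threshold on the validation set instead of fixing it at 0.5.
import numpy as np
from sklearn.metrics import f1_score


def tune_threshold(val_probs: np.ndarray, val_labels: np.ndarray) -> float:
    """Return the global threshold that maximizes micro-F1 on the validation set."""
    best_threshold, best_f1 = 0.5, 0.0
    for threshold in np.arange(0.05, 0.95, 0.01):
        preds = (val_probs >= threshold).astype(int)
        score = f1_score(val_labels, preds, average="micro", zero_division=0)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold


# Apply the tuned threshold to the test-set probabilities:
# test_preds = (test_probs >= tune_threshold(val_probs, val_labels)).astype(int)
```
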
I hope these comments are not interpreted as self-promotion! I am genuinely interested in how LLMs perform on automated medical coding compared to other models! Also, cool to see a comparison on a new dataset!

JoakimEdin commented 1 year ago

Also, would it be possible to get more data statistics on the datasets you are using? Some examples of statistics I would find interesting: the number of words per document, the number of codes per case, and the overall distribution of codes.

codybum commented 1 year ago

@JoakimEdin Thank you very much for your helpful information. We had grand plans to test various input types, tuned parameters, and models for this paper, but ran out of time.

The focus of the paper was more on methods and process instruction than on a fair comparison, but your point is taken. We have several projects in this space and will review your suggestions and see how we might improve future work.

We cut out a good bit of data to make the page limit, including stats on the dataset. @amullen34, could you post the stats requested in the previous post?

I would also like to see where we could collaborate on medical NLP.

Thanks again. Cody

amullen34 commented 1 year ago

Sure, here are those stats.

I don't have the exact number of words per document, but I can say that the average number of characters per document was 636. The average number of codes per case was 1.14. The majority of cases only had a single code attributed to them. Here are some overall stats on the distribution of codes (in the largest dataset):

Thank you, Aaron

JoakimEdin commented 1 year ago

Thank you very much for your response and the data statistics!

It is understandable that there wasn't time to do everything that was planned.

I'm always up for collaboration! You can contact me at je@corti.ai.

Looking forward to hearing from you!