som-shahlab / ehrshot-benchmark

A benchmark for few-shot evaluation of foundation models for electronic health records (EHRs)
https://ehrshot.stanford.edu
Apache License 2.0

How to track which diagnoses become which tokens when clmbr batches are created? #13

Closed: ulzee closed this issue 2 months ago

ulzee commented 2 months ago

Hello, I was wondering what the simplest way would be to check which token corresponds to which code, e.g. a given SNOMED code. I was trying to infer this from the dictionary object, but it did not seem directly possible.

Miking98 commented 2 months ago

Sorry for the confusion @ulzee and thanks for the comment!

Please run this script to view this data: https://github.com/som-shahlab/ehrshot-benchmark/blob/033715c3d5ed873c3fd2ab3cbc408d0efaf733ee/ehrshot/convert_dictionary_to_json.py

It will generate three files in EHRSHOT_ASSETS/models/clmbr:

We will update EHRSHOT_ASSETS in our next version of the dataset release to include these files by default.

ulzee commented 2 months ago

Thank you for the clarifications. I think I'm still a bit lost on the tokens produced by femr.models.dataloader.BatchLoader: they are in the range 0-65535, but dictionary.json contains 1729229 medical concepts. I assume, then, that either concepts 65536-1729229 are not used or there is a many-to-few reduction somewhere. Or I'm missing something about the tokenization.
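For anyone wanting to test the first hypothesis (concepts past 65535 are simply unused), here is a minimal sketch. It rests on assumptions the thread does not confirm: that dictionary.json is a JSON list, that the entry at position i corresponds to token id i, and that the file sits in EHRSHOT_ASSETS/models/clmbr. The `code_string` field and the helper names are hypothetical, and the toy data stands in for the real file.

```python
import json
from typing import Any

VOCAB_SIZE = 2 ** 16  # 65536 ids, matching the observed token range 0-65535


def load_entries(path: str) -> list[Any]:
    """Load the dictionary, ASSUMING it is a JSON list (not confirmed in this thread)."""
    with open(path) as f:
        return json.load(f)


def split_by_vocab(
    entries: list[Any], vocab_size: int = VOCAB_SIZE
) -> tuple[dict[int, Any], list[Any]]:
    """ASSUMING entry i maps to token id i, return (token id -> entry, leftover entries)."""
    in_vocab = {token: entry for token, entry in enumerate(entries[:vocab_size])}
    leftover = entries[vocab_size:]
    return in_vocab, leftover


# Toy stand-in for dictionary.json; "code_string" is a hypothetical field name.
toy_entries = [{"code_string": f"SNOMED/{1000 + i}"} for i in range(10)]
token_to_entry, unused = split_by_vocab(toy_entries, vocab_size=4)

print(token_to_entry[2])   # entry that would sit behind token id 2
print(len(unused))         # entries that no token id could reach

# With the numbers from this thread, this many concepts would have no token id:
concepts_without_token = 1729229 - VOCAB_SIZE
print(concepts_without_token)
```

With the real file, `load_entries("EHRSHOT_ASSETS/models/clmbr/dictionary.json")` would replace the toy list. If the tokens emitted by BatchLoader do not line up with the entries this returns, the positional assumption is wrong and the mapping is stored some other way.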