jbdel / vilmedic

ViLMedic (Vision-and-Language medical research) is a modular framework for vision-and-language multimodal research in the medical field.
MIT License

Content Consulting, about R2Gen #21

XNLHZ closed 4 months ago

XNLHZ commented 5 months ago

Thanks for the codebase. I would like to ask whether the newer implementation of the R2Gen model is included in it.

jbdel commented 4 months ago

Hello,

While the architecture of R2Gen is not directly implemented in ViLMedic, you can replicate similar results with this simple baseline:

python bin/train.py config/RRG/baseline-mimic.yml \
    dataset.seq.processing=ifcc_clean_report \
    dataset.image.root=data/RRG/mimic-cxr/findings/ \
    dataset.seq.root=data/RRG/mimic-cxr/findings/ \
    dataset.seq.file=findings.tok \
    dataset.seq.tokenizer_max_len=128 \
    dataset.image.file=image.tok \
    dataset.image.image_path=data/images/ \
    dataset.image.multi_image=3 \
    model.cnn.backbone=densenet121 \
    model.cnn.visual_projection.in_features=1024 \
    model.cnn.visual_projection.out_features=768 \
    trainor.batch_size=16 \
    trainor.grad_accu=8 \
    trainor.optim_params.lr=0.0003 \
    trainor.optimizer=Adam \
    trainor.early_stop_metric=bertscore \
    trainor.early_stop=10 \
    validator.batch_size=8 \
    validator.beam_width=2 \
    validator.metrics='[bertscore]' \
    validator.splits='[validate]' \
    ckpt_dir=ckpt \
    name=nll_findings_bertscore_128
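
For readers unfamiliar with the `key.subkey=value` arguments in the command above, here is a minimal sketch of how such dotted overrides can be read as a nested configuration. The `parse_overrides` helper is hypothetical and only illustrates the idea; ViLMedic's own config loader may behave differently (for example, it may convert value types). The batch-size arithmetic assumes that `trainor.grad_accu` is the number of gradient-accumulation steps, which the option name suggests but this thread does not confirm.

```python
from typing import Any, Dict, List


def parse_overrides(overrides: List[str]) -> Dict[str, Any]:
    """Turn ["a.b=1", "a.c=x"] into {"a": {"b": "1", "c": "x"}}.

    Hypothetical helper for illustration; values are kept as strings.
    """
    config: Dict[str, Any] = {}
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# A few of the overrides from the command above.
overrides = [
    "trainor.batch_size=16",
    "trainor.grad_accu=8",
    "validator.beam_width=2",
    "name=nll_findings_bertscore_128",
]
config = parse_overrides(overrides)

# Assumption: grad_accu counts accumulation steps, so one optimizer update
# would see batch_size * grad_accu training samples.
effective_batch = int(config["trainor"]["batch_size"]) * int(config["trainor"]["grad_accu"])
print(effective_batch)  # 128
```

Under that assumption, the settings above correspond to an effective batch of 128 samples per optimizer update while only 16 samples are held in memory at a time.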

You can generate the `.tok` files using the scripts in `data/make_datasets/`.
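
Once the files are generated, a quick sanity check can catch a mismatch between reports and images before a long training run. The sketch below rests on two assumptions not stated in this thread: that the report file and the image file are line-aligned (line i of each describes the same study), and that multiple image paths on one line are comma-separated. The `check_alignment` helper and the toy file contents are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def check_alignment(seq_file: str, image_file: str) -> Tuple[int, int]:
    """Return (number of examples, largest number of images on one line).

    Assumes one example per line and comma-separated image paths.
    """
    seq_lines = Path(seq_file).read_text(encoding="utf-8").splitlines()
    image_lines = Path(image_file).read_text(encoding="utf-8").splitlines()
    if len(seq_lines) != len(image_lines):
        raise ValueError(
            f"{seq_file} has {len(seq_lines)} lines but "
            f"{image_file} has {len(image_lines)}"
        )
    # Count the non-empty paths on each image line.
    images_per_line = [
        len([p for p in line.split(",") if p.strip()]) for line in image_lines
    ]
    return len(seq_lines), max(images_per_line, default=0)


# Toy demonstration with made-up contents written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    seq_path = Path(tmp) / "findings.tok"
    img_path = Path(tmp) / "image.tok"
    seq_path.write_text("first toy report\nsecond toy report\n", encoding="utf-8")
    img_path.write_text("a.jpg,b.jpg\nc.jpg\n", encoding="utf-8")
    demo_result = check_alignment(str(seq_path), str(img_path))
    print(demo_result)  # (2, 2)
```

If the largest number of images per line exceeds the value passed as `dataset.image.multi_image`, it is worth checking how the dataset class handles the extra images.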
Best,