taokz / BiomedGPT

BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks
Apache License 2.0

Issue about preparing the dataset for pre-training #10

Closed nghiemkythu closed 7 months ago

nghiemkythu commented 7 months ago

Hello, thank you for your interesting work. I am trying to pre-train this model on my own large-scale medical dataset (the dataset is for VQA, so I can only create the vision_language.tsv file). However, when I run following the instructions, I see that the model needs the negative_sample folder. How can I create the files in this folder (all_captions.txt, object.txt, and type2ans.json)?
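For reference, here is a minimal sketch of how files like all_captions.txt and type2ans.json might be generated from a custom VQA dataset. The row layout (image id, question, answer, question type) and the choice to reuse questions as the caption pool are assumptions for illustration, not the repo's documented schema:

```python
import json
import os


def build_negative_sample_files(rows, out_dir="."):
    """Write all_captions.txt and type2ans.json from VQA-style rows.

    Each row is assumed to be (image_id, question, answer, question_type);
    the actual column layout of vision_language.tsv may differ.
    Returns the question-type -> answer-list mapping.
    """
    captions = []   # assumption: questions serve as the caption pool
    type2ans = {}   # maps each question type to its unique answers
    for image_id, question, answer, qtype in rows:
        captions.append(question)
        answers = type2ans.setdefault(qtype, [])
        if answer not in answers:
            answers.append(answer)

    with open(os.path.join(out_dir, "all_captions.txt"), "w") as f:
        f.write("\n".join(captions))
    with open(os.path.join(out_dir, "type2ans.json"), "w") as f:
        json.dump(type2ans, f)
    return type2ans
```

The mapping returned here mirrors the role type2ans.json plays in OFA-style VQA pipelines (candidate answers grouped by question type), but the exact keys BiomedGPT expects should be checked against the downloaded reference files.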

taokz commented 7 months ago

You can download the files directly via this link, or you can create them yourself to meet your requirements.

nghiemkythu commented 7 months ago

Thank you for your answer. I see that the negative_sample folder at this link is nearly the same as the folder in the original OFA. I am not sure whether I can use this folder for a medical pre-training task.

taokz commented 7 months ago

I also used the same negative samples, and the experimental results show that this works. However, I share your concern, and I think customizing in-domain negative samples would very likely improve performance further.
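The idea of in-domain negatives can be sketched as follows: for each image-caption pair, draw a negative caption from a pool restricted to the same domain (here, medical captions), excluding the positive itself. This sampling strategy is an illustration, not BiomedGPT's actual pipeline:

```python
import random


def sample_in_domain_negative(pool, positive, rng=None):
    """Pick a caption from an in-domain pool that differs from the positive.

    `pool` is assumed to be a list of captions from the same (medical)
    domain, e.g. the lines of a customized all_captions.txt.
    """
    rng = rng or random.Random()
    candidates = [c for c in pool if c != positive]
    if not candidates:
        raise ValueError("pool contains no caption other than the positive")
    return rng.choice(candidates)
```

Compared with reusing OFA's general-domain captions, an in-domain pool makes the negatives harder to distinguish from the positive, which is the usual motivation for domain-specific negative sampling.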

nghiemkythu commented 6 months ago

Thank you for your answer. I will try this folder first and check whether it runs well.

taokz commented 6 months ago

@nghiemkythu You are very welcome! If you have further questions, feel free to let me know! I should update this repo soon, because I achieved better performance on downstream tasks and evaluated the model on more datasets. However, I am busy with my qualifying exam and will update the code once I complete it.