starmpcc / Asclepius

Official Codes for "Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes"

README instructions very vague and not helpful #3

Closed: spadenbargo closed this issue 9 months ago

spadenbargo commented 10 months ago

There appears to be a nice system of using letters for filenames ({A}... and so forth), but there is no explanation of what these files are, what they do, or how they are shaped.

It isn't explicit anywhere how the training data is consumed, since it appears to depart from Alpaca's sample instruction-training JSON: {"instruction":,"input":,"output":}

starmpcc commented 10 months ago

Thank you for your interest in our project.

The {A} represents the save path for the first preprocessing step, so you can define it as you see fit.

Our dataset is utilized during the pretraining and instruction fine-tuning stages. Could you provide more specifics regarding your second question? As for the prompt, you can find it here: https://github.com/starmpcc/Asclepius/blob/15d0b9fef3562b83e59220ca03232a7b0c358f42/src/utils.py#L58-L70
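
For context, an Alpaca-style prompt for this dataset would roughly render each {"note", "question"} pair into a single instruction string, with the "answer" field serving as the target. The sketch below is an illustrative assumption only (the template wording and the build_prompt helper are hypothetical); the authoritative template is in src/utils.py at the link above.

    # Illustrative sketch, not the repository's actual code: the real template lives in src/utils.py.
    PROMPT_TEMPLATE = (
        "Below is a clinical note and a question from a healthcare professional.\n\n"
        "[Note Begin]\n{note}\n[Note End]\n\n"
        "[Question Begin]\n{question}\n[Question End]\n\n"
        "Answer:"
    )

    def build_prompt(example: dict) -> str:
        # The "answer" field of each record would serve as the training target.
        return PROMPT_TEMPLATE.format(note=example["note"], question=example["question"])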

spadenbargo commented 10 months ago

Hi, thanks for the great work on this. I am working to further fine-tune the model on a downstream task with my own set of instructions, rather than pretraining and regenerating the whole model from scratch. However, I am not sure what format this instruction set should take.

Could I just provide {I} with a JSON file containing an array of {"note": "foo", "question": "bar", "answer": "foobar"} records?

Instruction fine-tuning step:

    $ torchrun --nproc_per_node=8 --master_port={YOUR_PORT} \
        src/instruction_ft.py \
        --model_name_or_path {I} \
        --data_path {G} \
        --bf16 True \
        --output_dir ./checkpoints \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "epoch" \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --fsdp "full_shard auto_wrap" \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --ddp_timeout 18000

starmpcc commented 10 months ago

Yes, the format you provided {"note": "foo", "question": "bar", "answer": "foobar"} is correct.

If you are encountering a JSON-related error, try saving the JSON file using the following line: df.to_json(args.save_path, orient="records", indent=4)
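
As a minimal, self-contained sketch of producing such a file (the example records and the instruction_data.json path are hypothetical; the field names and the to_json arguments come from this thread):

    import pandas as pd

    # Hypothetical example records in the {"note", "question", "answer"} format confirmed above.
    records = [
        {
            "note": "Patient admitted with chest pain and shortness of breath.",
            "question": "What was the chief complaint on admission?",
            "answer": "Chest pain and shortness of breath.",
        },
    ]

    # Save with the arguments suggested above so the file is a records-oriented JSON array.
    df = pd.DataFrame(records)
    df.to_json("instruction_data.json", orient="records", indent=4)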