BioFEG

Data Preparation

Create a data folder.
Download the datasets from https://github.com/dhdhagar/arboEL and place them under data.
Prepare the data: python data_process.py
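
Before moving on to training, it can help to confirm that the paths used by the later commands exist. The following is a minimal sketch: the two paths are copied from the commands in this README, but whether data_process.py creates exactly these paths is an assumption, and the helper name missing_paths is hypothetical.

```python
from pathlib import Path

# Paths copied from the training/eval commands in this README.
# Assumption: data_process.py is what produces them.
EXPECTED = [
    "data/medmentions/processed",
    "data/medmentions/documents/all_documents.json",
]


def missing_paths(root=".", expected=EXPECTED):
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [p for p in expected if not (base / p).exists()]
```

Calling missing_paths() from the repository root returns an empty list when both paths are present, and the list of absent paths otherwise.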

Run

Biencoder

We train the biencoder iteratively: train biencoder -> train GAN -> generate latent features -> finetune biencoder. The first iteration trains on in-batch negatives; the following iterations train on hard negatives.
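
The order of stages across iterations can be sketched as follows. The script names are taken from the commands in this README; the schedule function, the STAGES table, and the choice of two iterations are illustrative assumptions, not part of the repository.

```python
# Script names come from the commands in this README.
# The grouping below and the helper function are hypothetical.
STAGES = {
    "in-batch": [
        "blink/biencoder/train_biencoder.py",
        "blink/biencoder/train_gan.py",
        "blink/biencoder/generate_features.py",
        "blink/biencoder/finetune.py",
    ],
    "hard": [
        "blink/biencoder/train_biencoder_hard.py",
        "blink/biencoder/train_gan.py",
        "blink/biencoder/generate_features_hard.py",
        "blink/biencoder/finetune_hard.py",
    ],
}


def schedule(num_iterations):
    """Return (iteration, negatives, script) triples in execution order."""
    plan = []
    for it in range(num_iterations):
        # Only the first iteration uses in-batch negatives.
        negatives = "in-batch" if it == 0 else "hard"
        for script in STAGES[negatives]:
            plan.append((it, negatives, script))
    return plan


for it, negatives, script in schedule(2):
    print(f"iteration {it} ({negatives} negatives): {script}")
```

The sketch only lists the order of the scripts; each one still needs the arguments shown in the commands in this README.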

train biencoder

train biencoder on in-batch negatives: PYTHONPATH=. python blink/biencoder/train_biencoder.py --data_path data/medmentions/processed --output_path models/medmentions/biencoder --learning_rate 1e-05 --num_train_epochs 1 --train_batch_size 64 --eval_batch_size 32 --bert_model SapBERT-from-PubMedBERT-fulltext --type_optimization all_encoder_layers

train biencoder on hard negatives: PYTHONPATH=. python blink/biencoder/train_biencoder_hard.py --data_path models/medmentions/finetune/top64_candidates --output_path models/medmentions/biencoder --path_to_model models/medmentions/finetune/pytorch_model.bin --learning_rate 1e-05 --num_train_epochs 1 --train_batch_size 4 --eval_batch_size 4 --bert_model SapBERT-from-PubMedBERT-fulltext --type_optimization all_encoder_layers
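
For readers unfamiliar with in-batch negatives, here is a minimal sketch of the usual objective in BLINK-style biencoders: each mention is scored against every entity in the batch, and the entity at the same index is treated as the gold label. This is written in plain Python for illustration and is an assumption about the general technique, not a copy of the loss in train_biencoder.py; the function name is hypothetical.

```python
import math


def in_batch_negative_loss(mention_vecs, entity_vecs):
    """Mean cross-entropy where the gold entity of mention i is entity i.

    Every other entity in the batch acts as a negative for mention i.
    """
    assert len(mention_vecs) == len(entity_vecs)
    total = 0.0
    for i, m in enumerate(mention_vecs):
        # Dot-product score of mention i against every entity in the batch.
        scores = [sum(a * b for a, b in zip(m, e)) for e in entity_vecs]
        # Numerically stable log-sum-exp over the batch.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the gold (same-index) entity.
        total += log_z - scores[i]
    return total / len(mention_vecs)
```

With in-batch negatives, the number of negatives per mention grows with the batch size. The hard-negative command above uses a much smaller batch size (4 instead of 64) and reads its data from a top64_candidates directory.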

train gan

PYTHONPATH=. python blink/biencoder/train_gan.py --data_path data/medmentions/processed --output_path models/medmentions/gan --path_to_model models/medmentions/biencoder/pytorch_model.bin --learning_rate 5e-05 --num_train_epochs 80 --train_batch_size 64 --eval_batch_size 128 --bert_model SapBERT-from-PubMedBERT-fulltext --type_optimization all_encoder_layers

generate latent features

generate latent features on in-batch negatives: PYTHONPATH=. python blink/biencoder/generate_features.py --data_path data/medmentions/processed --entity_dict_path data/medmentions/documents/all_documents.json --gan_path models/gan/epoch_49 --output_path models/medmentions/gan --path_to_model models/medmentions/biencoder/pytorch_model.bin --encode_batch_size 64 --top_k 64 --bert_model SapBERT-from-PubMedBERT-fulltext

generate latent features on hard negatives: PYTHONPATH=. python blink/biencoder/generate_features_hard.py --data_path data/medmentions/processed --entity_dict_path data/medmentions/documents/all_documents.json --gan_path models/gan/epoch_49 --output_path models/medmentions/gan --path_to_model models/medmentions/biencoder/pytorch_model.bin --encode_batch_size 64 --top_k 64 --bert_model SapBERT-from-PubMedBERT-fulltext
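
Both commands pass --top_k 64. Assuming this is the number of candidate entities kept per query vector, top-k selection by dot-product score can be sketched as below. The function name and the plain-list representation of the vectors are hypothetical; the actual scripts may select candidates differently.

```python
import heapq


def top_k_candidates(query_vec, entity_vecs, k=64):
    """Return the indices of the k highest-scoring entities, best first."""
    # Dot-product score of the query against every entity vector.
    scores = [sum(q * e for q, e in zip(query_vec, vec)) for vec in entity_vecs]
    # nlargest returns at most k indices, ordered by descending score.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
```

If fewer than k entities are available, all of them are returned in score order.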

finetune biencoder

finetune biencoder on in-batch negatives: PYTHONPATH=. python blink/biencoder/finetune.py --data_path data/medmentions/processed --generate_data_path models/medmentions/gan/candidates_50/ --output_path models/medmentions/finetune --path_to_model models/medmentions/finetune/pytorch_model.bin --learning_rate 1e-09 --num_train_epochs 1 --train_batch_size 64 --eval_batch_size 32 --bert_model SapBERT-from-PubMedBERT-fulltext --type_optimization all_encoder_layers

finetune biencoder on hard negatives: PYTHONPATH=. python blink/biencoder/finetune_hard.py --data_path data/medmentions/processed --generate_data_path models/medmentions/gan/candidates_50/ --output_path models/medmentions/finetune --path_to_model models/medmentions/finetune/pytorch_model.bin --learning_rate 1e-09 --num_train_epochs 1 --train_batch_size 64 --eval_batch_size 32 --bert_model SapBERT-from-PubMedBERT-fulltext --type_optimization all_encoder_layers

eval biencoder

We evaluate the biencoder after each training or finetuning step: PYTHONPATH=. python blink/biencoder/eval_biencoder.py --data_path data/medmentions/processed --entity_dict_path data/medmentions/documents/all_documents.json --output_path models/medmentions --path_to_model models/medmentions/finetune/pytorch_model.bin --encode_batch_size 64 --eval_batch_size 8 --top_k 64 --bert_model SapBERT-from-PubMedBERT-fulltext --mode train,valid,test,test_seen,test_unseen
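
A common way to judge a biencoder in retrieval-based entity linking is recall@k: the fraction of mentions whose gold entity appears among the top k retrieved candidates. The sketch below assumes that metric and a simple list-based input. It is not taken from eval_biencoder.py, and the function name is hypothetical.

```python
def recall_at_k(ranked_candidates, gold_ids, k=64):
    """Fraction of mentions whose gold entity is among the first k candidates.

    ranked_candidates[i] is the ranked list of entity ids for mention i,
    and gold_ids[i] is the gold entity id for the same mention.
    """
    hits = sum(
        1
        for cands, gold in zip(ranked_candidates, gold_ids)
        if gold in cands[:k]
    )
    return hits / len(gold_ids)
```

Only candidates that the biencoder retrieves can be reranked by the cross-encoder, so recall at the chosen top_k bounds the accuracy the second stage can reach.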

Cross-encoder

PYTHONPATH=. python blink/crossencoder/train_cross.py --data_path models/medmentions/top64_candidates/ --output_path models/medmentions/crossencoder --learning_rate 2e-05 --num_train_epochs 5 --train_batch_size 1 --eval_batch_size 1 --bert_model SapBERT-from-PubMedBERT-fulltext --type_optimization all_encoder_layers --add_linear

If you use our code in your work, please cite us.

Xuhui Sui, Ying Zhang, Xiangrui Cai, Kehui Song, Baohang Zhou, Xiaojie Yuan and Wensheng Zhang. BioFEG: Generate Latent Features for Biomedical Entity Linking. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023).
