studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Has LUKE been tested on the FIGER dataset? #133

Closed lshowway closed 2 years ago

lshowway commented 2 years ago

The FIGER dataset is an entity typing dataset, similar to OpenEntity but with 2 million training samples. I wonder, have you tested LUKE on this dataset? I just changed the data_dir from ../data/OpenEntity to ../data/FIGER, but I cannot get the expected results. Could you give me some advice?

ryokan0123 commented 2 years ago

Hi @lshowway. I don't think we have tried the FIGER dataset.

If you want to try a different dataset and it has a data format different from OpenEntity, you need to modify the reader code accordingly. https://github.com/studio-ousia/luke/blob/eff5d0ae528c544aa1d6e7b51bfcd76992d266bf/examples/legacy/entity_typing/utils.py#L40
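For illustration, here is a minimal, self-contained sketch of what such a reader could look like for a FIGER dump, assuming a JSON layout similar to OpenEntity (a list of objects with sent/start/end/labels fields). The field names, paths, and the example class below are hypothetical, so adapt them to whatever the actual utils.py code expects.

```python
# A minimal, self-contained sketch of a reader for FIGER-style entity typing
# data. It assumes the same JSON layout as OpenEntity (a list of objects with
# "sent", "start", "end" and "labels" keys); the real field names in your dump
# may differ, so adjust the keys below accordingly.
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EntityTypingExample:
    text: str                 # the full sentence
    span: Tuple[int, int]     # character offsets of the target mention
    labels: List[str]         # gold entity types


def load_figer_examples(json_path: str) -> List[EntityTypingExample]:
    """Load FIGER-format examples from a JSON file containing a list of objects."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    examples = []
    for item in data:
        examples.append(
            EntityTypingExample(
                text=item["sent"],
                span=(item["start"], item["end"]),
                labels=item["labels"],
            )
        )
    return examples


if __name__ == "__main__":
    # Hypothetical path; point this at your converted FIGER split.
    examples = load_figer_examples("../data/FIGER/train.json")
    print(f"{len(examples)} training examples loaded")
```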

lshowway commented 2 years ago

@Ryou0634 Thank you very much. FIGER has the same data format as OpenEntity, but it has 2 million samples, and I am not sure how to set the batch size. The batch size on OpenEntity is 4, which seems too small for FIGER.

ryokan0123 commented 2 years ago

I see. Maybe you could follow other pretrained LM papers that use the FIGER dataset. For example, I've just found that this paper (ERNIE: Enhanced Language Representation with Informative Entities) mentions fine-tuning on the FIGER dataset as follows.

We also evaluate ERNIE on the distantly supervised dataset, i.e., FIGER (Ling et al., 2015). As the powerful expression ability of deeply stacked Transformer blocks, we found small batch size would lead the model to overfit the training data. Hence, we use a larger batch size and less training epochs to avoid overfitting, and keep the range of learning rate unchanged, i.e., batch size: 2048, number of epochs: 2, 3.

So a larger batch size (say 2048) could help in this setting.
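If a batch of 2048 does not fit in GPU memory, a common way to emulate it is gradient accumulation. Below is a generic PyTorch sketch (not the LUKE training script; the model, loader, and optimizer are placeholders), assuming multi-label entity typing trained with a BCE-with-logits loss.

```python
# A generic PyTorch sketch (not the LUKE training script) of reaching a large
# effective batch size, e.g. 2048, via gradient accumulation when the GPU only
# fits a small per-step batch. Model, loader and optimizer are placeholders.
import torch

per_device_batch_size = 32                      # what fits in GPU memory
target_batch_size = 2048                        # effective batch size to emulate
accumulation_steps = target_batch_size // per_device_batch_size


def train_one_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(inputs)
        # Multi-label entity typing is usually trained with BCE-with-logits;
        # labels are a float tensor of 0/1 type indicators.
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
        # Scale the loss so gradients average over the accumulated steps.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

Combined with only a few epochs (2 or 3, as in the ERNIE setup quoted above) and an unchanged learning-rate range, this should stay close to what that paper describes.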

lshowway commented 2 years ago

@Ryou0634 Thanks for your detailed reply; it really solves my problem.