Closed lshowway closed 2 years ago
Hi @lshowway. I don't think we have tried the FIGER dataset.
If you want to try a different dataset and it has a data format different from OpenEntity, you need to modify the reader code accordingly. https://github.com/studio-ousia/luke/blob/eff5d0ae528c544aa1d6e7b51bfcd76992d266bf/examples/legacy/entity_typing/utils.py#L40
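To make the needed change concrete, here is a minimal sketch of what such a reader might look like. The field names ("sent", "start", "end", "labels") follow the OpenEntity convention; if your dataset uses different keys or a different file layout, this is where you would adapt them. The function name is illustrative, not taken from the repo.

```python
import json

def load_entity_typing_examples(path):
    """Hypothetical reader for an OpenEntity-style entity-typing file.

    Assumes each split is a single JSON array of objects with
    "sent", "start", "end", and "labels" fields, which is the
    OpenEntity layout; adapt the keys below for other datasets.
    """
    with open(path) as f:
        data = json.load(f)
    examples = []
    for item in data:
        examples.append({
            "text": item["sent"],
            # (start, end) offsets of the entity mention in the sentence
            "span": (item["start"], item["end"]),
            "labels": item["labels"],
        })
    return examples
```

If FIGER really does share the OpenEntity format, the existing reader should work unchanged and only the data directory needs to point at the new files.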
@Ryou0634 Thank you very much. FIGER has the same data format as OpenEntity, but it has 2 million samples, and I'm not sure how to set the batch size. The batch size used for OpenEntity is 4, which seems too small for FIGER.
I see. Maybe you could follow other pretrained LM papers that use the FIGER dataset. For example, I've just found that this paper (ERNIE: Enhanced Language Representation with Informative Entities) describes fine-tuning on FIGER as follows.
We also evaluate ERNIE on the distantly supervised dataset, i.e., FIGER (Ling et al., 2015). As the powerful expression ability of deeply stacked Transformer blocks, we found small batch size would lead the model to overfit the training data. Hence, we use a larger batch size and less training epochs to avoid overfitting, and keep the range of learning rate unchanged, i.e., batch size: 2048, number of epochs: 2, 3.
So a larger batch size (say 2048) could help in this setting.
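If a batch of 2048 does not fit in GPU memory, a common workaround is gradient accumulation: run several small micro-batches and step the optimizer only once their gradients have been summed. The helper below is a hypothetical sketch (none of these names come from the LUKE codebase) showing the arithmetic for picking the number of accumulation steps.

```python
import math

def accumulation_steps(target_batch: int, device_batch: int) -> int:
    """Number of micro-batches to accumulate so that
    device_batch * accumulation_steps >= target_batch.
    `device_batch` is whatever fits in GPU memory."""
    return math.ceil(target_batch / device_batch)

# In a training loop you would call loss.backward() on every
# micro-batch, but optimizer.step() / optimizer.zero_grad() only
# once every `accumulation_steps(...)` micro-batches, and typically
# divide the loss by that count so gradient magnitudes match a
# true large-batch update.
```

For example, with a per-device batch of 32, reaching an effective batch of 2048 takes 64 accumulated micro-batches per optimizer step.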
@Ryou0634 Thanks for your detailed reply, it really solves my problems.
The FIGER dataset is an entity typing dataset, similar to OpenEntity but with 2 million training samples. I wonder: have you tested LUKE on this dataset? I just changed the data_dir from ../data/OpenEntity to ../data/FIGER, but I cannot get the expected results. Could you give me some advice?