sunlab-osu / TURL

Code and data for "TURL: Table Understanding through Representation Learning"
Apache License 2.0

Missing a field in Entity Linking datasets #23

Open dalek-who opened 11 months ago

dalek-who commented 11 months ago

Here is the data example of EL provided in the README:

'23235546-1', # table id
'Ivan Lendl career statistics', # page title
'Singles: 19 finals (8 titles, 11 runner-ups)', # section title
'', # caption
['outcome', 'year', ...], # headers
[[[0, 4], 'Björn Borg'], [[9, 2], 'Wimbledon'], ...], # cells, [index, entity mention (cell text)]
[['Björn Borg', 'Swedish tennis player', []], ['Björn Borg', 'Swedish swimmer', ['Swimmer']], ...], # candidate entities, this is the merged set for all cells. [entity name, entity description, entity types]
[0, 12, ...], # labels, this is the index of the gold entity in the candidate entities
[[0, 1, ...], [11, 12, 13, ...], ...] # candidates for each cell

However, the final field:

[[0, 1, ...], [11, 12, 13, ...], ...] # candidates for each cell

is provided only in the test split; it is missing from the train and dev splits. How can this field be generated?

belerico commented 9 months ago

I'm trying to understand the same here...

cc @xiang-deng @huan-sunrise

xiang-deng commented 9 months ago

Hi, as you can see in https://github.com/sunlab-osu/TURL/blob/bfec92e942a648695b3910aab42a6f0b679d37fc/data_loader/EL_data_loaders.py#L28, the field is not used for training. If I recall correctly, when tuning the model, I compute the loss against all candidates for the table rather than for individual cells, as it is more efficient.
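To illustrate what computing the loss against all candidates for the table could look like, here is a minimal sketch. It is not the repository's code: the function name `table_level_cross_entropy` and the `scores` layout (one row per cell, one column per entry in the merged candidate list) are assumptions made for illustration only.

```python
import math

def table_level_cross_entropy(scores, labels):
    """Mean cross-entropy where each cell is scored against the whole
    merged candidate set of the table (hypothetical layout)."""
    total = 0.0
    for row, gold in zip(scores, labels):
        # Numerically stable log-sum-exp over ALL table-level candidates.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[gold]
    return total / len(labels)

# Two cells, three candidates in the merged set (made-up scores).
scores = [[2.0, 0.5, -1.0],
          [0.1, 0.2, 3.0]]
labels = [0, 2]  # index of the gold entity in the merged candidate list
loss = table_level_cross_entropy(scores, labels)
print(loss)
```

Because every cell shares the same candidate axis, no per-cell candidate list is needed to compute this loss, which is consistent with the field being absent from the train and dev splits.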

The field is used at test time to compute the final metric, i.e., if the model predicts something that is not in the candidate set associated with the specific cell, we can ignore it. As such, we only provide it for the test set. The logic is in evaluate_task.ipynb and data_processing.ipynb.
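One plausible reading of this test-time step, sketched below, is to restrict each cell's prediction to its own candidate indices before comparing with the gold label. The helper names `predict_with_cell_candidates` and `accuracy` and the score values are hypothetical; the actual logic lives in the notebooks mentioned above and may differ.

```python
def predict_with_cell_candidates(scores, cell_candidates):
    """For each cell, pick the best-scoring entity among that cell's own
    candidates, ignoring candidates that belong only to other cells."""
    predictions = []
    for row, allowed in zip(scores, cell_candidates):
        if not allowed:
            # No candidates for this cell: no prediction can be made.
            predictions.append(None)
            continue
        predictions.append(max(allowed, key=lambda idx: row[idx]))
    return predictions

def accuracy(predictions, labels):
    """Fraction of cells whose prediction matches the gold index."""
    correct = sum(1 for p, g in zip(predictions, labels) if p == g)
    return correct / len(labels)

# Scores over a merged set of four candidates for two cells (made-up values).
scores = [[0.1, 0.9, 0.3, 2.0],
          [1.5, 0.2, 0.4, 0.1]]
cell_candidates = [[0, 1], [2, 3]]  # per-cell candidate indices (the test-only field)
labels = [1, 2]

predictions = predict_with_cell_candidates(scores, cell_candidates)
print(predictions, accuracy(predictions, labels))
```

In this toy example the unrestricted argmax for the first cell would be index 3, which is not among that cell's candidates; restricting to the per-cell list discards it.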

Let me know if you have other questions.