XiaohanYa opened this issue 2 years ago
Hi,
Thank you for your interest in our research!
The code snippet you are referring to concerns the data processing for the pre-training step of the deepmatcher datasets, as you stated.
We state in the paper that we build the correspondence graph using the available training and validation pairs (but not the test pairs). The code snippet you linked does indeed also use the test pairs, which I agree can be confusing. In this processing file I used everything in order to obtain the perfect mapping for later analysis purposes.
But the matching information from the test pairs is never used during the actual contrastive pre-training. If you have a look at the datasets.py file here, you will see that I regenerate the cluster ids before pre-training, this time using only the training and validation pairs, so the test labels are never used.
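To make the idea concrete, here is a minimal sketch of how cluster ids can be derived from the known pairs only; this is an illustration, not the code from datasets.py, and the helper name `assign_cluster_ids` is made up for this example. Matched train/validation pairs are treated as edges and connected components become clusters, so two test-set entities whose match is hidden end up with different cluster ids.

```python
# Hedged sketch: cluster ids via union-find over train/valid match pairs only.
def assign_cluster_ids(entity_ids, matched_pairs):
    """entity_ids: all entity ids from both tables.
    matched_pairs: (id_a, id_b) positives from training/validation only."""
    parent = {e: e for e in entity_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in matched_pairs:
        union(a, b)

    # Map each connected-component root to a dense cluster id.
    roots, cluster_ids = {}, {}
    for e in entity_ids:
        r = find(e)
        cluster_ids[e] = roots.setdefault(r, len(roots))
    return cluster_ids


# Entities 1/2 are linked by a training pair; 3/4 are only linked by a test
# pair, which is not passed in, so they keep separate cluster ids.
print(assign_cluster_ids([1, 2, 3, 4], [(1, 2)]))  # {1: 0, 2: 0, 3: 1, 4: 2}
```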
If your question is more about why we include entity descriptions from the test set at all during pre-training instead of leaving them out: this is due to the general task setup. We have two tables we want to match, so all relevant entity descriptions are known beforehand and can be used for pre-training. But we do take care not to leak any entity matching information between the two tables that is contained in the final test set. This circumstance is also why we need to apply source-aware sampling, as otherwise any unknown linked entity descriptions (among which are the test pairs, which receive different cluster_ids even though they describe the same entity) in each table would be explicitly pushed apart during contrastive pre-training.
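The following is a rough, hedged sketch of how source-aware batch sampling can avoid that effect, based only on the description above; the actual implementation lives in datasets.py and may differ, and the function name `source_aware_batch` is purely illustrative. A batch is anchored on one table, and descriptions from the other table are only added when they share a cluster id, i.e. are known train/validation matches, so an unknown test-pair partner never appears as an in-batch negative.

```python
# Hedged sketch of source-aware batch construction (not the repository's code).
import random


def source_aware_batch(table_a, table_b, cluster_of, batch_size, rng=random):
    """table_a / table_b: lists of entity ids per source table.
    cluster_of: entity id -> cluster id, built from train/valid pairs only."""
    anchors = rng.sample(table_a, k=min(batch_size, len(table_a)))
    anchor_clusters = {cluster_of[e] for e in anchors}
    # Only table-B entities whose cluster is already represented join the batch;
    # unmatched (e.g. hidden test-pair) entities from table B are left out.
    partners = [e for e in table_b if cluster_of[e] in anchor_clusters]
    return anchors + partners


# Toy example reusing the cluster ids from the sketch above: entity 2 (table B)
# is a known match of entity 1 (table A), while 4 is an unknown test match of 3.
cluster_of = {1: 0, 2: 0, 3: 1, 4: 2}
print(source_aware_batch([1, 3], [2, 4], cluster_of, batch_size=2))
# e.g. [1, 3, 2] -- entity 4 never enters the batch, so it is not pushed apart from 3
```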
I hope this answers your question; please let me know if something needs clarification!
Hi,
Thank you for sharing your code. I wonder if you could explain why you are using the test set for training in the pre-training step here. Many thanks.