neozhangthe1 / disambiguation

What is the meaning of the dataset? #2

Open zzbzzb1413 opened 5 years ago

zzbzzb1413 commented 5 years ago

Hello! Thanks for sharing. Could you explain the meaning of the two input files, name_to_pubs_train and name_to_pubs_test?

zfjsail commented 5 years ago

name_to_pubs_train contains person-to-paper mappings and is used to train the global metric learning model and the cluster size estimation model. name_to_pubs_test is used for evaluation. Please see our paper for details.

zzbzzb1413 commented 5 years ago

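Thank you for your reply! I couldn't explain this question clearly in English, so I wrote it in Chinese, haha. Looking at the two name_to_pubs files, the outermost layer is a dictionary. Its keys are person names (authors) and its values are also dictionaries (the inner dictionaries). The key and value of an inner dictionary are, respectively, an encoded string and a list of encoded strings.

The inner dictionary is what confuses me. What do its keys and values represent? Is the key a conference, and is each element of the value list (e.g. XXX-1) a paper from that conference, with XXX-1 meaning the first author of paper XXX? Also, how are these encoded IDs obtained? Could I use the paper and conference names directly instead?

Thank you very much!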

zfjsail commented 5 years ago

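The key of the inner dictionary is a person ID, and the value is the list of IDs of the papers published by that person. In a paper ID such as XXX-1, the suffix gives the author's position in that paper's author list, counted from zero (so XXX-1 is the second author of paper XXX).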

name_to_pubs_train_500.json: This file can be used as training data. It contains the name-person-paper mapping relations.

Data schema: This file is a dictionary (denoted dic1) saved as a JSON object. The keys of dic1 are author names. The values of dic1 are person dictionaries (denoted dic2). The keys of dic2 are person IDs. The values of dic2 are lists of the IDs of the papers authored by that person.

name_to_pubs_test_100.json: This file can be used as test data. It contains the name-person-paper mapping relations, and its data schema is the same as that of name_to_pubs_train_500.json.
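
To make the schema concrete, here is a minimal Python sketch of how the nested dictionary could be read and traversed. The helper names (`load_name_to_pubs`, `split_paper_id`, `summarize`) and the toy IDs are made up for illustration and are not part of the repository. The sketch also assumes that the author position is whatever follows the last hyphen in a paper ID.

```python
import json
from typing import Dict, List, Tuple

# author name -> person ID -> list of paper IDs (dic1 -> dic2 in the schema)
NameToPubs = Dict[str, Dict[str, List[str]]]


def load_name_to_pubs(path: str) -> NameToPubs:
    """Read one of the name_to_pubs JSON files into a nested dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def split_paper_id(paper_id: str) -> Tuple[str, int]:
    """Split 'XXX-1' into ('XXX', 1); the suffix is the zero-based author position.

    Assumption: the position is the part after the LAST hyphen.
    """
    base, _, position = paper_id.rpartition("-")
    return base, int(position)


def summarize(name_to_pubs: NameToPubs) -> Dict[str, Tuple[int, int]]:
    """For each author name, return (number of persons, total number of papers)."""
    return {
        name: (len(persons), sum(len(pubs) for pubs in persons.values()))
        for name, persons in name_to_pubs.items()
    }


# Toy data in the same shape as the real files (IDs are invented).
toy: NameToPubs = {
    "li_ming": {
        "person_a": ["p1-0", "p2-3"],
        "person_b": ["p3-1"],
    }
}

for name, persons in toy.items():
    for person_id, paper_ids in persons.items():
        for paper_id in paper_ids:
            base, position = split_paper_id(paper_id)
            print(name, person_id, base, position)
```

With a real file, `load_name_to_pubs("name_to_pubs_train_500.json")` would replace the toy dictionary, using whatever path the file has on your machine.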

zzbzzb1413 commented 5 years ago

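One more question: I need to save the final disambiguation clustering results myself, right? (I saw a line in train.py that calls clustering, and I assume its output is the clustering result.) Thanks!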

zfjsail commented 5 years ago

Yes. The disambiguation results are obtained by clustering (in train.py), so you need to save them yourself.
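
One possible way to save the results is sketched below. It assumes that, for each author name, you have a list of paper IDs and one integer cluster label per paper. Whether train.py exposes its output in exactly that form is an assumption, and `group_by_cluster` and `save_clusters` are hypothetical helpers, not functions from the repository.

```python
import json
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_cluster(paper_ids: Sequence[str], labels: Sequence[int]) -> List[List[str]]:
    """Group paper IDs by cluster label; clusters are ordered by label."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for paper_id, label in zip(paper_ids, labels):
        clusters[int(label)].append(paper_id)
    return [clusters[label] for label in sorted(clusters)]


def save_clusters(results: Dict[str, List[List[str]]], path: str) -> None:
    """Write {author name: [[paper IDs of cluster 0], [cluster 1], ...]} as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)


# Toy example with invented IDs and labels.
results = {"li_ming": group_by_cluster(["p1-0", "p2-3", "p3-1"], [1, 0, 1])}
# save_clusters(results, "clusters.json")  # uncomment to write the file
```

Each inner list then corresponds to one predicted person, which mirrors the person-to-papers layout of the name_to_pubs files.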