Open csyjgu opened 3 years ago
I found two potential issues in recall_v2.py:
1. tr_corpus, tr_urls, tr_bm25_model = build_bm25_model(tr_test_corpus, 'en', -1)
This builds the Turkish model but passes 'en'. Is that intended?
2. parts = line[0].split('\001')[:3] (together with urls.append(parts[0])) seems to keep only the part before the ',' for URLs that contain a ','. For example, the original URL is https://www.ntv.com.tr/amp/galeri/yasam/berguzar-korelden-cardi-byemuhtesem-yuzyilgondermesi,zzRFnP2kSUyyy7nYHFvgcQ, but parts[0] will be https://www.ntv.com.tr/amp/galeri/yasam/berguzar-korelden-cardi-byemuhtesem-yuzyilgondermesi.
@ArvinZhuang
','. But it has an influence on the content, since the content is long and has lots of ','.
Thanks. These two problems are fixed and a new py file is attached.
No, it does not affect the contents, but it does affect the URL key you store for evaluation, and the result you put in the final result, if you know what I mean.
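To make the effect on the URL key concrete, here is a minimal sketch. It assumes the data file is read with csv.reader using the default ',' delimiter, which would explain why line[0] stops at the first comma; I have not confirmed that recall_v2.py reads the file this way. The function names and the sample title/content fields are hypothetical, only the URL comes from this thread.

```python
import csv
import io

SEP = "\x01"  # same character as the '\001' separator discussed above

URL = ("https://www.ntv.com.tr/amp/galeri/yasam/"
       "berguzar-korelden-cardi-byemuhtesem-yuzyilgondermesi,zzRFnP2kSUyyy7nYHFvgcQ")


def url_via_csv_reader(raw_line: str) -> str:
    """Assumed current behaviour: csv.reader splits on ',' first,
    so row[0] only holds the text before the first comma."""
    row = next(csv.reader(io.StringIO(raw_line)))
    parts = row[0].split(SEP)[:3]
    return parts[0]


def url_via_separator(raw_line: str) -> str:
    """Possible alternative: split the raw line on the \\001 separator only,
    so commas inside the URL or the content are left alone."""
    parts = raw_line.rstrip("\n").split(SEP, 2)
    return parts[0]


# Hypothetical record: url, title, content joined by the separator.
raw = SEP.join([URL, "some title", "some content, with commas"]) + "\n"

truncated_key = url_via_csv_reader(raw)
full_key = url_via_separator(raw)
print(truncated_key)
print(full_key)
```

If the evaluation looks results up by URL, the truncated key would not match the original URL, even though the BM25 content itself is untouched.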
@wshuai190 You can try recall_v3.zip. If the problem is still there, you can send me the code (or the line numbers) that causes the problem.
In your v3, you just disregard URLs that have a comma in them, if I understand it correctly. But why do this? Some of the information you extracted will be completely lost and will not be stored in your BM25 index.
Another thing: I wonder why, when using your v2 on the testing data, passing 'tr' in line 190
tr_corpus, tr_urls, tr_bm25_model = build_bm25_model(tr_test_corpus, 'en', -1)
results in a lower score on the leaderboard (with tr I got 0.153, with en I got 0.159). What is the purpose of the tr processing, then?
Also, I don't understand why the competition management team doesn't give us a standard evaluation method, and a standard way to extract and store the URLs in the evaluation file. I think this is an IR competition, not a format-checking competition.
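@wshuai190 1. In our tests, URLs that contain commas are read correctly. You can try running the code, and if there is a problem, send me the error message. 2. The en and tr arguments of build_bm25_model only select between two different tokenizers (WordNetLemmatizer and TurkishStemmer). Judging by the results, WordNetLemmatizer works slightly better; you can also try other tokenization algorithms.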
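As a rough illustration of point 2, the sketch below shows how a language argument can do nothing more than pick the token normalizer applied before indexing. It is not the code from recall_v2.py: the two normalizers are toy stand-ins for WordNetLemmatizer and TurkishStemmer (so it runs without extra packages), and NORMALIZERS and preprocess are hypothetical names.

```python
from typing import Callable, Dict, List


def toy_english_normalizer(token: str) -> str:
    # Stand-in only: strips a trailing plural "s". A real pipeline
    # would call a lemmatizer here instead.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def toy_turkish_normalizer(token: str) -> str:
    # Stand-in only: strips the plural suffixes "ler"/"lar". A real
    # pipeline would call a Turkish stemmer here instead.
    for suffix in ("ler", "lar"):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


# The language code only selects which normalizer is used.
NORMALIZERS: Dict[str, Callable[[str], str]] = {
    "en": toy_english_normalizer,
    "tr": toy_turkish_normalizer,
}


def preprocess(text: str, lang: str) -> List[str]:
    """Whitespace-tokenize, then normalize each token for the given language."""
    normalize = NORMALIZERS[lang]
    return [normalize(tok) for tok in text.lower().split()]


# The same Turkish text yields different index terms depending on the argument.
as_tr = preprocess("kitaplar evler", "tr")
as_en = preprocess("kitaplar evler", "en")
print(as_tr, as_en)
```

Since BM25 matches on exact terms, a different normalizer changes which query and document terms line up, which is one way a small leaderboard difference between 'en' and 'tr' could arise.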
It seems the original code cannot run directly and needs some modification. You can try this one if you are interested in game4.
recall_v2.zip