luoda888 / 2021-DIGIX-BASELINE

2021 huawei DIGIX competition baseline

game4 baseline modification #3

Open csyjgu opened 3 years ago

csyjgu commented 3 years ago

It seems the original code cannot run directly and needs some modification. You can try this one if you are interested in game4.

recall_v2.zip

ArvinZhuang commented 3 years ago

I found two potential issues in recall_v2.py

  1. Line 190, tr_corpus, tr_urls, tr_bm25_model = build_bm25_model(tr_test_corpus, 'en', -1), builds the Turkish model but passes 'en'?
  2. The way the test collection is read (parts = line[0].split('\001')[:3] and urls.append(parts[0])) seems to keep only the part of a URL before the first ',' when the URL contains one. For example, the original URL is https://www.ntv.com.tr/amp/galeri/yasam/berguzar-korelden-cardi-byemuhtesem-yuzyilgondermesi,zzRFnP2kSUyyy7nYHFvgcQ, but parts[0] will be https://www.ntv.com.tr/amp/galeri/yasam/berguzar-korelden-cardi-byemuhtesem-yuzyilgondermesi.
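
The second point can be reproduced with a small sketch. It assumes the file is read with csv.reader and its default ',' delimiter, which would explain why line[0] stops at the first comma; that is a guess about recall_v2.py, not a confirmed detail. The title and content strings and both function names are hypothetical.

```python
import csv
import io

SEP = "\x01"  # the '\001' field separator quoted in the thread

url = ("https://www.ntv.com.tr/amp/galeri/yasam/"
       "berguzar-korelden-cardi-byemuhtesem-yuzyilgondermesi,zzRFnP2kSUyyy7nYHFvgcQ")
# Hypothetical record: url, title, content joined by SEP.
raw = SEP.join([url, "hypothetical title", "hypothetical content, with commas"]) + "\n"


def read_with_default_csv(text):
    """Assumed original behaviour: csv splits on ',' before we split on SEP."""
    urls = []
    for line in csv.reader(io.StringIO(text)):
        parts = line[0].split(SEP)[:3]  # line[0] already ends at the first ','
        urls.append(parts[0])
    return urls


def read_with_sep(text):
    """Possible fix: split each raw line on SEP only, so commas are kept."""
    urls = []
    for raw_line in io.StringIO(text):
        parts = raw_line.rstrip("\n").split(SEP)[:3]
        urls.append(parts[0])
    return urls


truncated = read_with_default_csv(raw)
intact = read_with_sep(raw)
print(truncated[0])
print(intact[0])
```

Splitting on the separator alone keeps the full URL as the key, which matters when the same key is later written to the result file.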

csyjgu commented 3 years ago

@ArvinZhuang

  1. Yes.
  2. Yes. It has little influence on titles, because a title usually has no ','. But it does influence content, since content is long and contains many ','.

Thanks. These two problems are fixed and a new .py file is attached.

recall_v3.zip

wshuai190 commented 3 years ago

No, it does not affect the contents, but it does affect the URL key you store for evaluation and the results you write to the final result file, if you know what I mean.

csyjgu commented 3 years ago

@wshuai190 You can try recall_v3.zip. If the problem is still there, you can send me the code (or line numbers) that causes the problem.

wshuai190 commented 3 years ago

In your v3, if I understand it correctly, you simply discard URLs that contain a comma. Why do this? The information in those records is completely lost and never stored in your BM25 index.

I also have a question about using your v2 on the test data. Why does passing 'tr' in line 190 (tr_corpus, tr_urls, tr_bm25_model = build_bm25_model(tr_test_corpus, 'en', -1)) result in a lower leaderboard score (0.153 with tr, 0.159 with en)? What is the purpose of the tr processing, then?

Finally, I don't understand why the competition organizers don't give us a standard evaluation method, or a standard way to extract and store URLs in the evaluation file. I think this is an IR competition, not a format-checking competition.

chunlin93 commented 3 years ago

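@wshuai190

  1. In our tests, URLs containing commas are read correctly. You can try running the code, and if there is a problem, send over the error message.
  2. The en and tr arguments of build_bm25_model only select between two different word normalizers (WordNetLemmatizer and TurkishStemmer). Judging from the results, WordNetLemmatizer works slightly better. You can also try other tokenization or stemming algorithms.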
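
To illustrate how a language argument can do nothing more than pick a normalizer, here is a minimal, self-contained sketch. It is not the code in recall_v3.py: strip_plural and strip_turkish_suffix are toy stand-ins for WordNetLemmatizer and TurkishStemmer, and TinyBM25 and build_index_sketch are hypothetical names for a simplified BM25 scorer.

```python
import math
from collections import Counter


def strip_plural(token):
    """Toy stand-in for an English lemmatizer (hypothetical)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def strip_turkish_suffix(token):
    """Toy stand-in for a Turkish stemmer (hypothetical)."""
    for suffix in ("lar", "ler"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# The language code only decides which normalizer is applied to each token.
NORMALIZERS = {"en": strip_plural, "tr": strip_turkish_suffix}


def tokenize(text, lang):
    normalize = NORMALIZERS[lang]
    return [normalize(tok) for tok in text.lower().split()]


class TinyBM25:
    """Simplified BM25 scorer over pre-tokenized documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs, self.k1, self.b = docs, k1, b
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        self.tfs = [Counter(d) for d in docs]
        df = Counter(t for d in docs for t in set(d))
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in query:
            f = tf.get(t, 0)
            if f:
                norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
                total += self.idf[t] * f * (self.k1 + 1) / norm
        return total


def build_index_sketch(corpus, lang):
    docs = [tokenize(text, lang) for text in corpus]
    return TinyBM25(docs)


corpus = ["kitaplar masada", "kediler bahçede"]
tr_score = build_index_sketch(corpus, "tr").score(tokenize("kitap", "tr"), 0)
en_score = build_index_sketch(corpus, "en").score(tokenize("kitap", "en"), 0)
print(tr_score, en_score)
```

In this toy case the Turkish normalizer lets the query "kitap" match "kitaplar" while the English one does not. On real data the effect can go either way, as the leaderboard scores reported above suggest.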