ucasir / NPRF

NPRF: A Neural Pseudo Relevance Feedback Framework for Ad-hoc Information Retrieval
Apache License 2.0

How can I edit config file to run training? #8

Closed: giangnguyen2412 closed this 5 years ago

giangnguyen2412 commented 5 years ago

Hi again,

I am running the training command: python nprf_drmm.py --fold 5 1. I assume I then need to modify the config file, model/nprf_drmm_config.py, but I don't know how to configure it. I changed the variable parent_path to parent_path = '/home/dexter/NPRF/model' (home directory), but it doesn't work.

Could you please help me out?

Thanks.

ucasir commented 5 years ago

The "parent_path" is the parent path for your data, not the home directory. You can ignore it anyway, as long as your data path configuration is right. Did you prepare the data for training? If not, please find the related functions in the "utils" folder and generate the data first.

get the necessary relevance file from TREC result and qrels file

relevance_params = {
    'res_file': '',
    'qrels_file': '',
    'docnolist_file': '',
    'output_file': ''
}
create_relevance(**relevance_params)
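To make the inputs of this step concrete, here is a minimal sketch of reading a TREC result (run) file and a qrels file and joining them. It assumes the standard TREC layouts (run: qid Q0 docno rank score tag; qrels: qid iteration docno relevance). The function names and the sample lines are hypothetical and this is not the repo's create_relevance implementation, whose output format is not shown in this thread.

```python
from collections import defaultdict


def read_trec_run(lines):
    """Parse TREC run lines of the form: qid Q0 docno rank score tag."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _, docno, rank, score, _ = parts
        run[qid].append((int(rank), docno, float(score)))
    # Order each query's documents by rank.
    return {qid: sorted(docs) for qid, docs in run.items()}


def read_qrels(lines):
    """Parse qrels lines of the form: qid iteration docno relevance."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        qid, _, docno, rel = parts
        qrels[qid][docno] = int(rel)
    return dict(qrels)


def join_run_with_qrels(run, qrels):
    """Attach a relevance label to each retrieved document (0 if unjudged)."""
    joined = {}
    for qid, docs in run.items():
        judged = qrels.get(qid, {})
        joined[qid] = [(docno, score, judged.get(docno, 0))
                       for _, docno, score in docs]
    return joined


# Made-up sample lines, for illustration only.
run_lines = [
    "301 Q0 DOC-A 1 12.5 bm25",
    "301 Q0 DOC-B 2 11.0 bm25",
]
qrel_lines = ["301 0 DOC-A 1"]
relevance = join_run_with_qrels(read_trec_run(run_lines), read_qrels(qrel_lines))
```

Treating unjudged documents as non-relevant is a common convention, but it is a choice made here for the sketch.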

get the topk tf-idf terms for each document

topk_params = {
    'df_file': '',      # the data format of each line is: term \t df \t cf (cf is not used)
    'corpus_file': '',  # the data format of each line is: docno \t doclen \t term1 term2 ...
    'output_file': '',
    'nb_docs': ,
    'topk': ,
}
topk_term(**topk_params)
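The two file formats described in the comments above are enough to sketch what selecting the top-k tf-idf terms of a document could look like. The helper names, the sample data, and the idf variant log(nb_docs / df) are assumptions for illustration; the repo's topk_term may weight terms differently.

```python
import math
from collections import Counter


def load_df(lines):
    """Each line: term \t df \t cf (cf is read but not used)."""
    df = {}
    for line in lines:
        term, doc_freq, _cf = line.rstrip("\n").split("\t")
        df[term] = int(doc_freq)
    return df


def topk_tfidf_terms(corpus_line, df, nb_docs, topk):
    """Each corpus line: docno \t doclen \t term1 term2 ..."""
    docno, _doclen, text = corpus_line.rstrip("\n").split("\t")
    tf = Counter(text.split())
    scores = {}
    for term, freq in tf.items():
        if term not in df:
            continue  # ignore terms with no document frequency
        idf = math.log(nb_docs / df[term])  # assumed idf variant
        scores[term] = freq * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return docno, ranked[:topk]


# Made-up sample data, for illustration only.
df = load_df(["neural\t10\t50", "the\t1000\t99999", "feedback\t100\t300"])
docno, terms = topk_tfidf_terms("D1\t5\tthe neural the feedback the", df,
                                nb_docs=1000, topk=2)
```

In this sample, "the" occurs in every document, so its idf is zero and it drops out of the top two despite having the highest term frequency.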

get the idf for topk terms

doc_idf_params = {
    'relevance_file': '',
    'df_file': '',
    'document_file': '',  # the output file from above function
    'output_file': 'topk.idf.pkl',
    'rerank_topk': 60,
    'doc_topk_term': 30,
    'nb_doc': 25205179
}
parse_idf_for_document(**doc_idf_params)

get similarity matrix, kernel and histogram features

kernel_mu_list = kernal_mus(11, True)
kernel_sigma_list = kernel_sigmas(11, 0.5, True)
sim_params = {
    'relevance_file': '',
    'topic_file': '',
    'corpus_file': '',
    'topk_corpus_file': '',
    'embedding_file': '',
    'stop_file': '',  # not used actually
    'sim_output_path': '',
    'kernel_output_path': '',
    'kernel_mu_list': kernel_mu_list,
    'kernel_sigma_list': kernel_sigma_list,
    'topk_supervised': 40,
    'd2d': True,
    'test': False
}

hist_params = {
    'relevance_file': '',
    'text_max_len': ,
    'hist_size': ,
    'sim_path': ,
    'hist_path': ,
    'd2d': True
}

sim_mat_and_kernel_d2d(sim_params)
hist_d2d(hist_params)
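For readers unfamiliar with these features, the sketch below shows one common way to turn a row of cosine similarities (one term against every term of the other text) into a matching histogram and into RBF kernel features. The binning rule, the log transform, the clamp value, and the example mu/sigma values are assumptions chosen for illustration; they are not taken from sim_mat_and_kernel_d2d, hist_d2d, kernal_mus, or kernel_sigmas.

```python
import math


def histogram_counts(sim_row, hist_size):
    """Count similarities in [-1, 1] per bin; the last bin holds sim == 1."""
    counts = [0] * hist_size
    for sim in sim_row:
        sim = max(-1.0, min(1.0, sim))  # guard against rounding drift
        bin_id = int((sim + 1.0) / 2.0 * (hist_size - 1))
        counts[bin_id] += 1
    return counts


def log_histogram(counts):
    """Log-scale the counts so that large bins do not dominate."""
    return [math.log10(c + 1) for c in counts]


def kernel_features(sim_row, mus, sigmas):
    """Soft counts: one Gaussian (RBF) kernel per (mu, sigma) pair."""
    feats = []
    for mu, sigma in zip(mus, sigmas):
        total = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2))
                    for s in sim_row)
        feats.append(math.log(max(total, 1e-10)))  # clamp to avoid log(0)
    return feats


# Made-up similarity values, for illustration only.
counts = histogram_counts([1.0, 0.0, -1.0, 1.0], hist_size=5)
hist = log_histogram(counts)
feats = kernel_features([1.0], mus=[1.0, 0.0], sigmas=[0.1, 0.1])
```

The histogram uses hard bins, while the kernels give a smooth version of the same idea: each kernel measures how many similarities fall near its mu, with sigma controlling the width.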

giangnguyen2412 commented 5 years ago

Excuse me! Could you please explain in detail how to prepare the data for training? The README.MD only mentions the utils.py file and some functions. I don't think I, or others, can follow that well enough to get training running for your model.

I think your instructions will really help.

Thank you.

giangnguyen2412 commented 5 years ago

Do you mean data preparation like this? https://github.com/NTMC-Community/MatchZoo

ucasir commented 5 years ago

Two quick questions: Do you have the TREC robust04 or disk12 data? Have you retrieved a result file for those queries and downloaded the qrels files from the TREC website?

giangnguyen2412 commented 5 years ago

1) No, I don't have them, so do I need to download them? 2) No, I did not. It's my first time running an IR model.

Sorry if these are silly questions.

ucasir commented 5 years ago

Please get familiar with IR first, e.g. indexing, retrieval, and evaluation. Running some traditional (non-neural) IR experiments will also benefit you. This repo is not meant for learning IR from scratch.

giangnguyen2412 commented 5 years ago

OK, I will try running it again and ask you later. Thanks for your help.