imranraad07 / BugReportQA


Compute OB when creating dataset #47

Closed by aciborowska 4 years ago

aciborowska commented 4 years ago

Computing OB between posts and answers is a time-consuming process. It should be done only once, when the dataset is created, to avoid repeating that step every time we run the model.

To do:

  1. Move the OB computation to a separate script that takes post_data.tsv and qa_data.tsv as input and saves the output to utility.tsv (to avoid modifying the scripts that prepare data for Lucene); a sketch of such a script follows this list.
  2. Build a new dataset with OB.
  3. Modify the dataset-building step for the evpi model to read the precomputed utilities from file.
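
A minimal sketch of what such a standalone script could look like, assuming pandas is available, that both TSV files share a `postid` key, and that they carry `post` and `answer` text columns; the `ob_score` function is a stand-in placeholder, not the project's actual OB computation:

```python
# Hypothetical sketch of the standalone OB/utility precomputation step.
# Column names (postid, post, answer) and the scoring function are
# assumptions; the real OB computation is project-specific.
import argparse
import pandas as pd


def ob_score(post_text: str, answer_text: str) -> float:
    """Placeholder for the actual OB utility between a post and an answer;
    replace with the project's real computation."""
    post_tokens = set(str(post_text).lower().split())
    answer_tokens = set(str(answer_text).lower().split())
    if not post_tokens or not answer_tokens:
        return 0.0
    return len(post_tokens & answer_tokens) / len(post_tokens | answer_tokens)


def main() -> None:
    parser = argparse.ArgumentParser(description="Precompute OB utilities once.")
    parser.add_argument("--post_data", default="post_data.tsv")
    parser.add_argument("--qa_data", default="qa_data.tsv")
    parser.add_argument("--output", default="utility.tsv")
    args = parser.parse_args()

    posts = pd.read_csv(args.post_data, sep="\t")
    qa = pd.read_csv(args.qa_data, sep="\t")

    # Join answers to their posts (assumed shared key: postid).
    merged = qa.merge(posts[["postid", "post"]], on="postid", how="inner")
    merged["utility"] = [
        ob_score(p, a) for p, a in zip(merged["post"], merged["answer"])
    ]

    # Persist the scores so the model never has to recompute them.
    merged[["postid", "utility"]].to_csv(args.output, sep="\t", index=False)


if __name__ == "__main__":
    main()
```

The point of the design is simply that the expensive pairwise scoring runs once at dataset-creation time, and every later training run only reads the resulting TSV.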
aciborowska commented 4 years ago

Done.

  1. future/src/data_generation/data_generator.py now contains only the code that processes the GitHub dataset; Rao's old code has been removed.
  2. future/src/data_generation/data_generator.py invokes compute_ob.py to produce utility_data.tsv.
  3. utility_data.tsv is used as the input for evpi (see the sketch after this list). All scripts have been updated with the new parameters.
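
A small sketch of how the evpi data preparation might consume the precomputed scores instead of recomputing OB; the `postid` and `utility` column names are the same assumptions as in the sketch above, not confirmed details of the repository:

```python
# Hypothetical sketch: load precomputed utilities for the evpi dataset builder.
import csv


def load_utilities(path: str = "utility_data.tsv") -> dict:
    """Map each post id to its precomputed utility score."""
    utilities = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            utilities[row["postid"]] = float(row["utility"])
    return utilities


# During dataset construction, look the value up instead of recomputing it:
# utility = utilities.get(post_id, 0.0)
```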

I uploaded a dummy 1K dataset to Google Drive and am still running the script to generate github_20K. Once it's done, I'll upload the bigger dataset.