Closed damevski closed 4 years ago
Currently, join_dataset.py uses all files in the given directory whose names match the "github_data_20*.csv" pattern to build the dataset, so as long as the CSV with edit data is unzipped into that directory, it will be added to the dataset.
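For reference, a minimal sketch of that kind of file discovery, assuming the pattern is treated as a shell-style glob rather than a true regex. The function name `find_dataset_files` and its parameters are hypothetical and not taken from join_dataset.py.

```python
import glob
import os


def find_dataset_files(data_dir, pattern="github_data_20*.csv"):
    """Return sorted paths of files in data_dir whose names match the glob pattern.

    Hypothetical helper for illustration; the real script may do this differently.
    """
    return sorted(glob.glob(os.path.join(data_dir, pattern)))
```

Any unzipped CSV whose name starts with "github_data_20" would be picked up this way, while files that are still zipped or named differently would be skipped.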
One question: what do we want to do if a post has been answered in a comment and also has edit data? I think we trust the edit data more, so I vote for keeping <post, question, answer_edit> and rejecting <post, question, answer_comment>.
Answer: if both are present, we will keep the edit and drop the comment-based answer.
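The rule above could be sketched roughly as follows. This assumes each record is a dict with a `post` key identifying the post, plus `question` and `answer` fields. These field names, the added `source` field, and the function `merge_answers` are illustrative assumptions, not the actual schema or code.

```python
def merge_answers(edit_rows, comment_rows):
    """Combine edit-based and comment-based records, preferring edits.

    Hypothetical sketch: rows are dicts keyed by an assumed "post" identifier.
    """
    merged = {}
    # Start with comment-based answers.
    for row in comment_rows:
        merged[row["post"]] = {**row, "source": "comment"}
    # Edit-based answers overwrite any comment-based answer for the same post.
    for row in edit_rows:
        merged[row["post"]] = {**row, "source": "edit"}
    return list(merged.values())
```

Posts that appear only in the comment data are kept unchanged, so the result covers both sources with at most one answer per post.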
Combine the data we got for edited posts with the data based on the comments into a complete dataset.