Closed tarakc02 closed 2 years ago
Ideally, the train/test split should separate on matched officer id, so officers mentioned in articles in the test set should not be mentioned in any of the articles in the train set. The motivation is that we would like no overlap between incidents in train and test, but since we don't have distinct incidents identified, we use officer id as a stand-in. This means that when multiple officers are mentioned in the same story, all of the mentioned officers should be part of the same split (either train or test).
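To make the constraint concrete, here is a rough sketch of one way to do it, not necessarily how the import task does it. It treats articles that share any officer id as one connected group (via union-find) and assigns whole groups to train or test. The function name `split_by_officer`, the input shape (article id mapped to a set of officer ids), and the `test_frac`/`seed` parameters are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_officer(article_officers, test_frac=0.2, seed=0):
    """Split article ids so that no officer id appears in both splits.

    article_officers: dict mapping article id -> set of matched officer ids
    (hypothetical input shape). Returns (train_ids, test_ids).
    """
    parent = {article: article for article in article_officers}

    def find(article):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[article] != article:
            parent[article] = parent[parent[article]]
            article = parent[article]
        return article

    # Merge any two articles that mention the same officer.
    first_article_for = {}
    for article, officers in article_officers.items():
        for officer in officers:
            if officer in first_article_for:
                parent[find(article)] = find(first_article_for[officer])
            else:
                first_article_for[officer] = article

    # Collect the connected groups of articles.
    groups = defaultdict(list)
    for article in article_officers:
        groups[find(article)].append(article)

    components = list(groups.values())
    random.Random(seed).shuffle(components)

    # Greedily fill the test set with whole groups, up to the target size.
    target = test_frac * len(article_officers)
    train, test = [], []
    for component in components:
        if len(test) + len(component) <= target:
            test.extend(component)
        else:
            train.extend(component)
    return train, test


# Toy data: a1 and a2 share o2, a3 and a4 share o3, so each pair stays together.
articles = {
    "a1": {"o1", "o2"},
    "a2": {"o2"},
    "a3": {"o3"},
    "a4": {"o3", "o4"},
    "a5": {"o5"},
    "a6": set(),
}
train_ids, test_ids = split_by_officer(articles, test_frac=0.4, seed=1)
```

Because groups are assigned whole, the test set can end up smaller than `test_frac` when a few officers link many articles together, so the realized split sizes are worth checking.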
@tarakc02 @ayyubibrahimi Okay! We've got a working import script, and you should now be able to do a git pull and build the train-test data.
I have a few more tweaks planned and I'm also adding notes about the task to the README for record-keeping, because there are a couple of things worth highlighting and some oddities I'm following up on now. Notes should be done by end of day tomorrow and tweaks should be done by the end of this week. Otherwise, let me know if there are any questions or issues with the script!
Thanks @baileyb0t! @ayyubibrahimi, let us know here if you have any issues when you run make on the import task.
Awesome. Thanks!
Up and running. Thanks @baileyb0t
@tarakc02 @ayyubibrahimi I finished adding notes to the README. I've still got a few tweaks to the script planned and I think there are a few things worth discussing at some point, namely what is described in depth under "Comments/Questions" in the README, but nothing that can't wait until we have the baseline model up.
@tarakc02 @baileyb0t
Baseline model is live. Please let me know if you have any issues running make.
Thanks @ayyubibrahimi, will check it out as soon as I have a free moment. Going to close this issue now, we can create new ones to address some of the specific questions in @baileyb0t's note.
Take the snapshot of data from basedash and assemble the pieces so that we have id, text, and label ("relevant" or "not relevant") in a single table.
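As a minimal sketch of what that assembly could look like, assuming the snapshot can be read into two lookups keyed by article id (one for text, one for label). The names `assemble_table`, `texts`, and `labels` are hypothetical and do not reflect the actual basedash export layout.

```python
VALID_LABELS = {"relevant", "not relevant"}


def assemble_table(texts, labels):
    """Join text and label on article id into rows of {id, text, label}.

    texts: dict of article id -> article text (assumed shape)
    labels: dict of article id -> "relevant" or "not relevant" (assumed shape)
    Articles with a missing or unrecognized label are skipped.
    """
    rows = []
    for article_id, text in texts.items():
        label = labels.get(article_id)
        if label not in VALID_LABELS:
            continue
        rows.append({"id": article_id, "text": text, "label": label})
    return rows


# Toy inputs: article 3 has no label, so it is left out of the table.
texts = {1: "Officer named in lawsuit.", 2: "City council meets.", 3: "Weather report."}
labels = {1: "relevant", 2: "not relevant"}
table = assemble_table(texts, labels)
```

Skipping unlabeled rows is one possible choice; keeping them with an empty label would be equally reasonable if they are needed later.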