ipno-llead / extraction

Extraction repo for the Innocence Project New Orleans' Louisiana Law Enforcement Accountability Database

create training data #24

Closed tarakc02 closed 2 years ago

tarakc02 commented 2 years ago

Take the snapshot of data from basedash and assemble the pieces so that we have id, text, and label ("relevant" or "not relevant") in a single table.
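As a rough illustration of the target shape, here is a minimal sketch of assembling one row per article. The input shapes (`articles` as a list of dicts with `id` and `text`, `relevance` as a mapping from article id to a boolean) and the function name are assumptions made for the example; the actual basedash snapshot may be laid out differently.

```python
def build_labeled_table(articles, relevance):
    """Join article text with review decisions into id/text/label rows.

    articles:  list of dicts with "id" and "text" keys (hypothetical shape)
    relevance: dict mapping article id -> True/False (hypothetical shape)
    """
    rows = []
    for article in articles:
        article_id = article["id"]
        if article_id not in relevance:
            # No review decision recorded for this article, so it has no label.
            continue
        label = "relevant" if relevance[article_id] else "not relevant"
        rows.append({"id": article_id, "text": article["text"], "label": label})
    return rows


# Toy data, invented for illustration only.
example_articles = [
    {"id": 1, "text": "first article"},
    {"id": 2, "text": "second article"},
    {"id": 3, "text": "unreviewed article"},
]
example_relevance = {1: True, 2: False}
table = build_labeled_table(example_articles, example_relevance)
```

Dropping unreviewed articles is one possible choice; they could instead be kept with an empty label if that is more useful downstream.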

tarakc02 commented 2 years ago

Ideally, the train/test split should separate on matched officer id: officers mentioned in articles in the test set should not be mentioned in any of the articles in the train set. The motivation is that we would like no overlap between incidents in train and test, but since we don't have distinct incidents identified, we use officer id as a stand-in. This means that when multiple officers are mentioned in the same story, all of the mentioned officers should be part of the same split (either train or test).
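One way to satisfy this constraint is to treat articles and officers as nodes in a graph, link each article to the officers it mentions, and assign whole connected components to a single split. The sketch below does that with a small union-find, using only the standard library. The function name, the input shape (a dict from article id to matched officer ids), and the greedy assignment are assumptions for illustration, not the repo's import script.

```python
import random
from collections import defaultdict


def grouped_split(article_officers, test_fraction=0.2, seed=0):
    """Split article ids so that no officer appears in both train and test.

    article_officers: dict mapping article id -> iterable of officer ids
    (hypothetical input shape; article ids are assumed to be sortable).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Articles that share an officer, directly or through a chain of other
    # articles, end up in the same connected component.
    for article, officers in article_officers.items():
        find(("article", article))  # register articles with no officers too
        for officer in officers:
            union(("article", article), ("officer", officer))

    components = defaultdict(list)
    for article in article_officers:
        components[find(("article", article))].append(article)

    # Sort before shuffling so the result depends only on the seed.
    groups = sorted(sorted(group) for group in components.values())
    random.Random(seed).shuffle(groups)

    # Greedily fill the test set with whole components up to the target size.
    target = test_fraction * len(article_officers)
    train, test = [], []
    for group in groups:
        if len(test) + len(group) <= target:
            test.extend(group)
        else:
            train.extend(group)
    return train, test
```

Because components are assigned whole, the test set can come out smaller than `test_fraction` when a few officers link many articles together, so it is worth checking the resulting split sizes.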

baileyb0t commented 2 years ago

@tarakc02 @ayyubibrahimi Okay! We've got a working import script, and you should now be able to do a git pull and build the train-test data.

I have a few more tweaks planned, and I'm also adding notes about the task to the README for record-keeping, because there are a couple of things worth highlighting and some oddities I'm following up on now. Notes should be done by end of day tomorrow and tweaks should be done by the end of this week. Otherwise, let me know if there are any questions/issues with the script!

tarakc02 commented 2 years ago

Thanks @baileyb0t! @ayyubibrahimi, let us know here if you have any issues when you run make on the import task.

ayyubibrahimi commented 2 years ago

Awesome. Thanks!

ayyubibrahimi commented 2 years ago

Up and running. Thanks @baileyb0t

baileyb0t commented 2 years ago

@tarakc02 @ayyubibrahimi I finished adding notes to the README. I've still got a few tweaks to the script planned, and I think there are a few things worth discussing at some point, namely what is described in depth under "Comments/Questions" in the README, but nothing that can't wait until we have the baseline model up.

ayyubibrahimi commented 2 years ago

@tarakc02 @baileyb0t Baseline model is live. Please let me know if you have any issues running make.

tarakc02 commented 2 years ago

Thanks @ayyubibrahimi, will check it out as soon as I have a free moment. Going to close this issue now, we can create new ones to address some of the specific questions in @baileyb0t's note.