ipno-llead / extraction

Extraction repo for the Innocence Project New Orleans' Louisiana Law Enforcement Accountability Database

text of all articles #27

Closed tarakc02 closed 2 years ago

tarakc02 commented 2 years ago

Ayyub's model code can optionally pre-train a language model on a larger corpus first. For the time being, we can use train-test.parquet for that step, but this would be a good use of the full news data set (as @ayyubibrahimi comments in his script). The output should have one row per article id and just three columns: article_id, title, content. This can happen in its own script, but should still be part of the news-classification/import task.
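A rough sketch of what that import step might look like, assuming pandas and a merged articles table that already carries `article_id`, `title` and `content`. The function name `build_pretraining_table` and the toy `merged` frame are made up for illustration and are not taken from the repo.

```python
import pandas as pd


def build_pretraining_table(articles: pd.DataFrame) -> pd.DataFrame:
    """Reduce a merged articles table to one row per article_id (hypothetical helper)."""
    cols = ["article_id", "title", "content"]
    out = (
        articles.loc[:, cols]
        # keep the first row seen for each article id
        .drop_duplicates(subset="article_id", keep="first")
        .reset_index(drop=True)
    )
    # sanity check: every article id appears exactly once
    assert out["article_id"].is_unique
    return out


# Toy stand-in for the merged database; the real column set is an assumption.
merged = pd.DataFrame(
    {
        "article_id": [1, 1, 2, 3],
        "title": ["a", "a", "b", "c"],
        "content": ["text a", "text a", "text b", "text c"],
        "kw_match": [True, True, False, True],
    }
)
news = build_pretraining_table(merged)
# Writing the result would be something like the line below
# (needs a parquet engine such as pyarrow installed):
# news.to_parquet("news.parquet", index=False)
```

Dropping duplicates on `article_id` alone is a design choice: it guarantees one row per id even if the title or content differs slightly between duplicate rows.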

tarakc02 commented 2 years ago

Hmm, maybe it should only include articles that passed through the keyword filter, rather than all articles. This would still give a decent amount of content.

baileyb0t commented 2 years ago

Okay, I added "news.parquet" to the import task.

At the moment, it captures all 30k articles in the merged database, but there are a few ways we could build this pre-training data:

  1. All articles (current)
  2. All kw_match articles
  3. A subset of kw_match articles that are not in train-test data
  4. A random sample

In the event we choose a subset or random sample, the log can capture counts of kw_match, relevant articles and overlap with train-test.
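The four options above could be sketched roughly as follows, assuming pandas, a boolean `kw_match` column, and a list of article ids already in the train-test data. The function `select_articles`, the strategy names, and the toy data are hypothetical. The count of relevant articles is left out because it would need a label column that this sketch does not assume.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("import")


def select_articles(articles, train_test_ids, strategy="all", n=None, seed=0):
    """Pick the pre-training articles and report counts (hypothetical helper)."""
    in_train_test = articles["article_id"].isin(train_test_ids)
    if strategy == "all":
        chosen = articles
    elif strategy == "kw_match":
        chosen = articles[articles["kw_match"]]
    elif strategy == "kw_match_not_train_test":
        chosen = articles[articles["kw_match"] & ~in_train_test]
    elif strategy == "random":
        # fixed seed so the sample is reproducible between runs
        chosen = articles.sample(n=n, random_state=seed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    counts = {
        "total": len(articles),
        "selected": len(chosen),
        "selected_kw_match": int(chosen["kw_match"].sum()),
        "selected_in_train_test": int(chosen["article_id"].isin(train_test_ids).sum()),
    }
    log.info("strategy=%s counts=%s", strategy, counts)
    return chosen.reset_index(drop=True), counts


# Toy data; the real table has roughly 30k rows.
articles = pd.DataFrame(
    {
        "article_id": [1, 2, 3, 4, 5],
        "kw_match": [True, False, True, True, False],
    }
)
train_test_ids = [1, 2]

all_rows, all_counts = select_articles(articles, train_test_ids, "all")
kw_rows, kw_counts = select_articles(articles, train_test_ids, "kw_match")
new_rows, new_counts = select_articles(articles, train_test_ids, "kw_match_not_train_test")
rand_rows, rand_counts = select_articles(articles, train_test_ids, "random", n=2)
```

Returning the counts alongside the selection keeps the logging decision separate from the filtering, so the same numbers can be written to the task log whichever option is chosen.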

tarakc02 commented 2 years ago

awesome! I keep going back and forth, but I think it'll be interesting to do the pre-training on the full set of articles (current).