Hmm: maybe it should only include articles that passed through the keyword filter, rather than all articles. That would still give a decent amount of content.
Okay, I added "news.parquet" to the import task.
At the moment, it captures all 30k articles in the merged database, but there are a few ways we could build this pre-training data: keep the full set, restrict to keyword-matched articles, or take a random sample. In the event we choose a subset or random sample, the log can capture counts of kw_match articles, relevant articles, and overlap with train-test (see the sketch below).
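A minimal sketch of what that logging could look like, assuming pandas, an input file named merged.parquet, and boolean kw_match / relevant columns (the paths and column names here are assumptions, not the repo's actual schema):

```python
import pandas as pd

# assumed inputs: the merged article table and the train-test split
articles = pd.read_parquet("merged.parquet")
train_test = pd.read_parquet("train-test.parquet")

# optional: restrict to keyword-matched articles before any sampling
subset = articles[articles["kw_match"]]

# log the counts that would let us sanity-check a subset or sample
print("kw_match:", int(subset["kw_match"].sum()))
print("relevant:", int(subset["relevant"].sum()))
print("overlap with train-test:",
      int(subset["article_id"].isin(train_test["article_id"]).sum()))
```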
Awesome! I keep going back and forth, but I think it'll be interesting to do the pre-training on the full set of articles (the current behavior).
Ayyub's model code also includes, optionally, the ability to first pre-train a language model on a larger corpus of data. For the time being, we can use train-test.parquet for that step, but this would be a good use of the full news data set (as @ayyubibrahimi comments in his script), with one row per article_id and just three columns: article_id, title, content. This can happen in its own script, but it should still live in the news-classification/import task.
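A sketch of what that import script could look like, again assuming pandas and a hypothetical merged.parquet as the merged database (the real input path may differ):

```python
import pandas as pd

# assumed input: the merged database of ~30k articles as a parquet file
articles = pd.read_parquet("merged.parquet")

# one row per article_id, keeping just the three columns the
# pre-training step needs
news = (
    articles[["article_id", "title", "content"]]
    .drop_duplicates(subset="article_id")
)

news.to_parquet("news.parquet", index=False)
```

The drop_duplicates on article_id enforces the one-row-per-article requirement even if the merge produced repeated ids.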