Closed by baileyb0t 2 years ago
['article_id', 'title', 'content']
df.sample()
thank you! re your note: yes, this is what we want (all text, including the articles in train and test). since we do not use labels during pre-training, we're not worried about information leakage from test.
currently uses n=news.shape[0], so all of the unique articles in the merged database are written, including those in the train-test data
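
a minimal sketch of why n=news.shape[0] writes everything, assuming the sampling is done with pandas `DataFrame.sample` (default `replace=False`). the toy `news` frame, its values, and `random_state=0` are made up for illustration; only the three column names come from this thread.

```python
import pandas as pd

# Hypothetical stand-in for the merged article table; only the column
# names are taken from the thread, the rows are invented.
news = pd.DataFrame({
    "article_id": [101, 102, 103, 104],
    "title": ["a", "b", "c", "d"],
    "content": ["text a", "text b", "text c", "text d"],
})

# Sampling without replacement with n equal to the number of rows
# returns every row exactly once, just in shuffled order.
out = news[["article_id", "title", "content"]].sample(
    n=news.shape[0], random_state=0
)

# Every unique article ends up in the output, train/test ones included.
all_written = set(out["article_id"]) == set(news["article_id"])
print(all_written, len(out))
```

so with this setting the sample is just a shuffle of the full table; a smaller n would be needed to actually hold articles out.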