ipno-llead / extraction

Extraction repo for the Innocence Project New Orleans' Louisiana Law Enforcement Accountability Database

updated to write pre-training data as news.parquet #33

Closed baileyb0t closed 2 years ago

baileyb0t commented 2 years ago

"import/makefile" actually writes news.parquet
"import/src/import.py"
- creates the news dataframe with columns ['article_id', 'title', 'content']
- drops duplicate article_ids, shuffles the rows, and writes the file
- NOTE: df.sample() currently uses n=news.shape[0], so all of the unique articles in the merged database are written, including those in the train-test data
"README.md" updated to reflect changes in output written

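For readers following the description above, here is a minimal sketch of the dedupe-and-shuffle step using pandas. It is not the code in "import/src/import.py": the function names `build_news` and `write_news`, the `seed` parameter, the `merged` variable, the `label` column, and the sample rows are all assumptions made for illustration. Only the three output columns and the `df.sample(n=news.shape[0])` call come from the notes above.

```python
import pandas as pd


def build_news(merged: pd.DataFrame, seed=None) -> pd.DataFrame:
    """Select the text columns, drop repeated article_ids, and shuffle the rows."""
    news = merged[["article_id", "title", "content"]]
    news = news.drop_duplicates(subset="article_id")
    # Sampling n == number of rows without replacement keeps every row,
    # just in a random order (a full shuffle rather than a subsample).
    news = news.sample(n=news.shape[0], random_state=seed)
    return news.reset_index(drop=True)


def write_news(news: pd.DataFrame, path: str = "news.parquet") -> None:
    """Write the frame to parquet (requires pyarrow or fastparquet)."""
    news.to_parquet(path, index=False)


# Hypothetical stand-in for the merged database; article_id 2 appears twice.
merged = pd.DataFrame(
    {
        "article_id": [1, 2, 2, 3],
        "title": ["a", "b", "b", "c"],
        "content": ["text a", "text b", "text b", "text c"],
        "label": [0, 1, 1, 0],  # assumed label column, excluded from the output
    }
)

news = build_news(merged, seed=0)
```

Calling `write_news(news)` would then produce news.parquet. Because `n` equals the row count, no article is left out, which matches the NOTE about train-test articles being included.
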
tarakc02 commented 2 years ago

thank you! re your note: yes, this is what we want (all text, including those in train and test). since we do not use labels during pre-training, we're not worried about information leakage from test.