content_sharing_network, references, and semantic_shift.

Set up the environment using Python VirtualEnv. From the root directory, run:
python -m venv venv/
Activate the environment just created:
source venv/bin/activate
Install the dependencies:
pip install -r requirements.txt
The corpus used in this work was a combination of the NELA-GT-2020 and NELA-GT-2021 datasets, which can be downloaded in SQLite and JSON formats.
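As a minimal sketch of working with the downloaded data (assuming the standard NELA-GT SQLite schema, in which articles live in a newsdata table with source, title, and content columns), the database can be inspected like this:

import sqlite3

# Connect to the downloaded NELA-GT database (file name is an example).
conn = sqlite3.connect("nela-gt-2020.db")
# Print the source and title of a few articles.
for source, title in conn.execute("SELECT source, title FROM newsdata LIMIT 5"):
    print(source, "-", title)
conn.close()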
We filtered the corpus to retain only articles related to COVID-19 using a keyword-matching procedure; the keyword list is in data/CDC+COVID_vocab.txt. We selected every article whose title or content matched at least one keyword from the list.
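A minimal sketch of this filtering step (the lowercase substring match is an illustrative assumption, not necessarily the exact procedure used by the preprocessing scripts):

# Load the COVID-19 keyword list (one keyword per line).
with open("data/CDC+COVID_vocab.txt") as f:
    keywords = [line.strip().lower() for line in f if line.strip()]

def is_covid_related(title, content):
    # Keep the article if the title or the content matches any keyword.
    text = f"{title} {content}".lower()
    return any(kw in text for kw in keywords)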
The SQLite database can be converted to CSV using the script preprocessing/nela_to_csv.py.
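For reference, the core of such a conversion can be sketched with pandas (a simplified stand-in for preprocessing/nela_to_csv.py, again assuming a newsdata table):

import sqlite3
import pandas as pd

# Read the full articles table and write it out as a CSV file.
conn = sqlite3.connect("nela-gt-2020.db")
df = pd.read_sql_query("SELECT * FROM newsdata", conn)
df.to_csv("nela-gt-2020.csv", index=False)
conn.close()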
Train the source embeddings with a triplet loss, using the triplets in data/triplets, via the script ensemble/train_source_embeddings.py:
cd ensemble
python train_source_embeddings.py
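For intuition, the heart of triplet-loss training looks like the following PyTorch sketch. The embedding dimension, margin, and data handling here are illustrative assumptions; the actual training logic lives in ensemble/train_source_embeddings.py.

import torch
import torch.nn as nn

# One learnable embedding vector per news source (sizes are illustrative).
num_sources, dim = 500, 128
embeddings = nn.Embedding(num_sources, dim)
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(embeddings.parameters(), lr=1e-3)

# A triplet is (anchor, positive, negative) source indices: the loss pulls
# the anchor closer to the positive than to the negative.
anchor, positive, negative = torch.tensor([0]), torch.tensor([1]), torch.tensor([2])
loss = loss_fn(embeddings(anchor), embeddings(positive), embeddings(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()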
Pre-computed triplets are provided in data/triplets. If you want to compute your own triplets, you can do so using the following scripts (a generic sketch of triplet construction follows the list):

references/jargon_triplets.py
references/stance_triplets.py
content_sharing_network/csn_features.py
semantic_shift/semantic_shift_triplets.py
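Each of these scripts derives triplets from a different signal. Purely as a hypothetical illustration (not the logic of any specific script above), triplets can be sampled from a pairwise source-similarity matrix:

import numpy as np

def sample_triplets(similarity, n_triplets, seed=0):
    # similarity: float matrix of shape (n_sources, n_sources).
    rng = np.random.default_rng(seed)
    n = similarity.shape[0]
    triplets = []
    for _ in range(n_triplets):
        a = int(rng.integers(n))
        sims = similarity[a].astype(float).copy()
        sims[a] = -np.inf                 # never pick the anchor as positive
        p = int(np.argmax(sims))          # most similar source -> positive
        sims[a] = np.inf                  # never pick the anchor as negative
        neg = int(np.argmin(sims))        # least similar source -> negative
        triplets.append((a, p, neg))
    return triplets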
The pre-trained source embedding models can be found in the model directory; look for files with the extension .emb.
Most experiments require the pre-trained source embedding models found in the model folder, in addition to the source labels found in the data folder.
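A hypothetical sketch of wiring these together, under two loudly flagged assumptions: that each .emb file is word2vec-style text with one "source v1 ... vd" line per news source (the format is a guess, not documented here), and that the labels live in a CSV with source and label columns (the file names below are illustrative, not the repository's actual names):

import numpy as np
import pandas as pd

# ASSUMPTION: .emb files are word2vec-style text, one "source v1 ... vd" per line.
def load_emb(path):
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) > 2:            # skip a possible "count dim" header
                vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

# ASSUMPTION: hypothetical file names and a 'source'/'label' CSV layout.
embeddings = load_emb("model/sources.emb")
labels = pd.read_csv("data/labels.csv")
labels = labels[labels["source"].isin(set(embeddings))]
X = np.stack([embeddings[s] for s in labels["source"]])
y = labels["label"].to_numpy()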