mgruppi / SciLander

Codebase for SciLander
MIT License
1 stars 3 forks source link

SciLander: Mapping the Scientific News Landscape

Repository Structure

Setup

Set up the environment using Python VirtualEnv. From the root directory, run:

python -m venv venv/

Activate the environment just created:

source venv/bin/activate

Install the dependencies:

pip install -r requirements.txt

Corpus

The corpus used in this work was a combination of the NELA-GT-2020 and NELA-GT-2021 datasets, which can be downloaded in SQLite and JSON formats.

We filtered the corpus to retrieve only articles related to COVID-19 using a keyword matching procedure. The keywords can be found in data/CDC+COVID_vocab.txt. We selected all articles for which the title OR content had a match with at least one keyword from the list.

The SQLite database can be converted into CSV by using the script preprocessing/nela_to_csv.py.

Training a triplet loss model

Train a triplet loss using the triplets in data/triplets using the escript ensemble/train_source_embeddings.py.

cd ensemble
python train_source_embeddings.py

Computing triplets

Pre-computed triplets are provided in data/triplets. If you want to compute your own triplets, you can do so using the following scripts:

Pre-trained models

The pre-trained source embeddings models can be found in directory model. Look for any file with extension .emb.

Experiments

Most experiments require the pre-trained source embedding models from found in the model folder, in addition to the source labels found in the data folder.