seraphinatarrant / embedding_bias

Repo for project on the geometry of Word Embeddings and how it influences bias downstream

Question about Data Preprocessing #7

Open kato8966 opened 1 year ago

kato8966 commented 1 year ago

The section "Data/English Coreference:" says, "The text was extracted and cleaned, to have one Wikipedia paragraph per line, then downsampled and tokenised using the NLTK tokeniser, ..." Could you tell me how the data was downsampled? I cannot find any relevant file in data_cleaning/wiki_data_cleaning.
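For reference, the downsampling step being asked about could plausibly be something as simple as random line-level sampling over the one-paragraph-per-line file. This is only a hedged sketch of that idea — the repo does not document the actual method, fraction, or seed, and `downsample_lines` is a hypothetical helper, not code from the project:

```python
import random


def downsample_lines(lines, fraction, seed=0):
    """Randomly keep roughly `fraction` of the paragraphs (one per line).

    A seeded RNG makes the sample reproducible; the actual project may
    have used a different method entirely (e.g. a fixed line count).
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


paragraphs = [f"paragraph {i}" for i in range(1000)]
sample = downsample_lines(paragraphs, 0.1)
```

Each kept line would then be passed to the NLTK tokeniser as the quoted pipeline describes.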