stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.81k stars, 1.51k forks

gloVe performances #196

Closed lfoppiano closed 2 years ago

lfoppiano commented 3 years ago

Hi all, I've been trying to train GloVe on a large dataset (1.2 TB) for a while now.

The script successfully created the vocabulary, but the co-occurrence extraction has now been running for about two months: $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE

I can see the file cooccurrence.bin growing slowly, but is it normal for this step to run for such a long time?

-rw-r--r-- 1 lfoppian0 tdm 631G Sep 8 08:08 cooccurrence.bin
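
To put a number on "growing slowly", one option is to sample the output file's size twice and divide the difference by the elapsed time. The following is only a rough sketch: growth_rate and watch_growth are hypothetical helper names, not part of GloVe, and the stat invocation assumes GNU coreutils on Linux.

```shell
#!/bin/bash
# Hypothetical helper (not part of GloVe): average growth rate in bytes/second,
# given two sizes in bytes and the seconds elapsed between the measurements.
# Uses integer division, so the result is rounded toward zero.
growth_rate() {
    local size_before=$1 size_after=$2 seconds=$3
    echo $(( (size_after - size_before) / seconds ))
}

# Hypothetical helper: sample a file's size twice, `interval` seconds apart
# (default 60), and print its growth rate in bytes/second.
watch_growth() {
    local file=$1 interval=${2:-60}
    local before after
    before=$(stat -c %s "$file")   # GNU stat; on BSD/macOS use: stat -f %z
    sleep "$interval"
    after=$(stat -c %s "$file")
    growth_rate "$before" "$after" "$interval"
}

# Example (assumes the file name used in this thread):
#   watch_growth cooccurrence.bin 60
```

A rate that stays near zero over several samples would suggest the process is stalled rather than just slow, although the final size of the file is not known in advance, so this does not give a completion estimate.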

Thank you in advance

For information, I'm attaching the modified demo.sh script. Mainly, I changed the memory to 80 GB (MEMORY=80.0) and the CPU count to 72 (NUM_THREADS=72), and modified various paths and file names:

#!/bin/bash
set -e

# Builds the programs, trains a GloVe model on $CORPUS, and then evaluates it
# with the Python eval script.

make

CORPUS=myCorpus.txt
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=80.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=72
X_MAX=10
if hash python 2>/dev/null; then
    PYTHON=python
else
    PYTHON=python3
fi

echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
echo "$ $PYTHON eval/python/evaluate.py"
$PYTHON eval/python/evaluate.py
AngledLuffa commented 3 years ago

I've never trained glove, so I can't give a definitive answer, but I inherited this like Aunt May's old antique chair which I need to do something with.

Taking forever and doing things very slowly sounds like classic thrashing behavior. Is it using a lot of swap space and/or maxed out on the 80 GB you gave it?
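
One way to check this on Linux is to look at the VmRSS (resident memory) and VmSwap (swapped-out memory) lines in the process's /proc/<pid>/status file. This is a minimal sketch, not something from the GloVe repository: mem_summary is a hypothetical helper name, and the usage lines assume a single running process named cooccur.

```shell
#!/bin/bash
# Hypothetical helper (not part of GloVe): print the resident-memory and
# swapped-out-memory lines from a Linux /proc/<pid>/status-style file.
mem_summary() {
    local status_file=$1
    grep -E '^(VmRSS|VmSwap):' "$status_file"
}

# Example, assuming exactly one running cooccur process on Linux:
#   mem_summary "/proc/$(pgrep -x cooccur)/status"
#
# System-wide view, sampled every 5 seconds; sustained non-zero values in
# the si/so columns indicate active swapping:
#   vmstat 5
```

A large VmSwap value, or steady swap-in/swap-out traffic in vmstat, would point toward thrashing; values near zero would suggest the slowdown lies elsewhere.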

lfoppiano commented 3 years ago

I understand the situation. Thanks anyway for answering. :-)

I checked and things seem fine (no swapping, some memory is still free). However, after checking, I stopped the process and restarted it with VERBOSE=0 and more memory, although I don't think memory was the issue.

lfoppiano commented 2 years ago

I can confirm that after restarting with VERBOSE=0 and more memory, the process finished in much less time.

AngledLuffa commented 2 years ago

Awesome, thanks!
