ltgoslo / factorizer

GNU General Public License v3.0

How to train #1

Status: Open. jcuenod opened this issue 9 months ago

jcuenod commented 9 months ago

Hi, thanks for sharing the code for both de/encoding and training. Could you put up a readme for training on new data? I would like to try this out on Greek and Hebrew text.

davda54 commented 9 months ago

Hi, thanks for your interest! I'm not sure if I'll have time to write a comprehensive training readme, but I'm happy to help you with training on these new languages! Please let me know here or at davisamu@ifi.uio.no if you run into any issues.

The first thing you will need is a word-frequency list for each language. The Dataset class expects a tab-separated file with words sorted by frequency, one `f"{word}\t{frequency}"` entry per line. Specifically, you should create three such files: a training word list and two validation lists.
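To illustrate the expected format, here is a minimal sketch of how such a frequency file could be produced from a raw text corpus. The corpus path, output name, and whitespace tokenization are placeholder assumptions, not anything prescribed by the repository.

```python
from collections import Counter

# Hypothetical paths; substitute your own corpus and output file names.
corpus_path = "greek_corpus.txt"
output_path = "greek_word_frequencies.tsv"

counter = Counter()
with open(corpus_path, encoding="utf-8") as f:
    for line in f:
        # Naive whitespace tokenization; swap in a proper tokenizer if needed.
        counter.update(line.split())

# Write one "word<TAB>frequency" pair per line, sorted by descending frequency.
with open(output_path, "w", encoding="utf-8") as f:
    for word, frequency in counter.most_common():
        f.write(f"{word}\t{frequency}\n")
```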

Evaluation of these models is not easy (without training an expensive language model), but the two validation files are at least somewhat useful for sanity checking the training.

jcuenod commented 9 months ago

Thanks! I'll give it a go when I have a chance, and email you if I get stuck :)

avi-jit commented 7 months ago

Hi @davda54, it appears some dependencies are missing for the following imports in vq-vae/train.py:

from lazy_adam import LazyAdamW
from random_sampler import WeightedRandomSampler
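If lazy_adam.py and random_sampler.py are simply missing from the repository, one untested stopgap is to fall back on the standard PyTorch equivalents; note that torch.optim.AdamW is not a lazy/sparse-aware optimizer, so optimization behaviour may differ from the intended LazyAdamW.

```python
# Untested substitution: standard PyTorch classes standing in for the missing
# helper modules until they are added to the repository.
from torch.optim import AdamW as LazyAdamW
from torch.utils.data import WeightedRandomSampler
```
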
avi-jit commented 7 months ago

Update: I was able to run train.py and get a model. How can we now convert this model into the .dawg file used in the example code? @davda54

avi-jit commented 4 months ago

@davda54 a reminder here a few months later: could you please help us convert the trained model into a .dawg file?

davda54 commented 3 months ago

I just saw this comment. Did you manage to get it running with the build.py file?
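For anyone following along, below is a generic, unverified sketch of the underlying idea: serializing a word-to-codes mapping into a DAWG with the pytries `dawg` package. build.py is the authoritative reference; the vocabulary, code values, and record layout here are purely illustrative and may not match the format the Factorizer loader expects.

```python
import dawg

# Illustrative only: in practice the per-word codes come from the trained VQ-VAE;
# here they are dummy byte triplets so the snippet runs on its own.
word_to_codes = {
    "λόγος": bytes([12, 200, 7]),
    "θεός": bytes([3, 41, 255]),
}

# BytesDAWG maps unicode keys to arbitrary byte payloads and supports save()/load().
trie = dawg.BytesDAWG(word_to_codes.items())
trie.save("example.dawg")

# Reading it back:
loaded = dawg.BytesDAWG()
loaded.load("example.dawg")
print(loaded["λόγος"])  # [b'\x0c\xc8\x07']
```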