LASERtrain

This package reproduces the architecture described by Artetxe and Schwenk (2018, 2019) to train language-agnostic sentence embeddings. The authors have released a large model covering 93 languages as part of the LASER project; however, the code used to train it remains unreleased. The code in this repository approximates the authors' implementation, based on the description of the architecture and training parameters provided in their recent publications.

At the moment, the models produced with this software are not compatible with the models available in the LASER project; this limitation will be tackled in the near future.

The package includes instructions to reproduce the experiments described in Artetxe and Schwenk (2019) in which a model is trained on the UN v1.0 corpus and evaluated on the data released for the BUCC'18 shared task.

Requirements

The following packages are required to run this package and reproduce the reported results:

Tutorial: train and evaluate your LASER model

In this section, we reproduce the experiments carried out by Artetxe and Schwenk (2019).

Download and prepare data

The first step is to download the datasets needed to train and evaluate our model. Two datasets are required:

For BUCC 2018, download all four training data packages and all four test data packages into the data sub-directory. Once downloaded, uncompress each package with tar, for example: tar xjf bucc2018-ru-en.test.tar.bz2
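If you would rather not run tar once per package, the extraction can be scripted. The following is a minimal sketch, not part of this repository: the function name extract_packages is hypothetical, the glob pattern assumes that every package follows the naming of the example above (bucc2018-*.tar.bz2), and it assumes the archives should be unpacked inside the data sub-directory itself.

```python
import tarfile
from pathlib import Path


def extract_packages(data_dir="data", pattern="bucc2018-*.tar.bz2"):
    """Extract every archive in data_dir whose name matches pattern.

    Hypothetical helper: both defaults are assumptions based on the
    example file name bucc2018-ru-en.test.tar.bz2.
    Returns the names of the archives that were extracted.
    """
    data_path = Path(data_dir)
    extracted = []
    for archive in sorted(data_path.glob(pattern)):
        # Mode "r:bz2" reads a bzip2-compressed tarball, like `tar xjf`.
        with tarfile.open(archive, "r:bz2") as tar:
            # Assumption: unpack next to the archives, inside data_dir.
            tar.extractall(path=data_path)
        extracted.append(archive.name)
    return extracted
```

Calling extract_packages() from the directory that contains data processes every matching archive and returns the list of file names it handled, so an empty list signals that no package was found.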

Acknowledgements

Developed by Universitat d'Alacant as part of its contribution to the GoURMET project, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299.

References