hlp-ai / mt-data

MT Data
Apache License 2.0

Align sentences of a Web site based on cross-lingual sentence embeddings #3

Open hlp-ai opened 1 year ago

hlp-ai commented 1 year ago

For sentences in a Web site, align them based on cross-lingual sentence embeddings, e.g., LaBSE. The basic steps are as follows:

  1. For pages in a Web site, segment text into sentences;
  2. Convert sentences into dense vectors using LaBSE;
  3. Find the most similar sentences across languages, i.e., parallel sentences, based on embedding vectors.
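
A minimal sketch of these three steps, under some assumptions: the sentence splitter is a naive regex (a real one would be language-aware), `embed` is a hypothetical callable that maps a list of sentences to a list of vectors (for LaBSE this could be the `encode` method of a `sentence-transformers` model), and pairs are kept only when they are mutual nearest neighbours above an arbitrary `threshold`. None of these names come from this repository.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def split_sentences(text: str) -> List[str]:
    """Step 1 (naive): split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(
    src: List[str],
    tgt: List[str],
    embed: Callable[[List[str]], List[Vector]],  # hypothetical embedding hook
    threshold: float = 0.7,  # assumed cut-off, tune on real data
) -> List[Tuple[str, str, float]]:
    """Steps 2 and 3: embed both sides, keep mutual nearest neighbours."""
    if not src or not tgt:
        return []
    src_vecs = embed(src)
    tgt_vecs = embed(tgt)
    sims = [[cosine(u, v) for v in tgt_vecs] for u in src_vecs]

    # Best target index for each source sentence, and vice versa.
    best_tgt = [max(range(len(tgt)), key=row.__getitem__) for row in sims]
    best_src = [
        max(range(len(src)), key=lambda i: sims[i][j]) for j in range(len(tgt))
    ]

    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((src[i], tgt[j], sims[i][j]))
    return pairs
```

The all-pairs similarity matrix is quadratic in the number of sentences, so this only suits a single site of moderate size; a larger crawl would need an approximate nearest-neighbour index.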
hlp-ai commented 1 year ago

See CCMatrix for better performance.
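
As far as I understand it, the main difference in CCMatrix is that candidate pairs are scored with a ratio margin instead of raw cosine similarity: the cosine of a pair is divided by the average cosine of each sentence to its k nearest neighbours on the other side, which penalises "hub" sentences that look similar to everything. A rough sketch of that scoring, assuming precomputed embedding vectors and an illustrative `k` (the function names here are my own, not from CCMatrix or this repository):

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def margin_scores(
    src_vecs: List[Vector], tgt_vecs: List[Vector], k: int = 4
) -> List[List[float]]:
    """Ratio-margin score for every (source, target) pair.

    score(x, y) = cos(x, y) / ((avg_knn(x) + avg_knn(y)) / 2)
    where avg_knn is the mean cosine to the k nearest neighbours
    on the opposite side.
    """
    sims = [[cosine(u, v) for v in tgt_vecs] for u in src_vecs]

    def mean_top_k(values: List[float]) -> float:
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top) if top else 0.0

    src_avg = [mean_top_k(row) for row in sims]
    tgt_avg = [
        mean_top_k([sims[i][j] for i in range(len(src_vecs))])
        for j in range(len(tgt_vecs))
    ]

    scores = []
    for i, row in enumerate(sims):
        out = []
        for j, s in enumerate(row):
            denom = (src_avg[i] + tgt_avg[j]) / 2
            # Guard against a zero denominator (simplification for this sketch).
            out.append(s / denom if denom else 0.0)
        scores.append(out)
    return scores
```

Pairs would then be kept when their margin score exceeds a threshold chosen on held-out data, rather than thresholding the cosine directly.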