prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit
MIT License

Alternative to installing sentencepiece #56

Open Sreyan88 opened 1 year ago

Sreyan88 commented 1 year ago

Hello!

Just wanted to know if there is an alternative to installing sentencepiece. It seems to require sudo access, which leads to the following error:

(bart_pretraining) make install
[ 10%] Built target sentencepiece_train-static
Consolidate compiler generated dependencies of target sentencepiece-static
[ 46%] Built target sentencepiece-static
Consolidate compiler generated dependencies of target sentencepiece
[ 82%] Built target sentencepiece
Consolidate compiler generated dependencies of target spm_decode
[ 84%] Built target spm_decode
Consolidate compiler generated dependencies of target sentencepiece_train
[ 93%] Built target sentencepiece_train
Consolidate compiler generated dependencies of target spm_normalize
[ 95%] Built target spm_normalize
Consolidate compiler generated dependencies of target spm_train
[ 97%] Built target spm_train
Consolidate compiler generated dependencies of target spm_export_vocab
[ 99%] Built target spm_export_vocab
Consolidate compiler generated dependencies of target spm_encode
[100%] Built target spm_encode
Install the project...
-- Install configuration: ""
CMake Error at cmake_install.cmake:46 (file):
  file cannot create directory: /usr/local/lib64/pkgconfig.  Maybe need
  administrative privileges.

when I run make -j $(nproc)

prajdabre commented 1 year ago

Since you are only fine-tuning a BART model, there is no need to train your own vocabulary, so installing the sentencepiece binaries is unnecessary.

prajdabre commented 1 year ago

However, should you need to install sentencepiece, you can install it into your ~/.local folder.

To do so, specify a custom CMake install directory, as described in https://confluence.ecmwf.int/plugins/servlet/mobile?contentId=38076656#content/view/38076656
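
Roughly, the user-local build would look like this (a sketch, untested on your machine; it assumes you run it from inside the sentencepiece source tree):

mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/.local
make -j $(nproc)
make install    # writes under ~/.local instead of /usr/local, so no sudo is needed
export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH

Going by the lib64 path in your error, the libraries may land in ~/.local/lib64 rather than ~/.local/lib on your machine, so adjust LD_LIBRARY_PATH accordingly.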

singhakr commented 1 year ago

I have a similar problem. I want to use this toolkit primarily to pre-train models from scratch, so I might need sentencepiece. The error I am getting when trying to install it is:

Target "sentencepiece_train" requires the language dialect "CXX17"

I also don't have root access, so it seems I can't install from source, even to a local folder. However, the sentencepiece Python wrapper seems to be installed. In that case, can't I simply write a Python script that calls spm_train through the wrapper, and point to that script wherever the tokenizer training script is needed, instead of calling the spm_train binary separately? I haven't yet checked whether the Python wrapper works, but I will try it.

prajdabre commented 1 year ago

Hi,

The error you are getting is related to sentencepiece_train (aka spm_train), not to YANMTT. Calling spm_train through the Python wrapper still requires the core sentencepiece library to be installed and working.

Since this is an issue related to sentencepiece and unrelated to YANMTT, I recommend asking for a solution on the sentencepiece repo.

prajdabre commented 1 year ago

You don't need root access to install to a local folder btw.

singhakr commented 1 year ago

I understand I can install into a local directory, but the error occurs during the build process. It seems to be a problem with the installed versions of the build tools.

I thought that if sentencepiece is installed via the wheel package, it might already be built and available on the system I am working on.
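
As far as I understand, the pip wheel ships a prebuilt copy of the library, so a quick sanity check like the following (just a sketch; run inside the conda environment) should tell me whether the wrapper is usable without building from source:

pip install sentencepiece   # should be a no-op if the wheel is already installed
python -c "import sentencepiece as spm; print(spm.__file__)"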

I will ask on the sentencepiece repo. Thanks anyway for replying promptly.

Initially I was trying to work with Hugging Face directly, but installing the toolkit might give me the option to play with the internals (loss, hyperparameters, or perhaps even the architecture) more easily. I am relatively new to programming DL systems, though, so it may be a bit more difficult.

prajdabre commented 1 year ago

Ahh, I have had issues with the build tools before. In such cases I do local installs of the build tools too. It's nightmarish. I don't envy you 😭
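
One workaround that has helped me is to pull the build tools into the conda environment itself instead of compiling them by hand. A rough sketch, assuming conda-forge is reachable and reusing the env name from your log:

conda install -n bart_pretraining -c conda-forge cmake compilers

The conda-forge compilers package should give a gcc/g++ recent enough for the CXX17 requirement, but I haven't tested this on your setup.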

singhakr commented 1 year ago

I have posted the issue on the sentencepiece repo and am waiting for someone to reply. I have tried the Python wrapper and it seems to be working, but there is some locale- or encoding-related problem in the output.

Specifically, I tried the code from here:

https://notebook.community/google/sentencepiece/python/sentencepiece_python_module_example

In the output, the underscore-like character is printed as a question mark. I have tried setting the encoding (in the Python code file) and the locale (in the shell) to UTF-8:

# -*- coding: utf-8 -*-

The output I am getting is:

['�This', '�is', '�a', '�t', 'est']
[212, 32, 10, 587, 446]
['_This', '_is', '_a', '_t', 'est']
_This_is_a_test
This is a test

So, I wonder whether this is simply a matter of presentation/rendering due to the way BPE works.

prajdabre commented 1 year ago

Hi,

I'm not sure which output comes from what. I will need more info to answer your question.

Regards.

singhakr commented 1 year ago

To be more exact, I used this code with only the Python wrapper installed:

#!~/yanmtt/py36/bin/python
# -*- coding: utf-8 -*-
import sentencepiece as spm

# train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

text = 'I saw a boy with a telescope'

print('Input: ')
print(text)

# encode: text => id
pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)

print('Encoded pieces: ')
print(pieces)
print('Encoded ids: ')
print(ids)

# decode: id => text
print('Decoded from pieces: ')
print(sp.decode_pieces(pieces))
print('Decoded from ids: ')
print(sp.decode_ids(ids))

And the output I get is:

Input:
I saw a boy with a telescope
Encoded pieces:
['�I', '�sa', 'w', '�a', '�bo', 'y', '�with', '�a', '�', 'te', 'le', 's', 'c', 'op', 'e']
Encoded ids:
[6, 291, 89, 10, 448, 40, 26, 10, 9, 228, 126, 8, 82, 310, 20]
Decoded from pieces:
I saw a boy with a telescope
Decoded from ids:
I saw a boy with a telescope

So, since the sentence is correctly decoded, it should be a matter of display/rendering.
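
For what it's worth, the leading symbol sentencepiece prepends is not an ASCII underscore but U+2581 (the "▁" block character), so the question marks are probably just my terminal failing to encode it. Forcing UTF-8 output from the shell is what I will try next (a sketch; the script name below is just a placeholder):

export LC_ALL=en_US.UTF-8        # or C.UTF-8, whichever locale is available
export PYTHONIOENCODING=utf-8    # forces UTF-8 on Python's stdout/stderr
python spm_example.py            # placeholder name for the script above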

prajdabre commented 1 year ago

Hi

I agree with you.