Sreyan88 opened this issue 1 year ago
Since you are only fine-tuning a BART model, there is no need to train your own vocabulary, so installing the sentencepiece binaries is unnecessary.
However, should you need to install sentencepiece, you can install it into your ~/.local folder.
To do so, specify a cmake install directory as described at https://confluence.ecmwf.int/plugins/servlet/mobile?contentId=38076656#content/view/38076656
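For reference, a build into a user-writable prefix usually looks roughly like the following (paths and repo layout are illustrative; adjust to your system):

```shell
# Build sentencepiece from source and install into ~/.local (no root needed).
git clone https://github.com/google/sentencepiece.git
cd sentencepiece && mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/.local
make -j $(nproc)
make install
# Make sure the local prefix is on the relevant search paths afterwards:
export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH
```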
I have a similar problem. I want to use this toolkit primarily to pre-train models from scratch, so I might need sentencepiece. The error I get when trying to install it is:
Target "sentencepiece_train" requires the language dialect "CXX17"
I also don't have root access, so it seems I can't install from source, even to a local folder. However, the sentencepiece Python wrapper seems to be installed. In that case, can't I simply write a Python script that calls spm_train from that version and point the toolkit at this script for tokenizer training, instead of calling spm_train separately? I haven't yet checked whether the Python wrapper is working, but I will try it.
Hi,
The error you get is related to sentencepiece_train (i.e. spm_train) and not to YANMTT. The Python wrapper calling spm_train will still need the core sentencepiece library installed and working.
Since this is an issue related to sentencepiece and unrelated to YANMTT, I recommend asking for a solution on the sentencepiece repo.
You don't need root access to install to a local folder btw.
I understand that I can install into a local directory, but the error occurs during the build process. It seems to be a problem with the installed versions of the build tools.
I thought that if sentencepiece were installed via a wheel package, it might already be built and available for the system I am working on.
I will ask on the sentencepiece repo. Thanks anyway for replying promptly.
Although initially I was trying to work via Hugging Face directly, installing the toolkit might give me the option to play with the internals (loss, hyperparameters, or even perhaps the architecture) more easily. That said, I am relatively new to programming DL systems, so it may be a bit more difficult.
Ahh I have had issues with the build tools before. In such cases I also do local installs of the build tools as well. It's nightmarish. I don't envy you 😭
I have posted the issue on the sentencepiece repo and am waiting for a reply. I have tried the Python wrapper and it seems to be working, but there is some locale- or encoding-related error in the output.
Specifically, I tried the code from here:
https://notebook.community/google/sentencepiece/python/sentencepiece_python_module_example
In the output, the underscore character is printed as a question mark. I have tried setting the encoding (in the Python source file) and the locale (in the shell) to UTF-8:
# -*- coding: utf-8 -*-
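If the problem is the terminal's locale rather than the source-file encoding, forcing UTF-8 output from Python can be sketched as follows (the exact locale name, e.g. C.UTF-8, varies per system and is an assumption here):

```shell
# Force Python's stdout/stderr encoding to UTF-8 regardless of the locale
export PYTHONIOENCODING=utf-8
# Select a UTF-8 locale for the shell session (name may differ per system)
export LC_ALL=C.UTF-8
```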
The output I am getting is:
['�This', '�is', '�a', '�t', 'est']
[212, 32, 10, 587, 446]
['_This', '_is', '_a', '_t', 'est']
_This_is_a_test
This is a test
So, I wonder whether this is simply a matter of presentation/rendering due to the way BPE works.
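The symptom above is consistent with a rendering issue: SentencePiece's word-boundary marker is the Unicode character U+2581 ("▁", LOWER ONE EIGHTH BLOCK), not an ASCII underscore, so a terminal or codec that cannot represent it falls back to "?" or the replacement character "�". A minimal stdlib-only sketch of the distinction:

```python
# The SentencePiece word-boundary marker is U+2581, which merely looks like
# an underscore. It is a distinct character from ASCII "_".
marker = "\u2581"
print(marker)                    # ▁ (if the terminal can render it)
print(marker == "_")             # False
print(ascii(marker))             # '\u2581'
# Encoding to a codec without this character reproduces the "?" symptom:
print(marker.encode("ascii", errors="replace"))  # b'?'
```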
Hi,
I'm not sure which call each line of output comes from. I will need more info to answer your question.
Regards.
To be more exact, I used this code with only the Python wrapper installed:
#!~/yanmtt/py36/bin/python
# -*- coding: utf-8 -*-
import sentencepiece as spm
# train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# avoid shadowing the built-in `str`
text = 'I saw a boy with a telescope'
print('Input: ')
print(text)
# encode: text => id
pieces = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)
print('Encoded pieces: ')
print(pieces)
print('Encoded ids: ')
print(ids)
# decode: id => text
print('Decoded from pieces: ')
print(sp.decode_pieces(pieces))
print('Decoded from ids: ')
print(sp.decode_ids(ids))
And the output I get is:
Input:
I saw a boy with a telescope
Encoded pieces:
['�I', '�sa', 'w', '�a', '�bo', 'y', '�with', '�a', '�', 'te', 'le', 's', 'c', 'op', 'e']
Encoded ids:
[6, 291, 89, 10, 448, 40, 26, 10, 9, 228, 126, 8, 82, 310, 20]
Decoded from pieces:
I saw a boy with a telescope
Decoded from ids:
I saw a boy with a telescope
So, since the sentence is correctly decoded, it should be a matter of display/rendering.
Hi
I agree with you.
Hello!
Just wanted to know if there is an alternative to installing sentencepiece. It seems to require sudo access and, in addition, I get the following error when I run:
make -j $(nproc)
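If only the Python API is needed, one alternative that avoids building from source is the prebuilt wheel, which installs into the user site-packages without sudo:

```shell
# Prebuilt wheels exist for most common platforms; --user avoids sudo.
pip install --user sentencepiece
```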