pkufool / simple-sentencepiece

A simple sentencepiece encoder and decoder
Apache License 2.0

Decode Pieces #1

Open iprovalo opened 2 months ago

iprovalo commented 2 months ago

Is there a way to invoke this module similarly to Google's sentencepiece Decode?

  // Given a sequence of pieces, decodes it into a detokenized output.
  virtual util::Status Decode(const std::vector<std::string> &pieces,
                              std::string *detokenized) const;

I have an original spm model, which I converted to bpe in Python:

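        import sentencepiece as spm

        input_dir = "source.spm"   # path to the original SentencePiece model
        output_dir = "bpe.vocab"   # path of the plain-text vocab file to write

        sp = spm.SentencePieceProcessor(model_file=input_dir)

        # Collect every (piece, score) pair in the model's vocabulary.
        vocab_size = sp.get_piece_size()
        tokens_and_scores = [(sp.id_to_piece(i), sp.get_score(i)) for i in range(vocab_size)]

        # Write one "piece score" pair per line; pieces contain non-ASCII
        # characters, so set the encoding explicitly.
        with open(output_dir, 'w', encoding='utf-8') as f:
            for token, score in tokens_and_scores:
                f.write(f"{token} {score}\n")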

but there is no API to accept string pieces like in the original API.

Thank you!

iprovalo commented 2 months ago

@pkufool I really like the idea of not having the protobuf dependencies in sentencepiece!

pkufool commented 2 months ago

@iprovalo Decoding string pieces into a string is very simple: just concatenate them, then replace "▁" (U+2581, the whitespace marker, which looks like "_") with spaces.
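
For reference, the concatenate-and-replace approach might look roughly like the sketch below. It is not part of this module's API: `decode_pieces` is a hypothetical helper name, and the sketch assumes the pieces use the standard SentencePiece whitespace marker "▁" (U+2581) and that the first piece carries a leading marker which should be dropped. Byte-fallback pieces such as `<0x41>` and control symbols are not handled.

```python
def decode_pieces(pieces, space_marker="\u2581"):
    """Join string pieces and turn the whitespace marker back into spaces.

    Hypothetical helper, not part of simple-sentencepiece or sentencepiece.
    """
    # Concatenate the pieces, then map every marker to a plain space.
    text = "".join(pieces).replace(space_marker, " ")
    # The first word usually starts with a marker, which would leave one
    # leading space after replacement; strip that single space only.
    if text.startswith(" "):
        text = text[1:]
    return text


if __name__ == "__main__":
    print(decode_pieces(["\u2581Hello", "\u2581wor", "ld"]))
```

Passing `space_marker="_"` would cover a vocab file that really uses a plain underscore, at the cost of also replacing underscores that belong to the text.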

iprovalo commented 2 months ago

Thank you, @pkufool !!!