pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.49k stars, 815 forks

Optionally ignore utf-8 decoding error when converting std::string to python str. #2126

Status: Closed (closed by shuminghu 1 year ago)

shuminghu commented 1 year ago

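Summary: When language models use the C++ tokenizer, the outputs are C++ strings (std::string) that are not necessarily valid UTF-8. pybind11's default cast from std::string to Python str uses strict UTF-8 decoding, which raises an error on invalid bytes. We relax the decoding by optionally passing the 'ignore' error handler, so undecodable bytes are dropped instead of raising.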
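
For illustration only, the difference between strict and 'ignore' decoding can be reproduced in pure Python. This is a minimal sketch, not the code from this PR: the helper names `decode_strict` and `decode_ignore` and the sample bytes are assumptions chosen to mimic a tokenizer that splits a multi-byte character across tokens.

```python
# Hypothetical example: bytes such as a C++ tokenizer might return when a
# token boundary falls in the middle of a multi-byte UTF-8 character.
raw = "héllo".encode("utf-8")[:2]  # 'h' plus the first byte of 'é' only


def decode_strict(data: bytes) -> str:
    """Strict UTF-8 decoding; raises UnicodeDecodeError on invalid bytes."""
    return data.decode("utf-8")


def decode_ignore(data: bytes) -> str:
    """Relaxed decoding; invalid or incomplete byte sequences are dropped."""
    return data.decode("utf-8", errors="ignore")


try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The incomplete trailing byte is silently discarded.
relaxed = decode_ignore(raw)
```

Note that 'ignore' loses information: the dropped bytes cannot be recovered from the resulting str, which is presumably why the PR makes this behavior optional.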

Reviewed By: Nayef211

Differential Revision: D43970697

facebook-github-bot commented 1 year ago

This pull request was exported from Phabricator. Differential Revision: D43970697
