microsoft / BlingFire

A lightning fast Finite State machine and REgular expression manipulation library.
MIT License
1.84k stars 129 forks

A list of feature requests for BlingFire #129

Open GeorgeS2019 opened 3 years ago

GeorgeS2019 commented 3 years ago

A recent evaluation of the feasibility of using BlingFire to tokenize GPT-2 input in .NET suggests there is a practical need for BlingFire to interoperate with tensor-based text manipulation through a .NET library.

This issue aims to gather feedback, since there are potential new .NET users here who are interested in deep NLP and might consider using dotnet/TorchSharp for interoperability with BlingFire, in the same spirit as the PyTorch use cases.

For these .NET users, one tentative idea is to look at the NLP features provided by PyTorch/Text and evaluate whether many of them are already provided by BlingFire, perhaps with better performance.

We need feedback from people willing to look through the functionality provided by PyTorch/Text, so that these PyTorch NLP features can be made available in TorchSharp (through BlingFire).

==> Likewise, any .NET NLP features found in PyTorch/Text that BlingFire does not yet cover could provide ideas and inspiration for what else to develop in BlingFire.

Requests

Could BlingFire address all the tokenization needs listed here by Onnxruntime.Extension?

SergeiAlonichau commented 3 years ago

I think BlingFire can solve most of the tokenization ops needs; whatever is missing, please let me know and I can add it. It would be great to see BlingFire integrated into ONNX Extension Ops.

GeorgeS2019 commented 3 years ago

@SergeiAlonichau After your feedback, I dug into the code: => BlingFire is already integrated into ONNX Extension Ops

I wonder what else (from BlingFire) can be integrated. Has BlingFire been, or should it be, integrated into and made available from the distributed ortcustomops.dll?

Exported function

Still exploring => can BlingFire learn from the implemented tokenizer to improve the .NET tokenizer experience?

GeorgeS2019 commented 2 years ago

@SergeiAlonichau further update