rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Use Subword NMT inline in my python code #88

Closed ajesujoba closed 4 years ago

ajesujoba commented 4 years ago

I am training an NMT system where I need to apply BPE online as I train (don't mind my approach). I have my BPE model and the sentences to encode. Going through the documentation, after installing Subword NMT with pip install subword-nmt, I could not find any inline Python functions to use. Or am I wrong? Are there inline Python functions, such as applybpe(bpemodel, ...), that I can use?

rsennrich commented 4 years ago

yes, you can apply BPE inline using the BPE class:

from subword_nmt.apply_bpe import BPE

check the bottom of apply_bpe.py to see how to initialize the class: https://github.com/rsennrich/subword-nmt/blob/75a69fc153c9e71b1436c36f939cb772d3382fc4/subword_nmt/apply_bpe.py#L364

you can then use the method process_line to apply BPE: https://github.com/rsennrich/subword-nmt/blob/75a69fc153c9e71b1436c36f939cb772d3382fc4/subword_nmt/apply_bpe.py#L367

ajesujoba commented 4 years ago

Thank you very much, @rsennrich.

rsennrich commented 4 years ago

yes: the read_vocabulary function can be used to load your vocabulary from a file and apply a frequency threshold. You can then pass the resulting set to the initializer of BPE.

https://github.com/rsennrich/subword-nmt/blob/75a69fc153c9e71b1436c36f939cb772d3382fc4/subword_nmt/apply_bpe.py#L353

On 28/04/2020 18:28, ALABI Jesujoba Oluwadara wrote:

Hi @rsennrich https://github.com/rsennrich, sorry for disturbing you. Is there a way I can also add a vocabulary threshold in my case, as described earlier? Sorry for the inconvenience.
