Feature Request: Word+Character-level tokenization

pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch

https://pytorch.org/text

BSD 3-Clause "New" or "Revised" License

Feature Request: Word+Character-level tokenization #37

Closed nelson-liu closed 7 years ago

nelson-liu commented 7 years ago

Hi, thanks for your awesome work on this; the library looks super useful. I was wondering whether it is possible to tokenize a sequence into both words (a list of strings) and characters (a list of lists of single-character strings). From a look through the source code, this doesn't seem to be supported yet, but I may have missed something.

I'd be happy to contribute something to extend torchtext to support this, but I'm not sure what the proper way to handle this would be (ideally it'd be extensible to other tokenization schemes as well, but perhaps that's a stretch). Thoughts?

Thanks!
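For concreteness, here is a minimal sketch of the two output shapes being asked for. It is plain Python, not torchtext code; the function names are hypothetical and whitespace splitting merely stands in for whatever word tokenizer is actually used.

```python
from typing import List, Tuple


def tokenize_words(text: str) -> List[str]:
    """Split on whitespace; a stand-in for any word-level tokenizer."""
    return text.split()


def tokenize_chars(words: List[str]) -> List[List[str]]:
    """Turn each word into a list of single-character strings."""
    return [list(word) for word in words]


def tokenize_both(text: str) -> Tuple[List[str], List[List[str]]]:
    """Return the word-level and character-level views of the same text."""
    words = tokenize_words(text)
    return words, tokenize_chars(words)


words, chars = tokenize_both("a cat sat")
print(words)  # ['a', 'cat', 'sat']
print(chars)  # [['a'], ['c', 'a', 't'], ['s', 'a', 't']]
```

Deriving the character view from the word list keeps the two outputs aligned: `chars[i]` always spells `words[i]`.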

jekbradbury commented 7 years ago

I would create two fields with different tokenizers, then use something like I used for the SNLI loader with trees to use both of the fields with the same input data column. But you’re right, there may be a more elegant way if you’re willing to modify the core code.

nelson-liu commented 7 years ago

Thanks for the response, @jekbradbury; that does indeed sound like it would work.

> I would create two fields with different tokenizers, then use something like I used for the SNLI loader with trees to use both of the fields with the same input data column.

To be clear, are you referring to this area of text/snli.py? Let me know if I'm getting this right: you're basically using two fields (text_field and parse_field) on the sentence1_binary_parse key of the JSON dataset?

Assuming I got that correct: how would you extend this to a TSV or a CSV?

Sorry if these are obvious questions, and thanks again.
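One way to picture the same idea for a TSV file is sketched below. This is plain Python using the standard `csv` module, not torchtext's own loader; the column names, the sample data, and the `COLUMN_MAP` structure are all assumptions made for illustration. Each input column maps to one or more (output name, tokenizer) pairs, much as the JSON case attaches two fields to one key.

```python
import csv
import io

# Hypothetical two-column TSV: a label and a sentence.
TSV_DATA = "label\tsentence\npos\tgood movie\nneg\tbad plot\n"


def word_tokenize(text):
    """Whitespace split; stands in for a real word tokenizer."""
    return text.split()


def char_tokenize(text):
    """One list of single-character strings per word."""
    return [list(word) for word in text.split()]


# One input column can feed several outputs, each with its own tokenizer.
COLUMN_MAP = {
    "label": [("label", lambda s: s)],
    "sentence": [("words", word_tokenize), ("chars", char_tokenize)],
}


def load_examples(handle):
    """Read TSV rows and apply every tokenizer registered for each column."""
    examples = []
    for row in csv.DictReader(handle, delimiter="\t"):
        example = {}
        for column, targets in COLUMN_MAP.items():
            for name, tokenizer in targets:
                example[name] = tokenizer(row[column])
        examples.append(example)
    return examples


examples = load_examples(io.StringIO(TSV_DATA))
print(examples[0]["words"])
print(examples[0]["chars"])
```

Switching to CSV in this sketch would only require changing the `delimiter` argument.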

kklemon commented 6 years ago

I also got stuck at this problem.

As @jekbradbury described, two different fields, each with its own tokenization and processing, can be used. However, when using a vocabulary for the character field, it seems that the build_vocab method of the Field class is not designed to handle lists of character lists, which is also apparent from the code.

Would it be necessary to change the code so that build_vocab can also handle lists (of characters), or is there another way to get it to work?
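To illustrate what handling "lists of character lists" would involve, here is a rough sketch in plain Python. It does not use or modify torchtext's `build_vocab`; the helper names and the `<unk>`/`<pad>` special tokens are assumptions. The key step is flattening one level of nesting before counting.

```python
from collections import Counter
from itertools import chain


def build_char_vocab(examples, specials=("<unk>", "<pad>")):
    """Build index<->character tables from sentences given as lists of character lists."""
    counter = Counter()
    for sentence in examples:
        # Flatten one level: list of character lists -> stream of characters.
        counter.update(chain.from_iterable(sentence))
    # Most frequent first; ties broken alphabetically so the order is deterministic.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [ch for ch, _ in ordered]
    stoi = {ch: i for i, ch in enumerate(itos)}
    return itos, stoi


def numericalize(sentence, stoi, unk="<unk>"):
    """Map each character to its index, keeping the word/character nesting."""
    return [[stoi.get(ch, stoi[unk]) for ch in word] for word in sentence]


data = [[["a", "b"], ["b"]], [["b", "c"]]]
itos, stoi = build_char_vocab(data)
ids = numericalize([["a", "z"], ["c"]], stoi)
print(itos)
print(ids)
```

Padding the nested index lists to a rectangular shape for batching is left out here; it would need both a maximum word length and a maximum sentence length.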

oya163 commented 5 years ago

I am also stuck on the same problem. I am trying to implement a BiLSTM+CNN with the help of torchtext, but I seem to have gotten lost. If there is a clear direction, it would be a great help.

binhna commented 5 years ago

> I am also stuck on the same problem. I am trying to implement a BiLSTM+CNN with the help of torchtext, but I seem to have gotten lost. If there is a clear direction, it would be a great help.

Have you solved this yet? I am also trying to combine word-level and char-level inputs using torchtext.