Closed: nelson-liu closed this 7 years ago
I would create two fields with different tokenizers, then use something like I used for the SNLI loader with trees to use both of the fields with the same input data column. But you’re right, there may be a more elegant way if you’re willing to modify the core code.
Thanks for the response @jekbradbury , that does indeed sound like it would work.
> I would create two fields with different tokenizers, then use something like I used for the SNLI loader with trees to use both of the fields with the same input data column.
To be clear, are you referring to this area of text/snli.py? Let me know if I'm getting this right: you're basically using two fields (`text_field` and `parse_field`) on the `sentence1_binary_parse` key of the JSON dataset?
Assuming I got that correct: how would you extend this to a TSV or a CSV?
Sorry if these are obvious questions, and thanks again.
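For the TSV/CSV question, the idea of pointing two fields at one input column can be sketched without torchtext at all. The snippet below is a minimal plain-Python illustration, not torchtext's API: `RAW_TSV`, `COLUMN_FIELDS`, and `load_examples` are hypothetical names, and whether torchtext's own tabular loader accepts the same column-to-fields mapping for TSV/CSV is not confirmed in this thread.

```python
import csv
import io

# Hypothetical TSV with a header row; in practice this would be a file on disk.
RAW_TSV = "label\ttext\npos\tgood movie\nneg\tbad plot\n"

# Map each input column to one or more (output_name, tokenizer) pairs.
# Listing two pairs under "text" yields two views of the same column,
# mirroring the two-fields-on-one-key pattern discussed above.
COLUMN_FIELDS = {
    "label": [("label", lambda s: s)],
    "text": [
        ("words", str.split),                               # list of str
        ("chars", lambda s: [list(w) for w in s.split()]),  # list of list of str
    ],
}


def load_examples(raw, column_fields):
    """Read TSV text and apply every tokenizer registered for each column."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
    examples = []
    for row in reader:
        example = {}
        for column, targets in column_fields.items():
            for name, tokenize in targets:
                example[name] = tokenize(row[column])
        examples.append(example)
    return examples


examples = load_examples(RAW_TSV, COLUMN_FIELDS)
print(examples[0])
```

Each example ends up with both a `words` and a `chars` entry derived from the single `text` column, so the two views stay aligned word for word.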
I also got stuck at this problem.
As @jekbradbury described, two different fields, each with its own tokenization and processing, can be used. But when building a vocabulary for the character field, the `build_vocab` method of the `Field` class does not seem to be designed to handle lists of character lists, which is also apparent from the code.
Would it be necessary to change the code so that `build_vocab` can also handle lists (of characters), or is there another way to get it to work?
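One possible workaround, short of changing `build_vocab`, is to flatten the nested character lists before counting. The sketch below is plain Python and does not use torchtext's vocabulary classes; `build_char_vocab` and the special tokens `<unk>` and `<pad>` are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def build_char_vocab(examples, specials=("<unk>", "<pad>")):
    """Build char-to-index and index-to-char maps from nested char lists.

    `examples` is a list of sentences; each sentence is a list of words;
    each word is a list of single-character strings.
    """
    # Flatten sentence -> word -> char into one stream of characters.
    all_chars = chain.from_iterable(chain.from_iterable(examples))
    counts = Counter(all_chars)
    # Specials first, then characters by descending frequency (ties alphabetical).
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    itos = list(specials) + ordered
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


examples = [
    [["t", "h", "e"], ["c", "a", "t"]],
    [["a"], ["h", "a", "t"]],
]
stoi, itos = build_char_vocab(examples)

# Numericalize one sentence while keeping its nested (word, char) structure.
unk = stoi["<unk>"]
ids = [[stoi.get(c, unk) for c in word] for word in examples[0]]
print(itos)
print(ids)
```

The vocabulary is built over the flat stream, but lookup preserves the nesting, which is the shape a character-level encoder per word would need.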
I am also stuck in the same problem. Trying to implement BiLSTM+CNN with the help of torchtext, but I seem to get lost. If there is a clear direction, it would be a great help.
> I am also stuck in the same problem. Trying to implement BiLSTM+CNN with the help of torchtext, but I seem to get lost. If there is a clear direction, it would be a great help.
Have you solved this yet? I am trying to implement word-level combined with char-level using torchtext too
Hi, thanks for your awesome work on this; this library looks super useful. I was wondering whether it is possible to tokenize a sequence into both words (a list of strings) and characters (a list of lists of length-1 strings). From a look through the source code, it doesn't seem to be supported yet, but I may have missed something.
I'd be happy to contribute something to extend `torchtext` to support this, but I'm not sure what the proper way to handle this would be (ideally it'd be extensible to other tokenization schemes as well, but perhaps that's a stretch). Thoughts? Thanks!
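To make the two target shapes concrete, here is a minimal plain-Python sketch of what I have in mind. The function names (`word_tokenize`, `char_tokenize`, `nested_tokenize`) are hypothetical and not part of torchtext; passing the outer and inner tokenizers as arguments is one way the scheme could stay extensible.

```python
from typing import Callable, List


def word_tokenize(text: str) -> List[str]:
    """Split on whitespace: 'the cat' becomes a list of word strings."""
    return text.split()


def char_tokenize(word: str) -> List[str]:
    """Split a word into a list of length-1 strings."""
    return list(word)


def nested_tokenize(
    text: str,
    outer: Callable[[str], List[str]] = word_tokenize,
    inner: Callable[[str], List[str]] = char_tokenize,
) -> List[List[str]]:
    """Apply `outer` to the text, then `inner` to each resulting unit."""
    return [inner(unit) for unit in outer(text)]


sentence = "the cat sat"
words = word_tokenize(sentence)    # list of str
chars = nested_tokenize(sentence)  # list of list of 1-length str
print(words)
print(chars)
```

Swapping `inner` for, say, a subword splitter would give a different nested scheme without touching the rest.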