rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Trouble with a JavaScript Corpus #78

Closed shamoons closed 4 years ago

shamoons commented 5 years ago

My corpus isn't a natural language, but rather JavaScript code. I am running:

subword-nmt learn-bpe -s 1000 < data/javascript.txt > data/jsvocab.txt
subword-nmt apply-bpe -c data/jsvocab.txt < data/javascript.txt > data/jsbpe.txt

But then my jsbpe.txt file has a ton of @@. I don't think it's really splitting into the encodings. Am I misunderstanding how to use this?

mpatsis commented 5 years ago

data/jsvocab.txt will contain the learnt BPE merge operations. data/jsbpe.txt will contain the tokenized code. When an identifier is split into subtokens, @@ signifies that more subtokens follow; when no @@ appears, that subtoken is the last one (or the token was never split). For instance, Random -> Rand@@ om
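
The segmentation is reversible: joining each subtoken that ends in @@ with the subtoken that follows it restores the original token. A minimal Python sketch of that join (the same substitution the subword-nmt README suggests doing with sed):

import re

def undo_bpe(line):
    # Drop the "@@ " continuation markers (and a trailing "@@" at end of line).
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

print(undo_bpe("Rand@@ om"))   # Random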

As I have a lot of experience with using BPE on code corpora from a recent paper, I would also advise you to filter non-ASCII sequences. You can use the script non-ascii_sequences_to_unk.py from https://github.com/mast-group/OpenVocabCodeNLM. It will help make your model smaller, faster, and less sensitive to noisy characters (e.g. Chinese characters in strings).
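
If you just want the general idea of that script rather than its exact behaviour, a rough sketch would look something like the following (the <unk> placeholder and the token-level granularity are my assumptions here; the real script may differ):

import sys

def mask_non_ascii(line, placeholder="<unk>"):
    # Replace any whitespace-separated token containing non-ASCII characters.
    return " ".join(tok if tok.isascii() else placeholder for tok in line.split())

for line in sys.stdin:
    print(mask_non_ascii(line.rstrip("\n")))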

shamoons commented 5 years ago

Thank you for the prompt response. I am trying to learn more about NLP, specifically for source code, and am having some struggles. First, I read your paper (thank you for your efforts!) and it was very helpful. What I'm trying to do is create a reasonably ideal set of tokens given some code corpus. BPE seems like a strong candidate since it can be language agnostic (whether spaces matter or not, etc.). However, most tokenizers that I see split on spaces, which may not be ideal for source code (Python, Ruby, etc.).

In jsbpe, I have something like:

'@@ api-@@ projects': '@@ AP@@ I Project@@ s',
'basic-@@ c@@ ss@@ ': 'Basic CS@@ S',

So I assume the @@ is splitting the various tokens. I'm confused about the space, however: "@@ api-@@" has a space, yet in my original training file there was no space before "api-".

Second, I don't fully understand the vocabulary file. It has lines like:

s s
por t
m o
e x
d a
c ache
C ount
tive Donation

I don't get what the two columns are supposed to be. Is each line one token or two?

Thanks again for a great library and I'll definitely check the OpenVocab for code.

mpatsis commented 5 years ago

The best way to tokenize code would probably be to use a lexer for the corresponding programming language (e.g. for JavaScript: https://github.com/aaditmshah/lexer). Lexical analysis is the first phase of any compiler. You can then further split those tokens into subwords (or subtokens) by learning BPE.

In the example you provided, it split 'API into '@@ AP@@ I and Projects', into Project@@ s',. Basically, when it saw the space it treated 'API Projects' as two tokens, 'API and Projects',. The reason is that the input is assumed to already be tokenized, with tokens separated by spaces. You could avoid this by either replacing spaces inside strings with a special character or character sequence, or by replacing strings that contain spaces with the empty string; it depends on what you want to do later. 'api-projects' has no spaces, so it was considered one token and resulted in three subwords: '@@ api-@@ projects'
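
If you don't want to wire up a full lexer right away, a crude regex-based pre-tokenizer can stand in for one while you experiment. This is only an illustration: the token grammar and the use of the ▁ character to protect spaces inside string literals are my own assumptions, not what the linked lexer does.

import re

# Very rough JavaScript token grammar: strings, line comments, identifiers,
# numbers, and any other single non-space character. A real lexer is preferable.
TOKEN_RE = re.compile(r"""
    "(?:\\.|[^"\\])*"        |
    '(?:\\.|[^'\\])*'        |
    //[^\n]*                 |
    [A-Za-z_$][A-Za-z0-9_$]* |
    \d+(?:\.\d+)?            |
    \S
""", re.VERBOSE)

def pre_tokenize(code):
    tokens = TOKEN_RE.findall(code)
    # Keep each string literal as one token by replacing its inner spaces.
    tokens = [t.replace(" ", "\u2581") if t and t[0] in "\"'" else t for t in tokens]
    return " ".join(tokens)

print(pre_tokenize("'api-projects': 'API Projects',"))
# 'api-projects' : 'API▁Projects' ,

The pre-tokenized output then goes through learn-bpe and apply-bpe exactly as before.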

For your second question: the two columns represent the symbol merge operations BPE learned. For instance, the line por t means that por and t were a pair of symbols, both already in the vocabulary, that appeared together most often at that iteration (the line number is the iteration number). The merged symbol port was therefore added to the vocabulary. Basically, BPE starts with a vocabulary of characters and expands it with the above criterion, so if you learn 1000 operations you end up with a vocabulary of 1000 symbols plus the initial characters. The file doesn't directly tell you what the vocabulary is, but it should be trivial to derive it.
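
If it helps to see where lines like por t come from, here is a toy version of the learning loop. The real implementation is subword_nmt/learn_bpe.py and also handles an end-of-word marker, which this sketch leaves out.

from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # word_freqs: word -> corpus frequency; each word starts as a tuple of characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair -> one line of the merge file
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # the merged pair becomes a new symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"port": 5, "sport": 3, "pose": 2}, 3))
# e.g. [('p', 'o'), ('po', 'r'), ('por', 't')]

Each returned pair corresponds to one line of the merge file, and the vocabulary after training is the initial characters plus one new symbol per line.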