google-research / long-range-arena

Long Range Arena for Benchmarking Efficient Transformers
Apache License 2.0

Nested Structure in ListOps #14

Closed adamsolomou closed 3 years ago

adamsolomou commented 3 years ago

Hi,

According to a previous issue #6, the brackets '[', ']' should be preserved during tokenization, resulting in the following set of unique tokens

(']', '1', '4', '7', '9', '8', '2', '5', '3', '0', '6', '[MIN', '[SM', '[MAX', '[MED').

However, based on your code (input_pipeline.py), the following set of tokens is used to encode the input

('SM', '2', 'MAX', '8', '5', '0', '4', '1', '7', '6', '3', 'MIN', '9', 'MED').

Evidently, if you run

encoder.decode(x)

where x is an encoded input sample, the result does not contain the brackets '[', ']'. For instance, the following is the (truncated) outcome of decoding the first encoded sample in the training set

MIN 1 MED 8 0 1 3 0 6 1 7 MED SM 4 MED SM 2 9 MED 9 9 2 MIN 8 3 5 4 8 5 6 2 6 0 5 8 2 6 8 9 8 3 4 SM 3 SM MIN MIN MED 5 8 7 9 1 7 1 8 8 MAX 2 5 7 1 1 0 MAX 6 2 5 0 MIN 2 2 4 8 7 1 MIN 1 SM 3 6 4 0 5 5 MED 6 3 8 0 5 6 4 3 7 8 0 6 8 8 3 MAX 5 1 3 5 MED 3 SM 3 2 6 MAX 2 MIN 8 MAX 8 2 5 8 7 1 5 9 0 0 9 MAX 8 1 MAX 5 9 4 1 1 3 2 3 2 2 MIN 1 9 0 2 9 3 8 MAX ....

So I was wondering: why are the brackets '[', ']' ignored? Without them the nested structure is no longer preserved.

vanzytay commented 3 years ago

The brackets are required. This issue has been fixed internally and we are going to push an update to this codebase soon (in Jan). Since this public version acts as a mirror, we have not synced it for a while now.

adamsolomou commented 3 years ago

Thanks for clarifying!

So the correct way to go would be to tokenize according to this vocabulary

(']', '1', '4', '7', '9', '8', '2', '5', '3', '0', '6', '[MIN', '[SM', '[MAX', '[MED')?

cifkao commented 3 years ago

I suppose the fix would be to pass something like reserved_tokens=['[', ']'] to tfds.deprecated.text.Tokenizer to prevent it from ignoring these?
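To illustrate what that would change: reserved tokens are split out as standalone tokens instead of being discarded along with other non-alphanumeric characters. This is a plain-Python sketch of that splitting behaviour, not the actual tfds.deprecated.text.Tokenizer implementation:

```python
import re

def tokenize(s, reserved_tokens=()):
    """Sketch of tokenization with optional reserved tokens.

    Reserved tokens are emitted as standalone tokens; everything
    else is split on non-alphanumeric characters (so brackets are
    dropped unless they are reserved).
    """
    if reserved_tokens:
        pattern = "(" + "|".join(re.escape(t) for t in reserved_tokens) + ")"
        parts = re.split(pattern, s)
    else:
        parts = [s]
    tokens = []
    for part in parts:
        if part in reserved_tokens:
            tokens.append(part)
        else:
            tokens.extend(re.findall(r"\w+", part))
    return tokens

expr = "[MAX 4 [MIN 2 3 ] 1 ]"
print(tokenize(expr))
# -> ['MAX', '4', 'MIN', '2', '3', '1']  (brackets lost)
print(tokenize(expr, reserved_tokens=("[", "]")))
# -> ['[', 'MAX', '4', '[', 'MIN', '2', '3', ']', '1', ']']
```

Note that even with reserved tokens, '[' comes out as its own token rather than staying attached to the operator names.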

adamsolomou commented 3 years ago

If you address the issue this way, it gives rise to the following vocabulary

('MED', '5', '8', '4', '0', ']', '1', '3', '6', '7', '2', 'MIN', '9', '[', 'MAX', 'SM')

which differs from this

(']', '1', '4', '7', '9', '8', '2', '5', '3', '0', '6', '[MIN', '[SM', '[MAX', '[MED')

in the sense that '[' is encoded as a token on its own rather than as part of '[MIN', '[SM', '[MAX', '[MED'.
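For what it's worth, since the raw ListOps expressions appear to be whitespace-delimited, a plain whitespace split would already keep the operators fused to their opening brackets (a sketch under that assumption, not the pipeline's actual code):

```python
# Assuming the raw ListOps data is whitespace-delimited, str.split()
# yields tokens matching the '[MIN'/'[SM'/'[MAX'/'[MED' vocabulary.
expr = "[MAX 4 [MIN 2 3 ] 1 ]"
tokens = expr.split()
print(tokens)
# -> ['[MAX', '4', '[MIN', '2', '3', ']', '1', ']']
```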

@vanzytay could you please clarify which is the right way to go?