rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Allow user to specify "removed key" value for nvcategory.set_keys #3113

Closed randerzander closed 4 years ago

randerzander commented 5 years ago

Users sometimes want control over the sentinel value used to replace a value outside the list of "allowed" categorization keys.

nvcategory.set_keys should accept an int parameter, na_rep, to be used instead of -1. The implication is that values in the "allowed" key list must never be encoded as na_rep.

Example:

strs = nvstrings.to_device(['dog', 'fish', 'cat'])
cat = nvcategory.from_strings(strs)
print("Original encoding:")
print(cat.values())

print("Re-encoded:")
cat = cat.set_keys(['dog', 'fish'], na_rep=0)
print(cat.values())

Result:

Original encoding:
0, 1, 2

Re-encoded:
1, 2, 0

davidwendt commented 5 years ago

This would be a significant rework of the nvcategory code. The values are indexes into the key list. If that is no longer the case, as in the example above, all code logic that assumes it would need to be adjusted for the new artificial value. In the provided example, the original keys are [cat, dog, fish] and the values are [1,2,0]; the values are simply indexes into the key list.

The set_keys(['dog','fish']) call returns an nvcategory instance with keys [dog,fish] and values [0,1,-1]. If we add an na_rep value that maps to an existing key index (0 or 1 in this case), then the values need to be shifted around this arbitrary value, meaning the values no longer point into the key list. All the code relies heavily on the indexes being either correct (mapping to valid keys) or -1.

We could introduce an additional map between the artificial values and the real indexes, but this could significantly increase the overhead per nvcategory instance. The largest array in an nvcategory should be the values array, which has the same length as the original strings list it was created from. The keys list should be small in comparison. An additional map would require another array of the same size as the larger values array in order to map all the possible values.
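The index semantics described in this comment can be modelled in plain Python. This is only an illustrative sketch, not the nvcategory implementation: `encode` and `set_keys` are hypothetical helper functions, and the sorted-keys-with-null-first ordering is inferred from the outputs shown in this thread.

```python
def _sort_key(s):
    # Sort None (the null key) first, then strings alphabetically.
    return (s is not None, s or "")


def encode(strings):
    """Return (keys, values) where each value is an index into keys."""
    keys = sorted(set(strings), key=_sort_key)
    index = {k: i for i, k in enumerate(keys)}
    return keys, [index[s] for s in strings]


def set_keys(keys, values, new_keys):
    """Re-encode values against new_keys; entries with no key become -1."""
    new_keys = sorted(set(new_keys), key=_sort_key)
    new_index = {k: i for i, k in enumerate(new_keys)}
    new_values = [
        new_index.get(keys[v], -1) if v >= 0 else -1
        for v in values
    ]
    return new_keys, new_values


keys, values = encode(["dog", "fish", "cat"])
print(keys, values)        # ['cat', 'dog', 'fish'] [1, 2, 0]

keys2, values2 = set_keys(keys, values, ["dog", "fish"])
print(keys2, values2)      # ['dog', 'fish'] [0, 1, -1]
```

In this model every non-negative value is directly usable as `keys[value]`, which is the invariant an arbitrary na_rep would break.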

EvenOldridge commented 5 years ago

To be clear, the reason for this request is that in neural networks these encodings are used to look up memory locations for the categorical embeddings. -1 isn't a valid memory location, so most libraries (fastai specifically) set the null value to 0. I'm not sure whether it needs to be so flexible that any value can be used. A boolean 'is_neuralnet', or some other name, that defaults to False would be fine too.
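To illustrate why 0 is a convenient sentinel for embedding lookups, here is a minimal pure-Python sketch. The table contents and the `lookup` helper are hypothetical and not taken from fastai or any other library; row 0 is assumed to be reserved for "unknown or removed key".

```python
# Hypothetical embedding table: one row of weights per category code.
embedding_table = [
    [0.0, 0.0],  # row 0: reserved for unknown / removed keys
    [0.1, 0.2],  # row 1: 'dog'
    [0.3, 0.4],  # row 2: 'fish'
]


def lookup(codes):
    """Return the embedding row for each code, rejecting out-of-range codes."""
    rows = []
    for code in codes:
        # A plain Python list would silently return the LAST row for -1,
        # so the range check is made explicit here.
        if code < 0 or code >= len(embedding_table):
            raise IndexError(f"code {code} is not a valid embedding row")
        rows.append(embedding_table[code])
    return rows


# Codes as in the requested na_rep=0 encoding: dog=1, fish=2, removed 'cat'=0.
print(lookup([1, 2, 0]))
```

With -1 as the sentinel, the same lookup would either fail or, in containers that allow negative indexing, quietly read the wrong row.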

davidwendt commented 5 years ago

So a null key is valid and will always have an index value of 0. The -1 does not indicate a null entry but an entry that has no key (not even the null key). -1 isn't a valid memory location for nvcategory either.

s = nvstrings.to_device(['dog','fish','cat',None])
c = nvcategory.from_strings(s)
print(c.keys(),c.values())
[None, 'cat', 'dog', 'fish'] [2, 3, 1, 0]

But this key can be removed just like any other key:

s2 = nvstrings.to_device(['dog','fish'])
c2 = c.set_keys(s2)
print(c2.keys(),c2.values())
['dog', 'fish'] [0, 1, -1, -1]

Here the null key has been abandoned along with the cat key, so both values are now assigned -1.

If you wish to re-assign the values for the new keys you can always use the gather method to force a new set of values:

s = nvstrings.to_device(['dog','fish','cat'])
c = nvcategory.from_strings(s)
print(c.keys(),c.values())
['cat', 'dog', 'fish'] [1, 2, 0]

Introduce a null key placeholder during the set_keys call:

s2 = nvstrings.to_device(['dog','fish',None])
c2 = c.set_keys(s2)
print(c2.keys(),c2.values())
[None, 'dog', 'fish'] [1, 2, -1]

Call the gather method with the new values set to valid key indexes. Essentially, I think this just takes the c2.values() array and replaces all the -1 values with 0, since the null key exists and will always have index value 0.

c3 = c2.gather([1,2,0])
print(c3.keys(),c3.values())
[None, 'dog', 'fish'] [1, 2, 0]

EvenOldridge commented 5 years ago

Do we force a null key if it doesn't exist in the data? My understanding is that there has to be a null value in the data for this to work.

This further complicates the problem of remapping to legitimate memory values because now we have to conditionally either map the -1 values to 0 or add 1 to every value depending on whether there was a null in the original set.

I understand that it's possible to reassign the key values, but I'm hoping we can provide that as syntactic sugar so users don't have to remember to do it every time. This is an operation that everyone doing tabular deep learning, NLP, etc. uses, and it would be great to support them directly.
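The conditional remapping described in this comment could be wrapped in a small helper. The sketch below is a hypothetical pure-Python illustration (the name `to_embedding_codes` is invented here), operating on plain lists of keys and values rather than on an nvcategory instance, and assuming the null key, when present, sits at index 0.

```python
def to_embedding_codes(keys, values):
    """Remap codes so that 0 always means 'null or unknown'.

    - If the null key is present (assumed to be at index 0), only the
      -1 entries need to change: they are folded into 0.
    - Otherwise every code is shifted up by one, which frees index 0
      and turns -1 into 0 in the same step.
    """
    if keys and keys[0] is None:
        return [0 if v == -1 else v for v in values]
    return [v + 1 for v in values]


# Without a null key: keys ['dog', 'fish'], values [0, 1, -1]
print(to_embedding_codes(["dog", "fish"], [0, 1, -1]))        # [1, 2, 0]

# With a null key: keys [None, 'dog', 'fish'], values [1, 2, -1]
print(to_embedding_codes([None, "dog", "fish"], [1, 2, -1]))  # [1, 2, 0]
```

Both branches produce the same codes for the same data, which is the behavior a user-facing option would need to guarantee.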

davidwendt commented 5 years ago

The set_keys actually does the remap automatically. Here is the same example but we get rid of dog instead of cat:

s2 = nvstrings.to_device(['cat','fish',None])
c2 = c.set_keys(s2)
print(c2.keys(),c2.values())
[None, 'cat', 'fish'] [-1, 2, 1]

Note that cat has moved from 0 to 1 and fish is still at 2. This is because null is introduced at 0, requiring the remap. Again, the caller can decide that -1 means the same as null for them, re-assign all the -1 values to 0, and then call gather to create the new nvcategory.

Also, this should work regardless of whether null exists in the original set. Just make sure it is included in the set_keys call so it can be mapped.

randerzander commented 5 years ago

@EvenOldridge does this work for you, functionally?

We discussed this a while back, but the refactoring of our string APIs into more direct Pandas mappings is pending a merge into cuDF, followed by a refactor of the libcudf column type to support things like strings and categories directly.

randerzander commented 5 years ago

@EvenOldridge fwiw cudf #2382 may already have what you need.

EvenOldridge commented 5 years ago

Thanks @randerzander, #2382 looks like it offers the functionality we were looking for. For our use case we may also just use a custom variant that hides these operations from the user.

kkraus14 commented 4 years ago

Closing as nvcategory is deprecated as of 0.14