I think representations should not use obscure types like `uint8` and `uint16`. These are uncommon in Python code and can cause, for example, overflow bugs for unsuspecting users (say, if they want to offset the IDs to add custom tokens). Also, PyTorch doesn't support `uint16`.
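To illustrate the overflow concern, here is a minimal sketch using plain NumPy. The array `ids` and the offset of 4 are made up for the example and are not taken from the library:

```python
import numpy as np

# Hypothetical token IDs stored in a narrow unsigned type.
ids = np.array([0, 100, 255], dtype=np.uint8)

# Shifting the IDs to make room for 4 custom tokens stays in uint8,
# so the largest ID silently wraps around (255 + 4 = 259 -> 3).
shifted = ids + 4
print(shifted.dtype, shifted.tolist())  # uint8 [4, 104, 3]

# Widening to a common signed type first gives the expected result.
shifted_safe = ids.astype(np.int64) + 4
print(shifted_safe.dtype, shifted_safe.tolist())  # int64 [4, 104, 259]
```

No warning or error is raised in the first case, which is what makes this easy to miss.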
I think the default NumPy integer type, `np.int_` (C `long`), would be a good default, or at least some more common signed type. Alternatively, plain Python lists could be used instead of NumPy arrays.
We could offer a `dtype` parameter (either for each representation, or for functions like `to_pytorch_dataset`, or both) for users who want to save space.
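A rough sketch of what such a `dtype` parameter could look like. The helper `to_id_array` is hypothetical and not part of the existing API. It only shows the idea of defaulting to `np.int_` and letting the user opt in to a narrower type, with a range check so that a too-small type fails loudly:

```python
import numpy as np

def to_id_array(token_ids, dtype=np.int_):
    """Hypothetical helper: convert token IDs to an array of the given dtype.

    Defaults to NumPy's default integer type. A narrower dtype can be
    requested to save space, but values that do not fit raise an error
    instead of wrapping around.
    """
    arr = np.asarray(token_ids, dtype=np.int64)
    info = np.iinfo(dtype)
    if arr.size and (arr.min() < info.min or arr.max() > info.max):
        raise ValueError(f"token IDs do not fit into {np.dtype(dtype).name}")
    return arr.astype(dtype)

default_ids = to_id_array([0, 100, 300])                # default integer type
compact_ids = to_id_array([0, 100, 255], dtype=np.uint8)  # explicit opt-in
```

With this design, calling `to_id_array([0, 300], dtype=np.uint8)` would raise a `ValueError` rather than silently storing a wrong ID.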