[REVIEW] Optimize Embedding Creation by 4.8x

This PR speeds up embedding creation.

Benchmarks

Embedding creation takes 131 s now vs. 634 s previously on mainline (about 4.8x faster).
RAPIDS takes 131 s now vs. 175 s for Sentence Transformers (about 1.3x faster, due to faster RAPIDS tokenization).

The core improvement is that we now clip the trailing zero padding from the input to BERT, so the model no longer performs redundant operations on padding positions.
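As a rough illustration of the clipping idea (not the code in this PR), here is a minimal pure-Python sketch. It assumes a right-padded batch with a 0/1 attention mask; the function name `clip_trailing_padding` and the token ids are made up for the example.

```python
from typing import List, Tuple

Batch = List[List[int]]


def clip_trailing_padding(input_ids: Batch, attention_mask: Batch) -> Tuple[Batch, Batch]:
    """Drop trailing columns that are padding in every row of a right-padded batch.

    Hypothetical helper for illustration; assumes mask value 1 marks a real token
    and 0 marks padding.
    """
    # Index just past the last real token in each row (0 for an all-padding row).
    lengths = [
        max((i + 1 for i, m in enumerate(row) if m), default=0)
        for row in attention_mask
    ]
    # Keep at least one column so the model never receives an empty input.
    keep = max(max(lengths, default=0), 1)
    return [row[:keep] for row in input_ids], [row[:keep] for row in attention_mask]


# Two sequences padded to a fixed length of 8; the longest real sequence has 4 tokens.
# Token ids are illustrative only.
input_ids = [
    [101, 7592, 2088, 102, 0, 0, 0, 0],
    [101, 7592, 102, 0, 0, 0, 0, 0],
]
attention_mask = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
]

clipped_ids, clipped_mask = clip_trailing_padding(input_ids, attention_mask)
print(len(clipped_ids[0]))  # sequence length after clipping
```

With tensor inputs the same clip would typically be a single column slice (for example `input_ids[:, :keep]`), and the shorter row still keeps its padding up to the longest sequence in the batch.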

Todo:

rapidsai / rapids-examples