Rewrite proposal generation from CUDA to Python
Thanks to https://github.com/apache/incubator-mxnet/pull/14363, MXNet CustomOp now runs in parallel. This lets us migrate more CUDA code to Python, which makes the core operators easier to modify.
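As a rough illustration of what the Python side could look like, here is a minimal NumPy sketch of proposal generation: decode anchors with the predicted deltas, clip to the image, drop tiny boxes, then run NMS. The function names, the `+1` pixel box convention, and all default values are assumptions for illustration, not the existing CUDA operator's exact behavior. A CustomOp `forward` could call something like this on the host.

```python
import numpy as np


def decode_boxes(anchors, deltas):
    """Apply (dx, dy, dw, dh) deltas to (x1, y1, x2, y2) anchors."""
    widths = anchors[:, 2] - anchors[:, 0] + 1.0
    heights = anchors[:, 3] - anchors[:, 1] + 1.0
    ctr_x = anchors[:, 0] + 0.5 * (widths - 1.0)
    ctr_y = anchors[:, 1] + 0.5 * (heights - 1.0)

    pred_ctr_x = deltas[:, 0] * widths + ctr_x
    pred_ctr_y = deltas[:, 1] * heights + ctr_y
    pred_w = np.exp(deltas[:, 2]) * widths
    pred_h = np.exp(deltas[:, 3]) * heights

    return np.stack([
        pred_ctr_x - 0.5 * (pred_w - 1.0),
        pred_ctr_y - 0.5 * (pred_h - 1.0),
        pred_ctr_x + 0.5 * (pred_w - 1.0),
        pred_ctr_y + 0.5 * (pred_h - 1.0),
    ], axis=1)


def nms(boxes, scores, iou_thresh):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0] + 1.0) * (boxes[:, 3] - boxes[:, 1] + 1.0)
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        w = np.maximum(0.0, np.minimum(boxes[i, 2], boxes[rest, 2])
                       - np.maximum(boxes[i, 0], boxes[rest, 0]) + 1.0)
        h = np.maximum(0.0, np.minimum(boxes[i, 3], boxes[rest, 3])
                       - np.maximum(boxes[i, 1], boxes[rest, 1]) + 1.0)
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]  # drop boxes overlapping the kept one
    return keep


def generate_proposals(anchors, deltas, scores, im_shape,
                       pre_nms_top_n=6000, post_nms_top_n=300,
                       nms_thresh=0.7, min_size=1.0):
    """Hypothetical helper: anchors/deltas are (N, 4), scores is (N,)."""
    im_h, im_w = im_shape
    order = scores.argsort()[::-1][:pre_nms_top_n]
    boxes = decode_boxes(anchors[order], deltas[order])
    scores = scores[order]

    # Clip boxes to the image boundary.
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, im_w - 1)
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, im_h - 1)

    # Remove boxes smaller than min_size on either side.
    ws = boxes[:, 2] - boxes[:, 0] + 1.0
    hs = boxes[:, 3] - boxes[:, 1] + 1.0
    valid = (ws >= min_size) & (hs >= min_size)
    boxes, scores = boxes[valid], scores[valid]

    keep = nms(boxes, scores, nms_thresh)[:post_nms_top_n]
    return boxes[keep], scores[keep]
```

Keeping the three steps as separate functions makes each one easy to swap or test in isolation, which is the main point of moving this logic out of CUDA.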
[ ] wrap the contrib RoIAlign for our use
Ours was written quite early and differs slightly from the official Detectron implementation. It also lacks FP16 support, a sampling ratio, and the PS (position-sensitive) variant.
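To make the sampling ratio concrete, and as a possible reference when checking a wrapper's output, here is a small NumPy sketch of Detectron-style RoIAlign for a single channel and a single RoI. It is an illustration under assumptions (no half-pixel offset, `sample_ratio` > 0, hypothetical function names), not the contrib operator's code.

```python
import numpy as np


def bilinear(feat, y, x):
    """Bilinearly interpolate a (H, W) map at a continuous point (y, x)."""
    h, w = feat.shape
    if y < -1.0 or y > h or x < -1.0 or x > w:
        return 0.0  # sample lies outside the feature map
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    return (feat[y0, x0] * (1 - ly) * (1 - lx) + feat[y0, x1] * (1 - ly) * lx
            + feat[y1, x0] * ly * (1 - lx) + feat[y1, x1] * ly * lx)


def roi_align(feat, roi, pooled_size, spatial_scale, sample_ratio):
    """Average sample_ratio x sample_ratio bilinear samples per output bin."""
    x1, y1, x2, y2 = [c * spatial_scale for c in roi]
    roi_w = max(x2 - x1, 1.0)
    roi_h = max(y2 - y1, 1.0)
    ph, pw = pooled_size
    bin_h, bin_w = roi_h / ph, roi_w / pw

    out = np.zeros((ph, pw), dtype=np.float64)
    for i in range(ph):
        for j in range(pw):
            acc = 0.0
            for iy in range(sample_ratio):
                # Sample points sit at the centers of a regular sub-grid.
                y = y1 + i * bin_h + (iy + 0.5) * bin_h / sample_ratio
                for ix in range(sample_ratio):
                    x = x1 + j * bin_w + (ix + 0.5) * bin_w / sample_ratio
                    acc += bilinear(feat, y, x)
            out[i, j] = acc / (sample_ratio * sample_ratio)
    return out
```

The loops are deliberately naive; the sketch is meant for comparing values on small inputs, not for speed.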