Bug fixes: offloading with a single MFC and parameter spec for ReaLModel.

The major change:

Refactor key enumeration and parameter count for ReaLModel.

Original functions are grouped into classes to make the code more readable.

The parameter count is now derived automatically from each key and its corresponding tensor-parallel shape. This brings two benefits: (1) we no longer need to maintain a separate counter function, which is error-prone, and (2) the parameter count is accurate, so it can be used anywhere, not only for partitioning pipeline stages.
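As a rough illustration of the idea (not the actual ReaLModel code), the sketch below derives a parameter count by summing the element counts of per-key tensor-parallel shapes. The key names, the `FULL_SHAPES` table, and the column/row-parallel suffix rules are all hypothetical and chosen only for this example.

```python
from math import prod

# Hypothetical key -> full (unpartitioned) shape table for a tiny model.
FULL_SHAPES = {
    "embed.weight": (1000, 64),
    "layer0.attn.qkv.weight": (192, 64),
    "layer0.attn.out.weight": (64, 64),
    "layer0.ln.weight": (64,),
}

# Assumed partitioning rules: column-parallel keys split dim 0,
# row-parallel keys split dim 1, everything else is replicated.
COLUMN_PARALLEL = ("embed.weight", "qkv.weight")
ROW_PARALLEL = ("out.weight",)


def tp_shape(key, shape, tp_size):
    """Return the shape of `key` held by one tensor-parallel rank."""
    if key.endswith(COLUMN_PARALLEL):
        return (shape[0] // tp_size, *shape[1:])
    if key.endswith(ROW_PARALLEL):
        return (shape[0], shape[1] // tp_size)
    return shape


def param_count(keys, tp_size):
    """Sum element counts over the per-rank shapes of the given keys."""
    return sum(prod(tp_shape(k, FULL_SHAPES[k], tp_size)) for k in keys)


print(param_count(FULL_SHAPES, tp_size=1))
print(param_count(FULL_SHAPES, tp_size=2))
```

Because the count is computed from the same key and shape information used to build the parameters, there is no second function that can drift out of sync with it.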

Bug fixes:

Omit offloading when we have only one non-train MFC, e.g., generate.
Remove the sequence_parallel argument in TP partition functions since it is irrelevant to the TP partitioning strategy.
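The first fix above can be pictured with the minimal sketch below. The `MFC` dataclass, its `is_train` and `offload` fields, and the `resolve_offload` helper are hypothetical names for illustration; they are not the project's actual types, and the real check may live elsewhere.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MFC:
    """Hypothetical stand-in for a model function call description."""
    name: str
    is_train: bool
    offload: bool = False


def resolve_offload(mfcs: List[MFC]) -> List[MFC]:
    """Disable offloading when at most one non-train MFC exists."""
    non_train = [m for m in mfcs if not m.is_train]
    if len(non_train) <= 1:
        # Assumed behavior: with a single non-train MFC (e.g., generate),
        # offloading is skipped entirely.
        return [replace(m, offload=False) for m in mfcs]
    return list(mfcs)


single = resolve_offload([MFC("generate", is_train=False, offload=True)])
print(single[0].offload)
```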

New features:

Add a pad_to_max_length option to the prompt-answer and prompt datasets. It is intended purely for system-level benchmarking.
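A minimal sketch of what such an option might do, assuming right-padding with a pad token ID after truncation. The `encode_prompts` function and its parameters other than `pad_to_max_length` are hypothetical; the real datasets may pad differently.

```python
from typing import List


def encode_prompts(
    token_ids: List[List[int]],
    max_length: int,
    pad_id: int,
    pad_to_max_length: bool = False,
) -> List[List[int]]:
    """Truncate each sequence and optionally right-pad it to max_length."""
    out = []
    for ids in token_ids:
        ids = ids[:max_length]
        if pad_to_max_length:
            # Every sample gets the same length, so benchmark runs
            # see a fixed amount of work per sample.
            ids = ids + [pad_id] * (max_length - len(ids))
        out.append(ids)
    return out


batch = encode_prompts([[5, 6], [7, 8, 9, 10, 11]], max_length=4, pad_id=0,
                       pad_to_max_length=True)
print(batch)
```

With fixed-length samples, throughput numbers are not affected by the length distribution of the dataset, which is why the option is useful only for benchmarking.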
