stanfordnlp / pyreft

ReFT: Representation Finetuning for Language Models
https://arxiv.org/abs/2404.03592
Apache License 2.0

[P0] Make `make_last_position_supervised_data_module` parallelizable to speed up processing! #85

Open truskovskiyk opened 1 month ago

truskovskiyk commented 1 month ago

Hey team,

I am running into a performance issue with large datasets (~10k samples or more).

Calling the `make_last_position_supervised_data_module` function is slower than the training itself. The root cause is that the function processes each sample individually in a for loop: link.

Instead of processing samples one at a time, we could perform this operation in batch mode. For example, we could use "batch mapping" as described in the Hugging Face documentation.

Could we add an option to perform this operation in batch mode?

I am happy to send a PR with this change.
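As a rough illustration of the idea (not pyreft's actual implementation), the per-sample loop can be replaced by tokenizing whole batches at once; `fake_tokenize` below is a hypothetical stand-in for a Hugging Face tokenizer, which accepts a list of strings in a single call, and `make_module_batched` is a sketch of how the last-prompt-position bookkeeping could be done batch-wise:

```python
# Hypothetical sketch: batch the tokenization instead of looping per sample.
# `fake_tokenize` stands in for a real tokenizer call like tokenizer(texts),
# which is much faster on a list of strings than on one string at a time.

def fake_tokenize(texts):
    # A real tokenizer accepts the whole list at once; whitespace splitting
    # is used here only so the example is self-contained.
    return [t.split() for t in texts]

def make_module_batched(prompts, outputs, batch_size=1000):
    """Process samples in batches rather than one by one (illustrative only)."""
    all_input_ids, all_positions = [], []
    for start in range(0, len(prompts), batch_size):
        batch_prompts = prompts[start:start + batch_size]
        batch_outputs = outputs[start:start + batch_size]
        full_texts = [p + " " + o for p, o in zip(batch_prompts, batch_outputs)]
        full_toks = fake_tokenize(full_texts)      # one call per batch
        prompt_toks = fake_tokenize(batch_prompts)  # one call per batch
        for full, prompt in zip(full_toks, prompt_toks):
            all_input_ids.append(full)
            # index of the last prompt token, as in last-position supervision
            all_positions.append(len(prompt) - 1)
    return {"input_ids": all_input_ids, "intervention_positions": all_positions}

data = make_module_batched(["hello world", "foo bar baz"], ["yes", "no"])
```

With the `datasets` library, the same effect would come from `Dataset.map(fn, batched=True)`, where `fn` receives a dict of lists for an entire batch; that is the "batch mapping" pattern the Hugging Face documentation describes.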

frankaging commented 1 month ago

@truskovskiyk thanks! feel free to submit a PR for that --- that would be great!

frankaging commented 1 month ago

Priority set to P0, and assigned to @truskovskiyk for the PR.