Closed varisd closed 2 months ago
DecontaminateStep currently processes each corpus individually as a monolingual corpus which can result in removing varying number of lines across a dataset languages.
Solved in PR #37
DecontaminateStep currently processes each corpus individually as a monolingual corpus which can result in removing varying number of lines across a dataset languages.