Processing in 100M-sentence chunks helps prevent OOM errors and allows using a smaller machine. It also makes the step scale to higher-resource languages.
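For illustration, here is a minimal sketch of the chunking idea (not the actual pipeline code); the 100M chunk size is from this PR, while `read_chunks`, the placeholder processing step, and the output naming are hypothetical stand-ins:

```python
# Sketch only: stream a corpus in fixed-size chunks so that at most
# one chunk is held in memory at a time.
import itertools

CHUNK_SIZE = 100_000_000  # sentences per chunk, as in this PR


def read_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield lists of at most chunk_size lines from the corpus file."""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            yield chunk


def process_corpus(path):
    """Process each chunk independently and write a separate output part."""
    for i, chunk in enumerate(read_chunks(path)):
        processed = [line.strip() for line in chunk]  # placeholder for the real step
        with open(f"{path}.part{i}", "w", encoding="utf-8") as out:
            out.writelines(line + "\n" for line in processed)
```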
The run completed successfully for the en-uk language pair that previously failed: log.
If we don't want to split the work into multiple Taskcluster tasks, which would complicate the task graph, we could also implement a mechanism that resumes from the last unprocessed part after pre-emption; see the sketch below.
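A hedged sketch of that resume-on-pre-emption idea: before processing each part, check for a completion marker and skip parts that already finished. The marker files and `process_part` helper are illustrative, not the pipeline's actual API:

```python
# Sketch only: a re-scheduled task skips parts that completed before
# pre-emption and resumes at the first unprocessed one.
import os


def process_part(i: int, out_dir: str) -> None:
    """Hypothetical stand-in for the real per-chunk processing step."""
    with open(os.path.join(out_dir, f"part{i}.out"), "w") as f:
        f.write(f"processed part {i}\n")


def run_with_resume(num_parts: int, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_parts):
        marker = os.path.join(out_dir, f"part{i}.done")
        if os.path.exists(marker):
            continue  # finished before pre-emption; skip on restart
        process_part(i, out_dir)
        open(marker, "w").close()  # write marker only after the part succeeds
```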
Currently, I have to use standard (non-preemptible) instances because processing can take several days on a larger student corpus (400M sentences):
Closes #721