The Python150kExtractor currently deserializes all objects before splitting them with sklearn. This causes extreme memory usage even on the relatively small py150 dataset and prevents me from running it on a machine with 16 GB of RAM. This PR moves deserialization to after the split, so that only the serialized objects need to be in memory all at once.
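To illustrate the reordering, here is a minimal sketch, not the extractor's actual code. It assumes the serialized input is one JSON document per line, and it uses a plain stdlib shuffle in place of the sklearn split so that it runs without extra dependencies. The names `split_serialized`, `deserialize`, and `raw_lines` are hypothetical and exist only for this example.

```python
import json
import random
from typing import Iterator, List, Tuple


def split_serialized(
    lines: List[str], test_fraction: float, seed: int
) -> Tuple[List[str], List[str]]:
    """Shuffle and split the raw serialized lines without parsing them."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def deserialize(lines: List[str]) -> Iterator[object]:
    """Parse one record at a time, so parsed objects are created lazily."""
    for line in lines:
        yield json.loads(line)


# Hypothetical stand-in data: ten small JSON records, one per line.
raw_lines = [json.dumps({"id": i, "nodes": [i, i + 1]}) for i in range(10)]

# Split first: only strings are held in memory at this point.
train_lines, test_lines = split_serialized(raw_lines, test_fraction=0.2, seed=0)

# Deserialize afterwards, per split, as each record is consumed.
test_ids = [record["id"] for record in deserialize(test_lines)]
```

Because the split only ever handles strings, the parsed objects exist one at a time while a split is being consumed, rather than all at once before the split.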