Improve memory usage of Python150kExtractor

tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"

http://code2seq.org

MIT License

555 stars, 164 forks

Improve memory usage of Python150kExtractor #125

Closed. alexhorn closed 2 years ago

alexhorn commented 2 years ago

The Python150kExtractor currently deserializes all objects before splitting them with sklearn. This causes extreme memory usage, even with the relatively small py150 dataset, and prevents me from running it on a machine with 16 GB of RAM. This PR moves deserialization to after the split, so that only the serialized objects need to be in memory at once.
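To illustrate the idea, here is a minimal sketch and not the actual diff. It assumes the input holds one JSON-serialized record per line. The function names `split_serialized` and `iter_deserialized` and the sample data are hypothetical, and a standard-library shuffle stands in for the sklearn split so the sketch has no dependencies.

```python
import json
import random


def split_serialized(lines, test_fraction=0.2, seed=0):
    """Shuffle and split raw JSON strings without parsing them."""
    # Only the compact serialized strings are held in memory here.
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def iter_deserialized(lines):
    """Parse one record at a time, so parsed objects do not accumulate."""
    for line in lines:
        yield json.loads(line)


# Hypothetical stand-in data: one JSON-serialized record per line.
raw_lines = [json.dumps([{"type": "Module", "id": i}]) for i in range(10)]

# Split first (strings only), deserialize afterwards (lazily, per record).
train_lines, test_lines = split_serialized(raw_lines)
train_ids = [record[0]["id"] for record in iter_deserialized(train_lines)]
test_ids = [record[0]["id"] for record in iter_deserialized(test_lines)]
```

Because the split operates on strings and parsing happens lazily inside a generator, peak memory is bounded by the serialized data plus a single parsed record, rather than by every parsed object at once.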

urialon commented 2 years ago

Thanks @alexhorn for your contribution!