I am trying to train on the latest Wikipedia dump (about 15 GB), which is a large corpus with a high token count (approx. 360M). Since the co-occurrence matrix needs to live in memory, I want to set a minimum frequency count for words when building the vocabulary, which in turn determines the size of the co-occurrence matrix. I could not find any parameter for that. The code is also written in Cython, so it is hard for a beginner like me to follow. Any idea how I can build the vocabulary and the co-occurrence matrix in a memory-efficient way?
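To make the question concrete, here is a rough pure-Python sketch of the behaviour I am after. It is not the library's code: the names `build_vocab`, `build_cooccurrence`, `min_count` and `window` are hypothetical, and the 1/distance weighting is only an assumption about how the counts might be weighted.

```python
from collections import Counter, defaultdict


def build_vocab(sentences, min_count=5):
    """First pass: count tokens and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    # Most frequent words get the smallest ids.
    kept = [word for word, c in counts.most_common() if c >= min_count]
    return {word: idx for idx, word in enumerate(kept)}


def build_cooccurrence(sentences, vocab, window=2):
    """Second pass: sparse, distance-weighted counts for in-vocabulary words.

    Rare words are dropped before windowing, so the surviving neighbours
    of a dropped word end up one position closer to each other.
    """
    cooc = defaultdict(float)
    for tokens in sentences:
        ids = [vocab[t] for t in tokens if t in vocab]
        for pos, center in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                context = ids[pos + offset]
                # Store each unordered pair once (upper triangle only).
                key = (center, context) if center <= context else (context, center)
                cooc[key] += 1.0 / offset
    return cooc
```

`sentences` has to be something that can be iterated twice (for example a list, or an object that re-opens the file on each pass), since the vocabulary is built before the counts. A Python dict of tuples is probably still too heavy for the full dump, so this is only meant to show where a `min_count` cut-off would go.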