sfu-natlang / glm-parser

Tree-adjoining grammar based statistical dependency parser using a general linear model (glm).
28 stars 2 forks

Memory consumption in parallel mode #57

Closed jeticg closed 8 years ago

jeticg commented 8 years ago

While running the tagger in parallel, I noticed that the first iteration usually takes significantly less time than the later ones. To be more specific, on my own tiny cluster the first training iteration on penn2malt usually takes 3.4 hours, while the rest take about 6-7 hours. Since each iteration does the same amount of work, this slowdown looks like a sign of a memory leak.

The issue was even more severe with glm_parser. Unlike the tagger, which at least completed the test, multiple attempts to train on penn2malt on linearb all led to an out-of-memory error (with only 4 shards). Running it on my cluster was even worse: all the DataNodes threw out-of-memory errors and crashed shortly after. I checked the memory usage and found that all 32GB of memory on one of the nodes had been used up by the parser.

I suspect that this issue might be caused by our Cython-based parser and WeightVector (hvector). I haven't checked the memory management in those implementations yet, but it might also have something to do with Spark.
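
To actually pin this down, something like the following could be run inside the workers. This is just a sketch using only the standard library and PySpark's `mapPartitions`; the per-partition function is a placeholder, not the real parsing call:

```python
import resource

def rss_mb():
    # Peak resident set size of the current process; ru_maxrss is
    # reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def partition_with_memcheck(items):
    # Placeholder for the real per-partition work: here we only
    # materialise the shard, the real code would parse each sentence.
    before = rss_mb()
    materialised = list(items)
    after = rss_mb()
    # Printed lines end up in the executor logs; a large jump between
    # iterations (rather than within the first one) points at a leak.
    print("partition of %d items, peak RSS %.1f MB -> %.1f MB"
          % (len(materialised), before, after))
    return materialised

# Hypothetical usage inside the Spark job:
#   parsed = training_rdd.mapPartitions(partition_with_memcheck)
```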

Currently I have two ideas for addressing this issue:

  1. We fix the Cython implementation; or
  2. instead of using Cython for the ceisner parser and hvector, we switch to native Python for better memory-management compatibility with Spark and Hadoop.
anoopsarkar commented 8 years ago

We don't know what the issue is, so it is hard to fix. How about we first try solution 2 above to check whether the memory leak is due to the use of Cython in ceisner? There is a pure Python version of the Eisner algorithm in the parse directory. It may be out of date, but it is supposed to be a drop-in replacement for ceisner. Once we try that we can narrow down what is happening and the cause of the excessive memory use in Spark.
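
Roughly, the swap could look like this (just a sketch; the import paths are my guess at what the parse directory exposes and may need adjusting):

```python
# Prefer the pure Python Eisner implementation and fall back to the
# Cython one, so the two can be compared under Spark. The module paths
# below are assumptions, not the exact names in the repo.
try:
    from parse import eisner as eisner_impl    # pure Python version
except ImportError:
    from parse import ceisner as eisner_impl   # Cython version

# Since the Python version is meant to be a drop-in replacement for
# ceisner, the rest of the training code should not need to change.
```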

jeticg commented 8 years ago

The pure Python Eisner implementation has been broken for a long time, since it hasn't been used in a while, and I am trying to get it working again. In the meantime, I should let you know that parallel training with the averaged perceptron for the tagger gives an accuracy of 0.971084244146.
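
In case it helps to see what the parallel step amounts to: if we mix the per-shard weight vectors by averaging them after each iteration, the mixing itself is cheap. A toy sketch with plain dicts standing in for hvector/WeightVector (names are illustrative, not the real code):

```python
from collections import defaultdict

def mix_weight_vectors(shard_weights):
    """Average several sparse weight vectors (dicts) into one."""
    mixed = defaultdict(float)
    for weights in shard_weights:
        for feature, value in weights.items():
            mixed[feature] += value
    n = float(len(shard_weights))
    return {feature: value / n for feature, value in mixed.items()}

# Example with two shards (feature names are made up):
shard_a = {"w_i=the": 1.0, "t_i=DT": 2.0}
shard_b = {"w_i=the": 3.0, "t_i-1=NN": 1.0}
print(mix_weight_vectors([shard_a, shard_b]))
# -> {'w_i=the': 2.0, 't_i=DT': 1.0, 't_i-1=NN': 0.5} (order may vary)
```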

jeticg commented 8 years ago

Vivian tested it on linearb with 10GB per shard and it worked just fine, so I believe it's my mistake.