mit-nlp / MITIE

MITIE: library and tools for information extraction

Training NER on a new corpus #11

Closed KanwalSingh closed 9 years ago

KanwalSingh commented 9 years ago

Is there a memory leak? It's taking a lot of memory for very few training samples. It gets killed after printing this:

num feats in chunker model: 4095
train: precision, recall, f1-score: 0.984615 0.984615 0.984615
now do training
num training samples: 198

I watched the memory usage and saw that it kept increasing gradually once it reached this point, as if each iteration were leaving garbage behind in memory.

davisking commented 9 years ago

That's just how the optimizer works. To do the training with any non-trivial amount of data you need to compile in 64-bit mode and use a 64-bit OS. Otherwise you can only use 2GB of RAM, which isn't very much.

KanwalSingh commented 9 years ago

@davisking the top command showed 13 GB of memory usage before the process got killed (safe to assume it used up all the available memory and was killed for that reason).

My machine has 16 GB of memory, a 64-bit OS, and a 2.1 GHz processor.

davisking commented 9 years ago

13GB is a lot for 198 samples. How exactly did you run the trainer?

KanwalSingh commented 9 years ago

@davisking I used the total_word_feature_extractor.dat for the vocab file

trainer = ner_trainer(vocabfile)

for line in input_lines:
    sample = strip_braces(line)
    trainer.add(sample)

trainer.num_threads = 16
ner = trainer.train()

Here the sample variable is of type ner_training_instance. This is how we are initialising it:

line = "{india bulls :: builder} panvel greens"
strip_line = "india bulls panvel greens"
sample = ner_training_instance(strip_line.split())
sample.add_entity(xrange(0,1), "builder")

davisking commented 9 years ago

That doesn't look like it should run at all. The trainer.add() method is supposed to take a ner_training_instance object. Is that what strip_braces() returns?

Can you post a complete program that reproduces the problem? One that I can run?

KanwalSingh commented 9 years ago

Yes, that's exactly what it returns.

arjunmajum commented 9 years ago

@KanwalSingh the way you are labeling your training data has a bug...

sample.add_entity(xrange(0,1),"builder")

will label only "india" as builder, not "india bulls" as the line "{india bulls :: builder}..." would suggest. The correction is as follows:

sample.add_entity(xrange(0,2),"builder")

See https://docs.python.org/2/library/functions.html#xrange for documentation on the xrange function and http://www.pythoncentral.io/how-to-use-pythons-xrange-and-range/ for example usage.

Please fix this bug and let us know if the problem persists.
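For concreteness, the corrected call on the example tokens would look roughly like this (a minimal sketch; token indices are 0-based and the end of the xrange is exclusive):

```python
from mitie import *  # ner_training_instance, as used in MITIE's train_ner.py example

tokens = ["india", "bulls", "panvel", "greens"]
sample = ner_training_instance(tokens)

# xrange(0, 2) covers token indices 0 and 1, i.e. "india bulls".
# xrange(0, 1) would have covered only index 0, i.e. just "india".
sample.add_entity(xrange(0, 2), "builder")
```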

KanwalSingh commented 9 years ago

Hi, that's my bad in describing the label data; yes, I am giving it as (0, 2).


davisking commented 9 years ago

What about the RAM usage? When I run the Python trainer I don't get anything like 13GB of RAM usage. What happens when you run our provided train_ner.py program? Does it use a lot of RAM, or does this happen only on your data?

manalgandhi commented 9 years ago

I am facing this problem too.

When I run the train_ner.py Python program, it uses about 718 MB of RAM (the RES value in top).

When I run it on my data it uses 6.7 GB of RAM (the RES value in top).

Screenshots attached.

The Python program was faster than the C++ program at determining the best C, but after the best C was determined the Python program started consuming a lot of RAM (it did not consume much until then). (Best C was calculated to be 300.69.)

The C++ program took about 2.5 hours to determine the best C, whereas the Python program took about an hour. In both cases I had to force-shutdown the system an hour or so after the best C was determined.

I've used the code under tools to build a custom total_word_feature_extractor.

This is the output from the python code before it starts determining the best C:

words in dictionary: 282
num features: 271
now do training
C:           20
epsilon:     0.01
num threads: 4
cache size:  5
loss per missed segment: 3
C: 20       loss: 3        0.949386
C: 35       loss: 3        0.949386
C: 20       loss: 4.5      0.948403
C: 5        loss: 3        0.944963
C: 20       loss: 1.5      0.946437
C: 27.5     loss: 3.375    0.948894
C: 21.2605  loss: 3.35924  0.95086
C: 19.135   loss: 3.2257   0.949877
C: 22.119   loss: 3.19385  0.950369
C: 21.9391  loss: 3.60092  0.949877
C: 21.941   loss: 3.36495  0.95086
best C: 21.2605
best loss: 3.35924
num feats in chunker model: 4095
train: precision, recall, f1-score: 0.996075 0.997543 0.996808
now do training
num training samples: 2043
...

Machine configuration: 8 GB RAM, Intel i5 2.7 GHz, Ubuntu 12.04 64-bit

Please let me know if I've made a mistake somewhere since this was the first time I've executed the program.

[Screenshots: train_ner - custom data - python - top; train_ner - python - top]

davisking commented 9 years ago

Can you post the inputs you used to run this such that I can exactly reproduce this issue?

manalgandhi commented 9 years ago

I'm not sure if I am allowed to share the training data. If I am allowed to, I'll post it here on Monday.

davisking commented 9 years ago

Sounds good.

The only reason I can think of that might cause this is if you use a very large number of labels. How many different label strings did you use? E.g. the example program uses just person and org, so that's 2 different labels. If you used 1000 then it's going to take a huge amount of RAM, because the last step solves a big multiclass linear SVM whose RAM usage is linear in the number of distinct labels.
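A quick way to check this is to count the distinct label strings in the training data before training. The sketch below is my own, assuming training lines in the "{text :: label}" format quoted earlier in this thread; label_pattern and count_labels are hypothetical helpers, not part of MITIE:

```python
import re
from collections import Counter

# Pull every "{text :: label}" annotation out of a line and count the distinct
# label strings. Whitespace is deliberately NOT stripped here, so near-duplicate
# labels such as 'price' and ' price ' show up as separate entries.
label_pattern = re.compile(r"\{[^}]*?::([^}]*)\}")

def count_labels(lines):
    counts = Counter()
    for line in lines:
        for label in label_pattern.findall(line):
            counts[label] += 1
    return counts

print(count_labels(["{india bulls :: builder} panvel greens"]))
# Counter({' builder': 1})
```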

manalgandhi commented 9 years ago

There are 8 different labels in the training data.

And the data contains about 600 sentences. Each sentence/phrase contains three to ten words.

davisking commented 9 years ago

That amount should be fine.

KanwalSingh commented 9 years ago

I too had 10 labels max


davisking commented 9 years ago

Can you post your training data? :)

We (and a bunch of other groups) have been using MITIE to train models and haven't had any issues. So I need one of you guys to post a program that reproduces the issue you are having or it's going to be impossible to debug :)

KanwalSingh commented 9 years ago

Hey Davis, I understand that. I will share the training data by Monday and will also see if we can share the code with you. Hope that's fine with you; please also share your personal email.


davisking commented 9 years ago

Sounds good. You can email me at davis@dlib.net

manalgandhi commented 9 years ago

@davisking, I won't be able to share the training data. Sorry!

@KanwalSingh could you please share the training data you used with Davis.

KanwalSingh commented 9 years ago

@davisking I have mailed you the training data

davisking commented 9 years ago

Thanks. Please also include a working python program that when executed causes this large ram usage bug to appear so that I can debug it.

KanwalSingh commented 9 years ago

done, mailed it to you


davisking commented 9 years ago

I've looked this over and the problem is in the training data given to MITIE. However, before I explain the issue it's helpful to understand a little about how MITIE works. In MITIE, each sentence is chunked into entities and then each chunk is labeled by a multiclass classifier (a multiclass support vector machine specifically). To classify each chunk, MITIE creates a 501354 dimensional vector which is given to the multiclass classifier.

Now the way the multiclass classifier works is that it learns one linear function for each label (and an additional one for the special 'not an entity' category). So if you have N labels then the classifier has 501354*(N+1) numbers it needs to learn. Moreover, since we use a cutting plane solver there is an additional factor of RAM usage in the solver, let's call it Z. The amount of RAM used by the multiclass trainer is 501354*(N+1)*Z*sizeof(double) + (the amount of RAM needed to store the training data). That means there are 501354*(N+1)*Z*sizeof(double) bytes of RAM usage no matter the size of your dataset.

The Z value is normally in the range 40-80. However, if you give input data that is basically impossible to correctly classify then the solver needs to work harder to find a way to separate it so Z might go up to about 200. It will also take a long time to train. In your case, you gave data with these 18 labels: 'builder', 'project', ' builder', 'size', 'location', 'price', ' price ', 'Infra', 'time', 'price psf', 'loction', ' time', ' size', ' location', ' infra', ' price', ' project', 'infra'.

Now what's happening is MITIE is trying to figure out, for example, how to separate the ' price', 'price', and 'price ' labels, but this is probably impossible as I'm sure you meant to give all these things the same label. The MITIE solver still tries, and it needs a large Z to build up a high accuracy solution that can do this. So if Z=200 and N=18 then 501354*(N+1)*Z*sizeof(double) = 14GB of RAM.
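As a back-of-the-envelope check of that arithmetic (a sketch using the numbers from this explanation, not values queried from MITIE):

```python
FEATURE_DIM = 501354   # dimensionality of the per-chunk feature vector
DOUBLE_BYTES = 8

def svm_ram_gib(num_labels, Z):
    # one linear function per label plus one for 'not an entity',
    # times the cutting-plane factor Z, times 8 bytes per double
    return FEATURE_DIM * (num_labels + 1) * Z * DOUBLE_BYTES / 2.0**30

print(svm_ram_gib(18, 200))  # ~14.2 GiB: the pathological case described above
print(svm_ram_gib(8, 60))    # ~2.0 GiB: 8 clean labels with a typical Z
```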

So the solution is to fix your training data so that the labels make sense. If you do that then I would expect more like 2GB-4GB of RAM usage. I have also updated the MITIE code to print the labels supplied by the user. So if you pull from github and rerun it will show these labels and that should make this kind of error a lot easier to spot in the future.
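A minimal sketch of collapsing those near-duplicate labels before calling add_entity might look like this (not part of MITIE; normalize_label is a hypothetical helper):

```python
def normalize_label(label):
    # Strip stray whitespace and unify case so ' price', 'price ' and 'Price'
    # all become the same label string.
    return label.strip().lower()

raw_labels = ['builder', ' builder', 'price', ' price ', 'Infra', 'infra', 'loction']
print(sorted(set(normalize_label(l) for l in raw_labels)))
# ['builder', 'infra', 'loction', 'price']  -- 'loction' is a typo that still needs a manual fix
```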

gagan-bansal commented 7 years ago

@davisking Thanks for such a detailed explanation of how MITIE works. It solved my issue with the long training duration. I am using rasa_nlu with MITIE, and there are a few issues (160 and 260) in rasa_nlu whose root cause may be the one you have explained here.

ghost commented 6 years ago

gagan-bansal did you solve your issue?