InternalDataSequence::accumulateFeatureData very high memory usage

hiroshi-manabe / CRFSegmenter

A multi-language segmenter using high-order CRF.


InternalDataSequence::accumulateFeatureData very high memory usage #7

Open bratao opened 7 years ago

bratao commented 7 years ago

Hello again @hiroshi-manabe,

I'm writing a Python wrapper for your excellent library, which I plan to release soon.

However, while porting some internal projects to this library, I noticed that memory usage exploded compared to CRFSuite.

I have just started analyzing whether I can reduce the memory usage. I plan to use compact data structures to store the data, such as https://github.com/Tessil/hat-trie, and I have already seen some good improvements.

Now I have reached InternalDataSequence::accumulateFeatureData(). It is responsible for 70% of memory usage during training. Do you have any idea how it could be optimized?
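To illustrate the general idea of compact storage mentioned above, here is a minimal sketch that stores each distinct feature string once and represents sequences as 32-bit IDs. It uses only the standard library rather than hat-trie, and `FeatureInterner`, `encodeSequence`, and the feature strings in the comments are hypothetical names for illustration, not part of CRFSegmenter.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not CRFSegmenter code): maps each distinct
// feature string to a dense 32-bit ID so the string is stored once.
class FeatureInterner {
public:
    // Returns the existing ID for a feature, or assigns the next free one.
    std::uint32_t intern(const std::string &feature) {
        auto it = ids_.find(feature);
        if (it != ids_.end()) {
            return it->second;
        }
        // Guard against running out of 32-bit IDs.
        assert(ids_.size() < std::numeric_limits<std::uint32_t>::max());
        const auto id = static_cast<std::uint32_t>(ids_.size());
        ids_.emplace(feature, id);
        return id;
    }

    // Number of distinct features seen so far.
    std::size_t size() const { return ids_.size(); }

private:
    std::unordered_map<std::string, std::uint32_t> ids_;
};

// Converts a sequence of feature strings (e.g. "w=the") into IDs,
// so repeated features cost 4 bytes each instead of a full string.
std::vector<std::uint32_t> encodeSequence(
        FeatureInterner &interner,
        const std::vector<std::string> &features) {
    std::vector<std::uint32_t> ids;
    ids.reserve(features.size());
    for (const auto &feature : features) {
        ids.push_back(interner.intern(feature));
    }
    return ids;
}
```

Whether this helps depends on how often feature strings repeat in the training data. A trie-based map such as hat-trie could replace the `std::unordered_map` here to shrink the key storage further.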

Thank you

hiroshi-manabe commented 7 years ago

Hello @bratao,

> I'm writing a Python wrapper for your excellent library, which I plan to release soon.

Thank you for your work! I'm really excited to hear that :)

I'm on vacation in Montreal now, so I'll respond to you after the 12th.

À bientôt (see you soon)!

hiroshi-manabe commented 7 years ago

Hello @bratao,

> Now I have reached InternalDataSequence::accumulateFeatureData(). It is responsible for 70% of memory usage during training. Do you have any idea how it could be optimized?

I thought the part that uses the most memory would be this ( https://github.com/hiroshi-manabe/CRFSegmenter/blob/4c274a871a4e90b727b2cd166f4367ce71e0b519/HighOrderCRF/HighOrderCRFProcessor.cpp#L173 ), which can take hundreds of GB in my application (CJK segmentation / POS tagging), so I haven't been very serious about optimizing accumulateFeatureData().

Can you share the data you used for testing?