pat / thinking-sphinx

Sphinx/Manticore plugin for ActiveRecord/Rails
http://freelancing-gods.com/thinking-sphinx
MIT License

memory bloat in development mode #232

Closed ghazel closed 12 years ago

ghazel commented 13 years ago

It seems that with a define_index in development mode, something about class-reloading causes the Rails (2.3.11) server to bloat significantly in memory. This does not seem to be a problem in production, presumably because classes are not reloaded, and is not a problem if define_index is conditionally disabled in development mode.

pat commented 13 years ago

Hi Greg

Can you try listing the models with Sphinx indices in your config/sphinx.yml file, and see if that helps? It should stop TS loading all models to find out which are indexed.

development:
  indexed_models:
    - Article
    - User

ghazel commented 13 years ago

Yes, that seems to also prevent the problem from occurring.

pat commented 13 years ago

Good to know. It's not the ideal fix, and I would prefer TS not have to determine which models are indexed on every request, but I've not found a way around that.

ghazel commented 13 years ago

What do you need it for exactly? I've encountered two interesting methods for interacting with "all model classes":

This one is better:


module ActiveRecord
  class Base

    class << self
      def inherited_with_ts(child)
        inherited_without_ts(child)
        child.class_eval do
          # Do Something (placeholder: runs in the context of each new subclass)
        end
      end
      alias_method_chain :inherited, :ts
    end

  end
end

If the model classes are already loaded (more of a hack, try to avoid):

models = []
ObjectSpace.each_object(Class) do |c|
  next unless c.ancestors.include?(ActiveRecord::Base)
  next if c == ActiveRecord::Base
  next if c.abstract_class?      # check this first to skip the table lookup
  next unless c.table_exists?
  models << c
end

models.each do |model|
  model.instance_eval do
    # Do Something (placeholder: runs in the context of each model class)
  end
end

pat commented 13 years ago

The former is only going to fire once the models are loaded - so we still hit the same problem (don't know which models are indexed until they're loaded).

The reason they're needed is two-fold: firstly, to ensure all document ids are unique across an entire Sphinx setup (all indices, not just per-index), they're calculated with an offset and multiplier based on the primary key of the model, the number of models indexed, and this model's place in the list of indexed models. So this means all models need to be loaded whenever Sphinx is generating the configuration file (which it does as part of the indexing process).
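To make the offset-and-multiplier idea concrete, here is a minimal sketch. It assumes the document id is simply `primary_key * number_of_indexed_models + position_in_list`; the model list and both method names are hypothetical and are not taken from Thinking Sphinx's source.

```ruby
# Hypothetical list of indexed models. The order matters, because each
# model's position in the list is used as its offset.
INDEXED_MODELS = %w[Article User Comment].freeze

# Builds a document id that is unique across all indices, assuming the
# formula primary_key * model_count + offset.
def sphinx_document_id(model_name, primary_key, models = INDEXED_MODELS)
  offset = models.index(model_name)
  raise ArgumentError, "#{model_name} is not indexed" if offset.nil?

  primary_key * models.size + offset
end

# Reverses the calculation: recovers the model name and primary key.
def decode_document_id(document_id, models = INDEXED_MODELS)
  primary_key, offset = document_id.divmod(models.size)
  [models[offset], primary_key]
end
```

Under this assumed scheme, both the multiplier and the offset depend on the full list, so adding or removing an indexed model changes every document id. That is why the complete list has to be known when the configuration is generated.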

Secondly: Sphinx has not (until the 1.10 and 2.0.x releases) had proper string attribute support, which means determining which model a Sphinx document in the search results belongs to is sketchy at best - each search result just has attributes, with no index name. Thinking Sphinx works around this by storing a CRC32'd hash of the model name in an attribute (thus, an integer). It then pairs model names with CRC32 values and, when populating results with model objects, uses those pairs to determine which model each result is from. This isn't so critical when we're dealing with single model searches - but can have an impact for STI and cross-app searches. And how can TS be sure whether STI is in play without loading all the models?
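A rough sketch of the CRC32 pairing described above, using Ruby's standard `Zlib.crc32`. The model names, the constant names and the `model_name_for` helper are assumptions for illustration; only the `class_crc` attribute name comes from this thread.

```ruby
require "zlib"

# Hypothetical list of indexed model names. The lookup table can only be
# built once every indexed model name is known.
MODEL_NAMES = %w[Article User Comment].freeze

# Maps the integer CRC32 of each model name back to the name itself.
CRC_TO_MODEL = MODEL_NAMES.each_with_object({}) do |name, map|
  map[Zlib.crc32(name)] = name
end.freeze

# Resolves the integer stored in a search result's class_crc attribute
# to a model name, raising if the CRC belongs to no known model.
def model_name_for(class_crc)
  CRC_TO_MODEL.fetch(class_crc) do
    raise KeyError, "no indexed model for CRC #{class_crc}"
  end
end
```

A result whose `class_crc` equals `Zlib.crc32("User")` resolves to `"User"`, while a CRC for a model that was never loaded cannot be resolved at all, which illustrates why the pairs have to cover every indexed model.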

Now, just recently I've allowed for both the crc attribute (class_crc) and - if Sphinx 2.0.x is being used - a proper string attribute with the class name. While I was replying to this issue earlier, I had the thought that maybe we don't need to load all the models in this second situation, because we have a proper string of the model name.

I need to think this through and double-check the edge cases, but maybe we can improve the situation significantly - at least, when people are using the latest Sphinx release.

Hopefully some of this makes sense - it turned into a mini-essay ;)

ghazel commented 13 years ago

I see!

A curious set of issues, which it seems you already have a good handle on. One question, though: why did loading all of the models cause such tremendous memory bloat? It seemed to be larger than if the page naturally used all of the models.

pat commented 13 years ago

I'm really not sure why it has such a noticeable impact, I'm afraid.

As for my reworking around the problem, I gave it a shot, but there's still a need to have all indexes known about when searching, sadly. I'll keep thinking about it, though.

Govinda-Fichtner commented 13 years ago

It would be really nice if the "indexed_models:" option could be mentioned here: http://freelancing-god.github.com/ts/en/advanced_config.html Though it is possible that I just overlooked it elsewhere.

pat commented 13 years ago

Good point Govinda - sorry, should have put it there some time ago. It's now mentioned in the docs.

Govinda-Fichtner commented 13 years ago

Which docs? Couldn't find it at http://freelancing-god.github.com/ts/en/advanced_config.html ...

pat commented 13 years ago

It's at the bottom of that page (you may have a cached version), and also mentioned at the bottom of the Common Questions and Issues page too.

Govinda-Fichtner commented 13 years ago

Great! Seems it really was cached. Thanks for Thinking Sphinx and the good documentation!