stmichael / data-import

Sequel-based DSL to migrate data from a legacy database to a new home.
MIT License

Provide options to import mapping execution contexts. #54

Open · andyt opened 10 years ago

andyt commented 10 years ago

You might want a few more unit tests, and it's possible this doesn't cover all execution contexts, though it covers all of mine.

I've added a new acceptance spec.
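
For illustration, the shape of the change is roughly this (a sketch; the option names and the accessor are hypothetical, not necessarily the exact interface of the commit):

# Pass options in when running the import...
DataImport.run_config! 'mappings.rb', tenant_id: 42

# ...and read them back inside a mapping's execution context:
import 'Users' do
  from 'tblUser', primary_key: 'sID'
  to 'users'

  mapping 'sID' do |context, id|
    # `options` stands in for a hypothetical accessor on the context
    { legacy_id: id, tenant_id: options[:tenant_id] }
  end
end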

stmichael commented 10 years ago

This is not related to #38, is it? Couldn't your use case be solved with the helper modules discussed in #38?

andyt commented 10 years ago

I don't think helper modules would cover my use case, and I don't think this is really related: helper modules are static extensions to the execution context, whereas I needed the context to be configurable at runtime. That said, I would find helper modules useful, hence my comments on the original ticket.
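
To make the distinction concrete, a helper module in the sense of #38 would be a fixed extension, something like this sketch (the helper registration syntax is hypothetical, since #38 isn't implemented):

module SlugHelper
  # static behaviour, decided when the mapping file is written
  def slugify(name)
    name.downcase.gsub(/\W+/, '-')
  end
end

import 'Users' do
  helper SlugHelper   # hypothetical #38-style inclusion

  mapping 'sName' do |context, name|
    { name: name, slug: slugify(name) }
  end
end

What I need is for values to arrive at runtime, per import run, which a static module can't provide on its own.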

senny commented 10 years ago

@andyt why not use ENV variables for this configuration?
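
That is, something along these lines (sketch):

# The caller sets the configuration in the environment...
ENV['IMPORT_TENANT_ID'] = '42'
DataImport.run_config! 'mappings.rb'

# ...and the mapping reads it back without any extra plumbing:
import 'Users' do
  mapping 'sID' do |context, id|
    { legacy_id: id, tenant_id: ENV['IMPORT_TENANT_ID'].to_i }
  end
end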

stmichael commented 10 years ago

Your solution is very specific to your use case. I don't like the idea of passing the options all around the code just to make them available in the execution context.

andyt commented 10 years ago

Extending the context of a mapping would seem to be quite a common requirement. You might want to do this for configuration or scoping, perhaps based on some selections in an interface.

However, I agree that passing options everywhere isn't elegant, although that's a side effect of the multiple contexts used. If there were a single context (if such a thing were possible), the options wouldn't be so pervasive.

In my use case, ENV variables won't work. I'm using your library within https://github.com/mperham/sidekiq, which is multi-threaded. I would prefer to avoid using Thread.current, which would be another option.
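
To illustrate the constraint, here's a sketch of the Thread.current fallback I'd rather avoid (the worker name is hypothetical):

require 'sidekiq'

class ImportWorker
  include Sidekiq::Worker

  def perform(options)
    # Sidekiq runs many jobs concurrently in one process, so anything
    # global (ENV, class-level state) can be clobbered by a parallel job.
    # Thread-locals avoid the race but smear plumbing across the code:
    Thread.current[:import_options] = options
    DataImport.run_config! 'mappings.rb'
  ensure
    Thread.current[:import_options] = nil
  end
end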

Anyway, I'm open to other suggestions. Thanks again for a great library.

stmichael commented 10 years ago

@andyt Did you find any solution for your problem?

I just had another idea. Suppose the helper modules from issue #38 already existed. How about you take the parameters you receive from Sidekiq and feed them into a module which is then included in the import blocks? That's more structured than defining global variables, and the DSL won't get much more complicated.

andyt commented 10 years ago

I think your suggestion suffers from the same problem: unless I can make the module chosen for inclusion dynamic, I can only ever include the same module, and I still have to resort to Thread.current. Or am I misunderstanding your suggestion?

stmichael commented 10 years ago

I don't know your exact use case or the environment in which you'd like to use data-import. I assumed you wanted to pass some arguments you received from Sidekiq's perform method to data-import. In that case I think it would be possible to use the helpers.

module OptionsHelper
  # one way to read and write the options: a module-level store
  def self.options=(opts)
    @options = opts
  end

  def self.options
    @options
  end
end

class Worker
  def perform(options)
    # feed the Sidekiq arguments into the helper before the import runs
    OptionsHelper.options = options
    DataImport.run_config! mapping_path
  end
end

And in the mapping you include OptionsHelper.
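
Roughly like this (again using the hypothetical #38-style inclusion):

import 'Users' do
  helper OptionsHelper   # hypothetical #38-style inclusion

  mapping 'sID' do |context, id|
    { legacy_id: id, tenant_id: OptionsHelper.options[:tenant_id] }
  end
end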

@andyt what do you think?

andyt commented 10 years ago

Yes, that's the kind of thing I thought you were suggesting. Wouldn't all threads in a process access the same options?
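
For example, with the module-level store above (a contrived sketch of two concurrent jobs):

t1 = Thread.new do
  OptionsHelper.options = { tenant_id: 1 }
  sleep 0.01                   # another job gets scheduled in between...
  puts OptionsHelper.options   # ...so this may print {:tenant_id=>2}
end
t2 = Thread.new { OptionsHelper.options = { tenant_id: 2 } }
[t1, t2].each(&:join)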

stmichael commented 10 years ago

Oh, now I see your point. Sorry it took so long :cry:

I don't know Sidekiq, but if the jobs all run in the same process, the options would indeed be shared across them.

I'm not a great fan of background job libraries that run all jobs in one process, like delayed_job (although I'm not sure that's still the case) and apparently Sidekiq too. We once had a use case in which a worker had to process a large amount of data. Unfortunately, the implementation of our worker was so lousy that it allocated a huge amount of memory without completely freeing it. After 3-4 finished jobs the worker process had allocated about 2GB of memory (Ruby doesn't return memory to the OS once it has allocated it). After another few jobs the memory was full, even swapping became impossible, and this led to a complete freeze of the machine, including the main website.

Of course it would have been wise to make the implementation more memory-aware. But still, I find it odd that such jobs can freeze the machine. After a job has run, I would expect it to vanish completely without leaving any traces (except the ones it SHOULD leave). And that's why I love Resque: every job runs in a child process of the worker process. After the job finishes, the child process is destroyed, freeing all allocated memory and resources.
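
Schematically, the fork-per-job model looks like this (reserve_job is a stand-in for Resque's queue polling, not its API):

loop do
  job = reserve_job          # stand-in: pop the next job off the queue
  if (child = fork)
    Process.wait(child)      # parent just waits for the job to finish
  else
    job.perform              # child does the work...
    exit!                    # ...then exits, so the OS reclaims everything
  end
end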

Would that be an option for you?

andyt commented 10 years ago

I like Resque too, but Sidekiq has a better memory footprint (given no leaks!). We could switch to Resque, but the commit I made in my fork is working for me.