shogun-toolbox / shogun

Refactoring statistical hypothesis testing framework #2495

Closed lambday closed 8 years ago

lambday commented 10 years ago

This follows from a discussion with @karlnapf and @sejdino. Currently the kernel two-sample test and the kernel independence test live in two different branches of the class tree despite having many things in common. In addition, in the near future we'd want to perform independence tests on streaming data as well, which is presently only possible for the two-sample test (CStreamingMMD). In order to fit everything together nicely and reuse the components that are already there, we think it's better to separate the data-fetching components from the computation components.

Here are some initial thoughts on this (needs further discussion):

CTestData base class (not abstract)

This class will be responsible for providing an interface for fetching data, either as a whole or blockwise, and either merged or unmerged. So it will provide all four methods: get_samples(), get_blocks(index_t num_blocks_current_burst), get_merged_samples() and get_merged_blocks(index_t num_blocks_current_burst).

This class will also be able to handle more than two distributions (since we may soon move into three-variable interaction tests or higher, this will keep it flexible). So we'll maintain an m_num_distributions inside CTestData for that, which can be set by the user. Then we'll register the samples from each distribution using a register_samples(index_t idx, CFeatures* samples) method, which internally keeps them in an array of size m_num_distributions, keeping track of num_samples for each via the CFeatures::get_num_vectors() interface, idx being the index in that array (for example, if we have two distributions p and q, p will be set at arr[0], and so on). For the case where we want to process these samples blockwise, a set_blocksize(index_t idx) method will be available for the user to set the block size per distribution. So the B_x and B_y block sizes are taken care of here.

Also, we can add the shuffling code in CTestData itself; the computation components shouldn't have to bother about data at all. Inside CTestData we can have methods to shuffle the samples from just one distribution, or to merge all of them, shuffle, and redistribute in the same proportions for sampling from the null distribution.
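A minimal sketch of how the proposed CTestData interface could look. The method names (register_samples, set_blocksize, the four fetch methods) come from this thread; the exact signatures, return types, and the shuffling helpers are assumptions:

// Hypothetical sketch of the proposed CTestData interface; in Shogun this
// would derive from CSGObject, and CFeatures/index_t are the usual types.
#include <vector>

class CFeatures;
typedef int index_t; // simplified stand-in for Shogun's index typedef

class CTestData
{
public:
    // number of distributions the test involves (2 for two-sample and
    // independence tests, 3 for three-variable interaction, ...)
    void set_num_distributions(index_t num_distributions);

    // register samples from distribution idx (e.g. p at idx 0, q at idx 1);
    // sample counts are tracked via CFeatures::get_num_vectors()
    void register_samples(index_t idx, CFeatures* samples);

    // per-distribution block size (B_x, B_y, ...)
    void set_blocksize(index_t idx, index_t blocksize);

    // the four fetch methods: whole vs. blockwise, merged vs. unmerged
    virtual CFeatures* get_samples();
    virtual CFeatures* get_blocks(index_t num_blocks_current_burst);
    virtual CFeatures* get_merged_samples();
    virtual CFeatures* get_merged_blocks(index_t num_blocks_current_burst);

    // shuffling for sampling from the null: shuffle one distribution, or
    // merge all, shuffle and redistribute in the original proportions
    void shuffle_samples(index_t idx);
    void shuffle_and_redistribute();

protected:
    index_t m_num_distributions;
    std::vector<CFeatures*> m_samples;   // one entry per distribution
    std::vector<index_t> m_blocksizes;
};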

CStreamingTestData will just be a subclass of this class that uses streaming features to provide the same interface to the computation components.

Now, for the computation hierarchy, we will have one CTestData* m_test_data as a component and a bunch of compute methods. The base class is still CHypothesisTest, which will still provide the interface for computing the statistic, the p-value and the test threshold, and for performing the test itself.

The rest of the hierarchy can stay the same. In the subclasses of CIndependenceTest we'll use m_test_data->get_samples()/m_test_data->get_blocks(index_t num_blocks_current_burst) inside the compute_statistic()/compute_blockwise_statistic() methods, and in the subclasses of CTwoSampleTest we'll use m_test_data->get_merged_samples()/m_test_data->get_merged_blocks(index_t num_blocks_current_burst) accordingly.

The way it will work is: first the user creates an instance of CTestData (for CDenseFeatures) or CStreamingTestData (for CStreamingFeatures). Then they feed the test data to some CHypothesisTest instance and use the rest just as before. This will work for independence/conditional-independence tests and for two-sample/three-variable interaction tests. It will also work for the different feature types required by the independence test framework. Also, if someone wants to perform a B-test on non-streaming samples, they can (it may not be that useful, but we then don't have to create an instance of CStreamingFeatures from CFeatures every time someone wants to do that).
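To make the intended flow concrete, a hedged usage sketch. The setter name set_test_data and the feature variables are placeholders, not an existing API; CQuadraticTimeMMD stands in for any concrete test:

// Hypothetical usage following the flow above.
CTestData* data = new CTestData();
data->set_num_distributions(2);
data->register_samples(0, features_p); // samples from p
data->register_samples(1, features_q); // samples from q

// any CHypothesisTest subclass would then consume the data component
CQuadraticTimeMMD* test = new CQuadraticTimeMMD();
test->set_test_data(data);
float64_t statistic = test->compute_statistic();
float64_t p_value = test->compute_p_value(statistic);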

Also, to quote a suggestion from @karlnapf: "the code to compute all these statistics is the same (all current tests in some way stream through the data and compute statistics, and it is really just the way the statistic is computed that changes from, say, HSIC to MMD), so it would be great to generalise this too - so that the difference is really just plugging in different statistic computation methods."

karlnapf commented 10 years ago

I think we need to distinguish between methods in the hypothesis test base class that stream data the way it's done for all tests (with certain ways to permute things), and the methods that are called to compute things on those blocks. It would be great if the difference between an independence and a two-sample test were just a few lines of code in an abstract method implementation, with the overall logic (as it is mathematically the same) staying in a base class.
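One way to realise that split, as a sketch: keep the block-streaming and averaging loop in the base class and leave only the per-block statistic as an abstract hook. CBlockwiseTest and the simplified signature are illustrative, not Shogun's actual classes, though compute_blockwise_statistic is the name used earlier in this thread:

// Sketch of the template-method idea (typedefs simplified from Shogun's
// common types): the streaming/averaging logic lives in the base class;
// only the per-block statistic is abstract, so MMD vs. HSIC differ in
// exactly one method.
#include <cstdint>
typedef int32_t index_t;
typedef double float64_t;

class CBlockwiseTest
{
public:
    virtual ~CBlockwiseTest() {}

    // common logic: stream through the data block by block (permutation is
    // handled by the data component) and average the per-block results
    float64_t compute_statistic(index_t num_blocks)
    {
        float64_t running_sum = 0.0;
        for (index_t i = 0; i < num_blocks; ++i)
            running_sum += compute_blockwise_statistic(i);
        return running_sum / num_blocks;
    }

protected:
    // the few lines that actually differ between test types
    virtual float64_t compute_blockwise_statistic(index_t block_idx) = 0;
};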

lambday commented 10 years ago

@karlnapf just wondering, can we put these under the computation framework? We'd send different blocks (or better, one whole burst) as different computation jobs to the computation engine, and the job result aggregator would compute the running average. It would then be ready to work on a cluster.

karlnapf commented 10 years ago

Yes, absolutely. If you have time for this, please do. All the blocks are just sent off to the engine. We do, though, need a way to do such things without adding tons of new classes for every job.

A good thing to have would be a callback-function job or something. Ideas on this? Maybe just experiment a bit here. That's all not a priority currently, though; we first have to make things work. I don't think it's a lot of hassle to work against the computation engine framework. Maybe open an issue for this, so that we don't forget? :)

lambday commented 10 years ago

Reading the Modern C++ Design book now.. lots of nice ideas here.. Let's try to finalize the design ASAP.. I'll try to come up with a better design soon.. As soon as we fix all the bits and pieces of the design I'll work on it.. Coding won't take long, I think.. Maybe we'd want to modify the job framework as well.. So for now maybe it's better that we go for something that can readily be translated into the independent job framework.. Then when the backends are ready we put things under that.. (Thanks for the issues you already created)...

Will get back with a class diagram soon.

karlnapf commented 10 years ago

I absolutely agree on this. Let's do proper design thinking before we start hacking things. In fact, I am about to start a thread on internal clean-ups that we can do. If you come across things in your C++ design book, please share them there. We want to make Shogun's internals more modern.

karlnapf commented 10 years ago

The hack will be a great opportunity to talk (and start pushing) such things.

lambday commented 10 years ago

@karlnapf a few thoughts regarding this: we can rely on a policy-based design pattern for the job.

Firstly: as we already saw while designing the linalg library, I just re-established with this example that this approach is indeed significantly faster than relying on pure virtual interfaces the way Java does (both policy.cpp and virtual.cpp were compiled with g++ -O3; policy.log and virtual.log show their corresponding runtimes). In fact, I think we should aim to exploit every bit of the magical power of C++ to gain an additional edge over the counterparts. We can easily keep these as Shogun internals and expose their functionality to the modular API with suitable interfaces. The idea I am trying to push here is to keep the main time-consuming components separate in the internals and use the CSGObject subclasses mostly as wrappers that control the logical flow using those internal classes/methods. This will lead to a more modular structure for the codebase, I think.
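For reference, a toy reconstruction of the kind of comparison meant here (not the actual policy.cpp/virtual.cpp from the linked logs): the policy version resolves the computation at compile time, so it can be inlined, while the virtual version goes through the vtable on every call:

#include <cstdio>

// Policy-based version: the computation is a template parameter, resolved
// at compile time, so the call can be inlined away.
struct SquarePolicy
{
    static double compute(double x) { return x * x; }
};

template <class ComputePolicy>
struct PolicyHost
{
    double run(double x) const { return ComputePolicy::compute(x); }
};

// Virtual version: the computation is dispatched through the vtable at
// runtime, which blocks inlining across the call.
struct ComputeBase
{
    virtual ~ComputeBase() {}
    virtual double compute(double x) const = 0;
};

struct SquareVirtual : ComputeBase
{
    double compute(double x) const { return x * x; }
};

int main()
{
    PolicyHost<SquarePolicy> host;
    SquareVirtual virt;
    ComputeBase* base = &virt;
    std::printf("%f %f\n", host.run(3.0), base->compute(3.0));
    return 0;
}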

Secondly: talking about this problem in particular, I think it fits nicely into the policy-based design pattern, because what we want to achieve here is to combine a number of different policies for data fetching and statistic computation, instead of relying on an inflexible hierarchical structure. For example, in the previous design I proposed, there is no reason for CStreamingTestData to be a subclass of CTestData except to override the methods. We can handle this differently.

Another very interesting point is that we can readily make big tests available not just for real numerical data types but (since we may soon move into them) for other feature types as well - string, sparse, graphs, signals - without much effort!

So here is how I am thinking it can work (see the hypothesis_test class diagram attached above). Although the image is not quite perfect, the idea is to keep the components on the right-hand side as internals and use the left-hand-side classes as wrappers under CSGObject for the modular interface.

The three main internal components here are the DataManager, the KernelManager and the computation policies.

The CSGObject subclasses for the hypothesis tests simply connect these dots. Test-specific things (like gamma approximations etc.) can also be added here on a case-by-case basis.

I was writing the POC code for the TestDataManager in this repo, which currently provides all the features I mentioned with a really tightly packed source. The usage is really clean, and it reads the same code-wise for all types of purposes. I included some tests here for various use-cases (makefile included :P).

A few other points I can think of that we should resolve (before I forget):

@lisitsyn I need your comments on this as well - especially, we need to discuss whether this design can fit into a plugin architecture, which part goes where, etc. I apologize in advance for such a long post and the messed-up state of the example src.

If all these pieces fit together, then maybe we can restructure the kernels as well (gotta get the sparse kernel working).

Please share your thoughts and suggestions on this.

karlnapf commented 10 years ago

Great, great suggestions. Shall we discuss at the Stammtisch tonight?

lambday commented 10 years ago

@karlnapf absolutely! Please let me know what time would be suitable for the Stammtisch. @lisitsyn could you please try to make it as well?

karlnapf commented 10 years ago

I was in a bit later yesterday (sorry) but could not catch you then. Shall we meet some other day? I'll try to hang out a bit later this afternoon.

lambday commented 10 years ago

@karlnapf absolutely. I also fell asleep earlier than usual yesterday. Let me know when you're free :)

karlnapf commented 10 years ago

I will try to hang out in IRC a bit this week, so then we can talk. The Stammtisch on Monday should happen; I will send an email.

lambday commented 10 years ago

@karlnapf my apologies for the long delay. Just came back from a trip home (festive season in India).

While working with the KernelManager I discovered some issues with my previous plan. Initially I thought that we could use a templated kernel manager class

template <class Kernel> class KernelManager

But to use this in a non-template hypothesis test base class, we'd have to commit to one specialized kernel type, which is restrictive. Also, for an independence test we may use two different types of kernels. A similar argument applies to the DataManager as well - for example, in an independence test we can use real dense features for one side (samples from one distribution) and string features for the other (samples from the other distribution). A DataManager templated on the feature type would therefore rule out such use-cases.
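To illustrate the restriction, a contrived sketch (the class bodies are stubs; CGaussianKernel stands in for any concrete kernel): a non-template base class can only hold a KernelManager<Kernel> member by fixing one concrete Kernel, which rules out mixing kernel types within one test:

// Sketch of why a templated manager doesn't fit a non-template base class.
template <class Kernel>
class KernelManager { /* ... */ };

class CGaussianKernel { /* ... */ };

class CHypothesisTest // non-template base class
{
    // forced to pick one concrete kernel type here; an independence test
    // that needs two different kernel types cannot be expressed this way
    KernelManager<CGaussianKernel> m_kernel_manager;
};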

A few ways out I could think of: (1) make the wrappers' setters templated on the actual feature/kernel type, or (2) keep the wrappers non-templated and template only the internal managers' methods, instantiated with the static type at the call site.

I'm in favor of (2), because making our wrappers' setters templated would require passing things by their actual type and not via base pointers. Then we could not use them inside a polymorphic method (say, inside virtual CFeatures* CFeatureSelection::apply(CFeatures* feats), the type information for the actual feature type is lost - there is no way we can know it at compile time).

With all this in mind, I thought of the following structure: on one side we have the CSGObject wrappers, and on the other side there are the internal DataManager, KernelManager and computation policies. These two layers communicate with each other via a TestType - a test type is a sort of meta-info for choosing the appropriate policies for a particular test. The wrappers would just have a concrete test type, and the internals would be instantiated with that. So the basic structure would look something like

// the concrete TestType selects the data-fetching/permutation policies
template <typename TestType>
class CHypothesisTest : public CSGObject {
    typedef TestType test_type;
    ....
    DataManager<test_type> data_manager;
};

class CTwoSampleTest : public CHypothesisTest<TwoSampleTest> {
    typedef TwoSampleTest test_type;
    ....
};

class CStreamingTwoSampleTest : public CHypothesisTest<StreamingTwoSampleTest> {
    typedef StreamingTwoSampleTest test_type;
    ....
};

// kernel tests additionally carry a KernelManager, driven by the same test type
class CKernelTwoSampleTest : public CTwoSampleTest {
    typedef CTwoSampleTest::test_type test_type;
    ....
    KernelManager<test_type> kernel_manager;
};

// and so on...

This way the internal details are perfectly hidden from the wrappers, and the rest is left up to the DataManager and the KernelManager. In the updated implementation of the DataManager I have

template <typename TestType>
struct DataManager {
    // push_back is templated on the static feature type, which is what
    // drives the policy selection (see below)
    template <typename Features>
    void push_back(Features* feats);
    ....
    std::vector<CFeatures*> samples;
};

The idea is: based on the feature type, we'd set the fetcher policy and the permutation policy, relying on the fact that push_back is called from CHypothesisTest::set_p() or CHypothesisTest::set_q() with the actual type. We'd therefore have a heterogeneous container for the features, with their corresponding fetching and permutation policies all set and ready to be used. The fetching policies can fetch all the data at once (non-streaming) or block-wise (streaming); which policy to use is decided entirely by the TestType - TwoSampleTest or StreamingTwoSampleTest (and likewise for the independence tests).
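A hedged sketch of how push_back could bind the policy from the static type it is called with; the FetchPolicy name and the type-erased storage are hypothetical, and the actual POC may differ:

#include <vector>

// Minimal stand-ins so the sketch is self-contained; in Shogun these are
// the real feature classes.
class CFeatures { public: virtual ~CFeatures() {} };
class CDenseFeatures : public CFeatures {};

// Hypothetical per-feature-type fetch policy; specializations would differ
// for dense, streaming, string, sparse, ... features.
template <class Features>
struct FetchPolicy
{
    static CFeatures* fetch(CFeatures* feats) { return feats; }
};

template <typename TestType>
struct DataManager
{
    // called from set_p()/set_q() with the *static* type, so the matching
    // policy can be bound here and stored type-erased next to the sample
    template <typename Features>
    void push_back(Features* feats)
    {
        samples.push_back(feats);
        fetchers.push_back(&FetchPolicy<Features>::fetch);
    }

    std::vector<CFeatures*> samples;
    std::vector<CFeatures* (*)(CFeatures*)> fetchers; // per-sample policy
};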

A couple of other ideas -

I haven't chalked out the kernel manager yet, but I suppose it can work similarly to the data manager. There are some linking issues with the current POC code that I'll try to get rid of before our next meeting. Again, sorry for the long post - just noting down the things related to the design decisions so that I don't forget.

Hope to talk to you soon!

karlnapf commented 10 years ago

Hi @lambday, sorry for the delayed response. I was on holiday :) I will read this shortly and try to give some feedback. We still have to finish this paper!

lambday commented 10 years ago

@karlnapf Hi, hope you had a nice holiday. I've updated the POC code with a working example here.

Some more testing is needed. Before our next meeting I will try to code up the KernelManager - it will be good for the discussion if we have a working example ready. Would next Monday be fine, then?

karlnapf commented 10 years ago

Hi, I just cloned this and tried to compile, but I think you might have forgotten to commit the flash/statistics/StreamingIndependenceTest.h file.

lambday commented 10 years ago

@karlnapf just updated. Please check.

lambday commented 10 years ago

I think the benefit of the DataManager interface is mainly the ability to mix and match different feature classes and types without having the wrapper worry about it. I'll try to do something for the string feature class; all it requires is knowing how to permute the features. The same goes for sparse features.
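A sketch of a feature-class-agnostic permutation policy under that assumption: instead of moving the data (which differs for dense, string, sparse, ...), permute an index vector and view the features through it, e.g. via a subset/view mechanism. The class and method names here are illustrative:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical permutation policy: shuffle *which* vectors are looked at,
// never the underlying storage, so it works for any feature class.
struct PermutationPolicy
{
    static std::vector<int> shuffled_indices(int num_vectors, std::mt19937& prng)
    {
        std::vector<int> inds(num_vectors);
        std::iota(inds.begin(), inds.end(), 0);   // 0, 1, ..., num_vectors-1
        std::shuffle(inds.begin(), inds.end(), prng);
        return inds;                              // apply as a subset/view
    }
};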

karlnapf commented 10 years ago

yep, I agree, that is nice. Checking now.

BTW, if this turns out to be nice, I suggest we approach the CFeatures base class in the same way. It seems way cleaner than the mess we currently have. Let me know your initial thoughts (maybe via mail, not in this thread - or another GitHub issue, or even the wiki).

lambday commented 10 years ago

@karlnapf yeah, that would be good. Also, implementing the computation part as policies can help us refactor kernels in general - plug in different policies for computing a kernel on different features, and we have a dot kernel ready for dense, sparse and other feature classes. So, even while we rely on a hierarchical class structure for our wrappers (for exposure to the modular interface), we can exploit the similarities between the branches. I'll prepare a draft of these ideas.
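In the same spirit, a sketch of what pluggable kernel-computation policies could look like (the names are hypothetical): the feature-specific part is the dot product, while the kernel formula on top stays the same across feature classes:

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense dot-product policy; a sparse or string policy would
// expose the same static interface.
struct DenseDotPolicy
{
    static double dot(const std::vector<double>& a, const std::vector<double>& b)
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            sum += a[i] * b[i];
        return sum;
    }
};

template <class DotPolicy>
struct GaussianKernel
{
    double width;
    double operator()(const std::vector<double>& a, const std::vector<double>& b) const
    {
        // k(a, b) = exp(-||a - b||^2 / width), with the squared distance
        // expanded into dot products so only DotPolicy varies per feature class
        double sq_dist = DotPolicy::dot(a, a) - 2 * DotPolicy::dot(a, b)
            + DotPolicy::dot(b, b);
        return std::exp(-sq_dist / width);
    }
};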

lambday commented 8 years ago

Done in feature/bigtest. Closing.