Refactoring statistical hypothesis testing framework

lambday commented 10 years ago

This follows from a discussion with @karlnapf and @sejdino. Currently kernel two-sample test and kernel independence test work under two different branches in the class tree with many things in common. In addition, in near future we'd want to perform independence test with streaming data as well which is only possible presently with two-sample test (CStreamingMMD). In order to fit everything nicely and reuse the components that are already there, we think that its better to separate the data-fetching components with the computation-components.

Here are some initial thoughts on this (needs further discussions):

CTestData base class (not abstract)

This class will be responsible for providing interface for fetching data, either as a whole or blockwise, either merged or unmerged. So it will provide all four methods :

get_samples() : returns a CList of feature objects of all num_samples for all the features
get_merged_samples() : returns a CFeatures instance of merged features (as a whole) created by CFeatures::create_merged_copy()
get_blocks(index_t num_blocks_current_burst) : returns a CList of feature objects fetched num_blocks times blockwise (maybe for dense features, we can use linalg::blocks which just creates a wrapper for the data without having to allocate separate memory for them)
get_merged_blocks(index_t num_blocks_current_burst) : returns a merged CFeatures instance for num_blocks from all the underlying distributions.

This class will also be able to handle more than two distributions (since we may soon move into three variable interaction or higher, this will make it flexible). So we'll maintain a m_num_distributions inside CTestData for that which can be set by the user. Then we'll register the samples from each distribution using a register_samples(index_t idx, CFeatures* samples) method which internally keeps them in an array of size m_num_distribution, keeping track of num_samples for each via CFeatures->get_num_vectors() interface, idx being the index in that array (for example if we have two distributions p and q, p will be set at arr[0] and so on). For the case where we want to process these samples blockwise, a set_blocksize(index_ idx) method will be available for the user to set the blocksize per distribution. So B_x B_y things are taken care of here.

Also, we can add the shuffling code in CTestData itself. Computation components shouldn't bother about data at all. Inside CTestData we can have methods to shuffle the samples from just one distribution or merge both, shuffle and redistribute in the same proportion for sampling null.

StreamingTestData will just be a subclass of this class which uses streaming features to provide the same interface to the computation components.

Now, for the computation hierarchy, we will have one CTestData* m_test_data as a component and have a bunch of compute methods. Base class is still CHypothesisTest which will still provide interface for

compute_statistic()
compute_blockwise_statistic() [maybe put this somewhere down the hierarchy]
compute_variance()
compute_p_value()
compute_threshold()
perform_test()
sample_null()

The rest of the hierarchy can be same. In the subclasses of CIndependenceTest we'll use m_test_data->get_samples()/m_test_data->get_blocks(index_t num_blocks_current_burst) inside the compute_statistic()/compute_blockwise_statistic() methods and for the subclasses of CTwoSampleTest we'll use m_test_data->get_merged_samples()/m_test_data->get_merged_blocks(index_t num_blocks_current_burst) accordingly.

The way it will work is: First the user will create an instance of CTestData (for CDenseFeatures) or CStreamingTestData (for CStreamingFeatures). Then he'll feed the test data to some CHypothesisTest instance and use the rest just as before. This will work for independence test/conditional-independence test, two-sample test/three-var interaction test. This will also work for different types of features as required by the independence test framework. Also, if someone wants to perform a B-test for non-streaming samples he can (may not be useful though but we don't have to create an instance of CStreamingFeatures from CFeatures each time when someone wants to do that).

Also, (thanks to @karlnapf for the suggestion) since "the code to compute all these statistics is the same (all current tests in some way stream through the data and compute statistics, and it is really just the way the statistic is computed that changes from say hsic to mmd) so it would be great to generalise this too. So that the difference is really just plugging in different statistic computation methods."

karlnapf commented 10 years ago

I think we need to distinguish between methods in the hypothesis test base class that streams data in the way its done for all tests (with certain ways to permute things), and then the methods that are called to compute things on those blocks. It would be great if the difference between an independence and a two-sample test is just a few lines of code in an abstract method implementation and the overall logic (as mathematically the same) stays in a base class.

lambday commented 10 years ago

@karlnapf just wondering, can we put these under the computation framework? We send different blocks (or better, one whole burst) as different computation jobs to the computation engine and job result aggregator computes the running average? will be ready to work on cluster then.

karlnapf commented 10 years ago

Yes, absolutely. If you have time for this, please do. All the blocks just are sent off to the engine. We though need a way to do such things without adding tons of new classes for every job.

A good thing to have would be a callback function job or something. Ideas on this? But maybe just experiment a bit on this. But thats all not priority currently, we first have to make things work. I dont think its a lot of hassle to work against computation engine framework. Maybe open an issue for this? so that we dont forget :)

lambday commented 10 years ago

Reading modern c++ design book now.. Lots of nice ideas here.. Let's try to finalize the design ASAP.. I'll try to come up with a better design soon.. As soon as we fix all the bits and pieces in the design I'll work on it.. Coding won't take long I think.. Maybe we'd want to modify the job framework as well.. So for now maybe its better that we go for something that can readily be translated into independent job framework.. Then when the back ends are ready we put things under that .. (Thanks yo the issues you already created )...

Will get back with a class diagram soon... On Aug 23, 2014 9:57 PM, "Heiko Strathmann" notifications@github.com wrote:

Yes, absolutely. If you have time for this, please do. All the blocks just are sent off to the engine. We though need a way to do such things without adding tons of new classes for every job.

A good thing to have would be a callback function job or something. Ideas on this? But maybe just experiment a bit on this. But thats all not priority currently, we first have to make things work. I dont think its a lot of hassle to work against computation engine framework. Maybe open an issue for this? so that we dont forget :)

— Reply to this email directly or view it on GitHub https://github.com/shogun-toolbox/shogun/issues/2495#issuecomment-53157753 .

karlnapf commented 10 years ago

I absolutely agree on this. Let's do proper design thinking before we start hacking things. In fact, I am about to start a thread on internal clean-ups that we can do. If you come across things in your c++ design book, please share them there. We want to make Shogun's internals more modern.

karlnapf commented 10 years ago

The hack will be a great opportunity to talk (and start pushing) such things.

lambday commented 10 years ago

@karlnapf a few thoughts regarding this: we can rely on a policy-based design pattern for the job.

Firstly: as we already saw while designing linalg library, I just re-established with this example that this approach is indeed significantly faster than relying on pure virtual interfaces like java does (both policy.cpp and virtual.cpp compiled with g++ -O3, policy.log and virtual.log shows their corresponding rutime). In fact, I think we should aim to exploit every bit of the magical power of C++ to gain any additional edge over other counterparts. We can easily keep these as shogun internals and expose their functionality to the modular API with suitable interfaces. The idea that I am trying to push here is to keep the main time consuming components separate in the internals and use the CSGObject subclasses mostly as wrappers for controlling the logical flow using those internal classes/methods. This will lead to a more modular structure for the codebase I think.

Secondly: talking about this problem particularly, I think it intuitively fits nice in the policy design pattern because what we want to achieve here is to punch a number of different policies for data fetching and computing the statistic combinatorially instead of trying to rely on inefficient hierarchical structure. For example, in the previous design I proposed, there is no reason that CStreamingTestData should be a subclass of CTestData except for the fact than to override the methods. We can handle this differently.

Another very interesting point is that we can readily make big tests available not just for real numerical data types but for (since we can soon move into) other feature types as well - string, sparse, graphs, signals without much effort!

So here is how I am thinking it can work: hypothesis_test Although this image is quite not perfect, but the idea is - to keep the components on the right hand side as internals and use the left hand side classes for wrappers under CSGObject for modular interface.

The main three internal components here are

TestDataManager: Responsible for fetching data - for any number of distributions, for any feature types, either all of the samples or blockwise (using same burst concept) with all sort of permutations handled inside (shuffling samples from one distribution while keeping samples from others same - for independence test or merging and shuffling both samples). Its just one class with compile-time dependency on feature type, fetching policy and permutation policy and it provides a uniform get_samples interface which returns the appropriate thing - merged Features object (for two-sample test) or a list of Features (for independence test) ...[or pretty much anything we'd like in future, it would just require to write an appropriate fetching policy and permutation policy for that without having to touch this class]
TestKernelManager: Kernel computation logic goes here - whether it is precomputed kernel or not, whether multiple kernels are provided or not - all handled here. This can simply use the TestDataManager to get the samples and compute kernel. It would provide get_kernels method via which other parts of the framework would use this component. Kernel computation logic should be kept separate from the computation logic. Similar to TestDataManager it should work with any type of kernel (dense, sparse, string).
Computation Components: These are the core of statistic computation. It just asks the kernel manager for the kernel and returns the statistic computed. For MMD specially this would be useful - we can just provide one ComputeQuadraticTimeMMD and use that in other (which solves the issue of having to copy the code in other MMD classes since it was costly before because it required creating an entire instance of a heavyweight class inside other MMD classes - plus it kinda messed up the neat hierarchy).

The CSGObject subclasses for HypothesisTest - simply connect these dots. Also test specific things (like gamma approximations etc) can be added on case-to-case basis here.

I was writing the POC code for TestDataManager in this repo which currently provides all the features that I mentioned with a really tight packed source. The usage of this is really clean and it reads the same code-wise for all types of purpose. I included some tests here for various use-cases (makefile included :P).

Few other points I can think of that we should resolve (before I forget):

For sampling null, we currently rely on a virtual call comupute_statistic. Inside a loop vtable lookup is slow. Using proposed structure hopefully we won't need that.
Presently in streaming mmd tests, in compute_statistic_and_variance method the kernel is computed twice for each burst - one for computing the statistic and another for computing variance estimate after shuffling the feature. We can save a lot of time by just precomputing the kernel and relying on index permutation. In fact I think get_kernel method of TestKernelManager should always return a precomputed kernel - in case the kernel is too big it suggests that we should better rely on doing thing block-wise. In that case, we'd just be precomputing the kernel for a block. Plus, for sampling null it would be a plus always. What do you think about this?

@lisitsyn I need your comments as well on this - specially we need to discuss if this design can fit into a plugin architecture - which part goes where etc. I apologize in advance for such a long post and the messed up status of the example src.

If all these pieces fit together then maybe we can restructure kernels as well (gotta get sparse kernel working).

Please share your thoughts and suggestions on this.

karlnapf commented 10 years ago

Great great suggestions. I suggest we discuss at the Stammtisch tonight?

lambday commented 10 years ago

@karlnapf absolutely! Please let me know what time would be suitable for the Stammtisch. @lisitsyn could you also please try to make it as well?

karlnapf commented 10 years ago

I was in a bit later yesterday (sorry) but could not catch you then. Shall we meet some other day? Ill try to hang out a bit later this afternoon

lambday commented 10 years ago

@karlnapf absolutely. I also slept off earlier than usual yesterday. Let me know when you're free :)

karlnapf commented 10 years ago

I will try to hang our in IRC a bit this week, so then we can talk. The Stammtisch on Monday should happen, I will send an email

lambday commented 10 years ago

@karlnapf my apologies for the long delay. Just came back from a trip home (festive season in India).

While working with KernelManager I discovered some issues with my previous plan. Initially I thought that we could have used a templated kernel manager class

template <class Kernel> class KernelManager

But to use this in a non-template hypothesis test base class we'd have to use a specialized kernel which is restrictive. Also for independence test we may use two different type of kernels. Similar argument can be given for DataManager as well - for example, we can use real dense features for one (samples from one distribution) and string features for another (from another distribution) in an independence test. Using a DataManager templated on Features would therefore restrict such use-cases.

A few way outs I could have thought of -

make hypothesis tests wrapper classes (subclasses of CSGObject) templated : but this would bloat up the binary size as we'd have to instantiate for each kernel-type and feature-type - so not an option.
the other idea is to choose the type-specific tasks at the time of setting up the features and kernel. This could be done by (1) making CHypothesisTestSubclass::set_*() (features or kernels) methods templated or (2) leave it up to the DataManager and KernelManager to handle this.

I'm in favor of (2) because making our wrapper's setters templated would require that we have to send it by the actual type and not via base pointers. Then we cannot use this to instantiate it inside a polymorphic method (say, inside virtual CFeatures* CFeatureSelection::apply(CFeatures* feats) the type information for the actual feature type is lost - no way in compile time we can know).

Having all these in mind, I thought of the following structure - In one side we have CSGObject wrappers and on the other side there are internal DataManager, KernelManager and Computation policies. These two layers will communicate with each other via TestType - a test type is sort of a meta-info for choosing appropriate policies for a particular test. The wrappers would just have a concrete test type and the internals would be instantiated with that. So the basic structure would look something like

template <typename TestType>
class CHypothesisTest : public CSGObject {
    typedef TestType test_type;
    ....
    DataManager<test_type> data_manager;
};

class CTwoSampleTest : public CHypothesisTest<TwoSampleTest> {
    typedef TwoSampleTest test_type;
    ....
};

class CStreamingTwoSampleTest : public CHypothesisTest<StreamingTwoSampleTest> {
    typedef StreamingTwoSampleTest test_type;
    ....
};

class CKernelTwoSampleTest : public CTwoSampleTest {
    typedef CTwoSampleTest::test_type test_type;
    ....
    KernelManager<test_type> kernel_manager;
};

// and so on...

This way the internal details are perfectly hidden from the wrappers and rest is left up to the DataManager and KernelManager. In the updated implementation for the DataManager I have

template <typename TestType>
struct DataManager {
    template <typename Features>
    void push_back(Features* feats);
    ....
    vector<CFeatures*> samples;
};

The idea is - based on feature-type, we'd set fetcher policy and permutation policy and for that we'd rely on the fact that push_back is called from CHypothesisTest::set_p() or CHypothesisTest::set_q() with the actual type. Therefore we'd have a heterogeneous container for features as well as their corresponding fetching+permutation policy all set and ready to be used. Fetching policies can be used to fetch all the data (non-streaming) or block-wise (streaming) - which policy to use is all decided by the TestType - TwoSampleTest or StreamingTwoSampleTest (same for Independence...).

Couple of other ideas -

It can also be possible to remove dependencies on CFeatures completely for hypothesis test by using a container of Any instead that @lisitsyn is using in aer. This will allow us to specify some other feature/kernel type and write a few policies and we'll have hypothesis test ready for that type - e.g. if someone uses Eigen::SparseMatrix for sparse feature class he'd be able to make the whole thing work with minimal effort (no change in existing code required - just addition).
In the setter methods I need to down cast to the actual feature type from CFeatures* type. Using the enums for features and kernel its easy but there can be a better way where I can use some de-mangling stuffs for that, so that things don't break even when some feature type doesn't have those enums. What I am looking for is some sort of REGISTER_FEATURE or REGISTER_KERNEL facility which stores the valid names inside some static string array and while casting we rely on the object's de-mangled names. @lisitsyn I need your input on this.

I haven't chalked up the kernel manager yet but I suppose it can work similarly as of data manager. There are some linking issues with the current poc code that I'll try to get rid of before our next meeting. Again sorry for the long post - just noting down the things related to the design decisions so that I don't forget.

Hope to talk to you soon!

karlnapf commented 10 years ago

Hi @lambday sorry for the delayed response. I was on holiday :) I will read this shortly and try to give some feedback. We still have to finish this paper!

lambday commented 10 years ago

@karlnapf Hi, hope you had a nice holiday. I updated a working example on the POC code here.

Run make inside src/
add src/ to LD_LIBRARY_PATH (for libflash.so)
Run make in root dir.

Some more testing needed. Before our next meeting I will try to code up KernelManager - will be good for the discussion if we have a working example ready. Should next Monday be fine then?

karlnapf commented 10 years ago

Hi, just cloned this and tried to compile but I think you might have forgot to commit the flash/statistics/StreamingIndependenceTest.h file

lambday commented 10 years ago

@karlnapf just updated. Please check.

lambday commented 10 years ago

I think the benefit of using DataManager interface is mainly the ability to mix and match different feature class and types without having the wrapper worry about it. I'll try to do something for string feature class. All it requires is to know how to permute the features. Same goes for sparse.

karlnapf commented 10 years ago

yep, I agree, that is nice. Checking now.

BTW, if this turns out to be nice, I suggest we approach the CFeatures base class in that way. Seems way cleaner than the mess we currently have. Let me know your initial thoughts (maybe via mail, not in this thread, or another github issue, or even wiki)

lambday commented 10 years ago

@karlnapf yeah that would be good. Also, putting the computation part as policies can also help us refactoring kernel in general - plugin different policies for computing a kernel for different features and we have dot kernel ready for dense, sparse and others. So, even when we rely on a hierarchical class structure for our wrappers (for exposing to modular interface) we can exploit the similarities between the branches. I'll prepare a draft for these ideas.

lambday commented 8 years ago

Done in feature/bigtest. Closing.

shogun-toolbox / shogun

Refactoring statistical hypothesis testing framework #2495