I think we need to distinguish between the methods in the hypothesis test base class that stream data the way it's done for all tests (with certain ways to permute things), and the methods that are called to compute things on those blocks. It would be great if the difference between an independence test and a two-sample test were just a few lines of code in an abstract method implementation, while the overall logic (mathematically the same) stays in a base class.
@karlnapf just wondering: can we put these under the computation framework? We'd send different blocks (or better, one whole burst) as different computation jobs to the computation engine, and the job result aggregator would compute the running average. It would then be ready to work on a cluster.
Yes, absolutely. If you have time for this, please do. All the blocks are just sent off to the engine. We do, though, need a way to do such things without adding tons of new classes for every job.
A good thing to have would be a callback-function job or something. Ideas on this? Maybe just experiment a bit with it. But that's all not a priority currently; we first have to make things work. I don't think it's a lot of hassle to work against the computation engine framework. Maybe open an issue for this, so that we don't forget? :)
Reading the Modern C++ Design book now. Lots of nice ideas in there. Let's try to finalize the design ASAP; I'll try to come up with a better design soon. As soon as we fix all the bits and pieces in the design I'll work on it - coding won't take long, I think. Maybe we'd want to modify the job framework as well, so for now it's probably better to go for something that can readily be translated into the independent job framework. Then, when the back ends are ready, we put things under that. (Thanks for the issues you already created.)
Will get back with a class diagram soon.
I absolutely agree on this. Let's do proper design thinking before we start hacking things. In fact, I am about to start a thread on internal clean-ups that we can do. If you come across things in your C++ design book, please share them there. We want to make Shogun's internals more modern.
The hack will be a great opportunity to talk (and start pushing) such things.
@karlnapf a few thoughts regarding this: we can rely on a policy-based design pattern for the job.
Firstly: as we already saw while designing the linalg library, I just re-established with this example that this approach is indeed significantly faster than relying on pure virtual interfaces the way Java does (policy.cpp and virtual.cpp, both compiled with g++ -O3; policy.log and virtual.log show their corresponding runtimes). In fact, I think we should aim to exploit every bit of the magical power of C++ to gain any additional edge over other counterparts. We can easily keep these as Shogun internals and expose their functionality to the modular API with suitable interfaces. The idea I am trying to push here is to keep the main time-consuming components separate in the internals and use the CSGObject subclasses mostly as wrappers that control the logical flow via those internal classes/methods. This will lead to a more modular structure for the codebase, I think.
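For reference, a minimal self-contained sketch of the two dispatch styles being compared (toy names, not the actual policy.cpp/virtual.cpp):

```cpp
#include <cstddef>

// Policy-based: the computation is a template parameter, so the call in
// run_policy() is resolved at compile time and can be inlined.
struct SumPolicy
{
    double operator()(const double* data, std::size_t n) const
    {
        double sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }
};

template <class ComputePolicy>
double run_policy(const double* data, std::size_t n)
{
    return ComputePolicy()(data, n); // static dispatch
}

// Virtual-interface style: every call pays a vtable lookup, which hurts
// inside tight loops.
struct Computation
{
    virtual ~Computation() {}
    virtual double compute(const double* data, std::size_t n) const = 0;
};

struct VirtualSum : Computation
{
    double compute(const double* data, std::size_t n) const override
    {
        double sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }
};
```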
Secondly: talking about this problem in particular, I think it fits nicely into the policy design pattern, because what we want here is to combine a number of different policies for data fetching and statistic computation combinatorially, instead of relying on an inefficient hierarchical structure. For example, in the previous design I proposed there was no reason for CStreamingTestData to be a subclass of CTestData other than to override its methods. We can handle this differently.
Another very interesting point is that we can readily make big tests available not just for real numerical data types but for other feature types as well - strings, sparse data, graphs, signals - without much effort (since we may soon move into those).
So here is how I am thinking it can work:
Although this image is not perfect, the idea is to keep the components on the right-hand side as internals and use the left-hand side classes as wrappers under CSGObject for the modular interface.
The three main internal components here are:
- TestDataManager: provides a get_samples interface which returns the appropriate thing - a merged Features object (for a two-sample test) or a list of Features (for an independence test), or pretty much anything we'd like in the future; it would just require writing an appropriate fetching policy and permutation policy, without having to touch this class.
- TestKernelManager: uses TestDataManager to get the samples and compute the kernel. It provides a get_kernels method via which other parts of the framework use this component. Kernel computation logic should be kept separate from the statistic computation logic. Similar to TestDataManager, it should work with any type of kernel (dense, sparse, string).
- Computation policies: e.g. we write ComputeQuadraticTimeMMD once and reuse it elsewhere (which solves the issue of having to copy code across the MMD classes - reuse was costly before because it required creating an entire instance of a heavyweight class inside the other MMD classes, and it messed up the neat hierarchy).

The CSGObject subclasses for HypothesisTest simply connect these dots. Test-specific things (like gamma approximations etc.) can be added here on a case-by-case basis.
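A rough self-contained sketch (all names hypothetical, stubs instead of real features/kernels) of how a wrapper would just connect those three internals:

```cpp
#include <vector>

// Stub data manager: hands out samples (here just numbers).
struct TestDataManagerStub
{
    std::vector<double> get_samples() const { return {1.0, 2.0, 3.0, 4.0}; }
};

// Stub kernel manager: computes kernel values from the samples it is given.
struct TestKernelManagerStub
{
    std::vector<double> get_kernels(const std::vector<double>& samples) const
    {
        std::vector<double> k;
        for (double x : samples)
            for (double y : samples)
                k.push_back(x * y); // linear kernel, purely for illustration
        return k;
    }
};

// Stub computation policy: only the statistic-specific math lives here.
struct ComputeMeanStub
{
    double operator()(const std::vector<double>& kernel_values) const
    {
        double sum = 0;
        for (double v : kernel_values)
            sum += v;
        return kernel_values.empty() ? 0.0 : sum / kernel_values.size();
    }
};

// The wrapper only connects the dots; swapping any policy changes the test.
template <class DataManager, class KernelManager, class ComputePolicy>
double compute_statistic()
{
    DataManager data;
    KernelManager kernels;
    return ComputePolicy()(kernels.get_kernels(data.get_samples()));
}
```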
I was writing POC code for TestDataManager in this repo; it currently provides all the features I mentioned in a really tightly packed source. Its usage is really clean, and the code reads the same for all types of use. I included some tests for various use-cases (makefile included :P).
A few other points I can think of that we should resolve (before I forget):
- Currently there are virtual calls to compute_statistic; inside a loop the vtable lookup is slow. With the proposed structure we hopefully won't need that.
- In the current compute_statistic_and_variance method the kernel is computed twice for each burst - once for computing the statistic and once for computing the variance estimate after shuffling the features. We can save a lot of time by precomputing the kernel and relying on index permutation (toy sketch at the end of this comment). In fact, I think the get_kernel method of TestKernelManager should always return a precomputed kernel - if the kernel is too big, that suggests we should rather work block-wise, in which case we'd just precompute the kernel for a block. Plus, it would always be a plus for sampling the null. What do you think about this?

@lisitsyn I need your comments on this as well - especially, we need to discuss whether this design can fit into a plugin architecture, which part goes where, etc. I apologize in advance for such a long post and the messed-up state of the example src.
If all these pieces fit together then maybe we can restructure kernels as well (gotta get sparse kernel working).
Please share your thoughts and suggestions on this.
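On the precompute-and-permute point above, a toy self-contained sketch (not the Shogun API): the kernel matrix is computed once, and each null sample only permutes indices into it instead of re-shuffling features and recomputing the kernel.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;

// Biased quadratic-time MMD^2 from a precomputed kernel matrix, where the
// assignment of samples to p and q is given purely by an index permutation.
double mmd2_biased(const Matrix& K, const std::vector<int>& perm, int nx)
{
    const int n = static_cast<int>(K.size());
    const int ny = n - nx;
    double kxx = 0, kyy = 0, kxy = 0;
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < nx; ++j)
            kxx += K[perm[i]][perm[j]];
    for (int i = nx; i < n; ++i)
        for (int j = nx; j < n; ++j)
            kyy += K[perm[i]][perm[j]];
    for (int i = 0; i < nx; ++i)
        for (int j = nx; j < n; ++j)
            kxy += K[perm[i]][perm[j]];
    return kxx / (double(nx) * nx) + kyy / (double(ny) * ny)
        - 2.0 * kxy / (double(nx) * ny);
}

// One null sample = one shuffled index vector; K is never recomputed.
double sample_null_once(const Matrix& K, int nx, std::mt19937& rng)
{
    std::vector<int> perm(K.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin(), perm.end(), rng);
    return mmd2_biased(K, perm, nx);
}
```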
Great, great suggestions. Shall we discuss at the Stammtisch tonight?
@karlnapf absolutely! Please let me know what time would be suitable for the Stammtisch. @lisitsyn could you please try to make it as well?
I was in a bit later yesterday (sorry) but could not catch you then. Shall we meet some other day? I'll try to hang out a bit later this afternoon.
@karlnapf absolutely. I also fell asleep earlier than usual yesterday. Let me know when you're free :)
I will try to hang out on IRC a bit this week, so we can talk then. The Stammtisch on Monday should happen; I will send an email.
@karlnapf my apologies for the long delay. Just came back from a trip home (festive season in India).
While working with KernelManager I discovered some issues with my previous plan. Initially I thought we could use a templated kernel manager class

```cpp
template <class Kernel> class KernelManager;
```

But to use this in a non-template hypothesis test base class, we'd have to commit to one specialized kernel, which is restrictive. Also, for an independence test we may use two different types of kernels. A similar argument applies to DataManager - for example, in an independence test we can use real dense features for one side (samples from one distribution) and string features for the other (samples from the other distribution). A DataManager templated on Features would therefore rule out such use-cases.
A few ways out I could think of:
- Making the wrappers (the CSGObject subclasses) templated: this would bloat the binary size, since we'd have to instantiate them for each kernel type and feature type - so not an option.
- Either (1) making the CHypothesisTestSubclass::set_*() methods (for features or kernels) templated, or (2) leaving it up to the DataManager and KernelManager to handle this.

I'm in favor of (2), because making our wrappers' setters templated would require passing the actual type rather than base pointers. We then could not call them inside a polymorphic method - say, inside virtual CFeatures* CFeatureSelection::apply(CFeatures* feats), the actual feature type is lost, and there is no way to recover it at compile time.
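A tiny self-contained illustration of the problem with option (1) (hypothetical, simplified types):

```cpp
struct CFeatures { virtual ~CFeatures() {} };
struct CDenseFeatures : CFeatures {};

struct CTwoSampleTestSketch
{
    template <class Features>
    void set_p(Features* /*feats*/)
    {
        // `Features` is whatever static type the caller holds.
    }
};

// A polymorphic call site, as in CFeatureSelection::apply(CFeatures*):
void somewhere(CFeatures* feats)
{
    CTwoSampleTestSketch test;
    // Instantiates set_p<CFeatures>; even if feats actually points to a
    // CDenseFeatures, that information is gone at compile time.
    test.set_p(feats);
}
```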
With all this in mind, I thought of the following structure. On one side we have the CSGObject wrappers, and on the other side the internal DataManager, KernelManager and Computation policies. The two layers communicate via a TestType - a test type is sort of a meta-info for choosing the appropriate policies for a particular test. The wrappers would just have a concrete test type, and the internals would be instantiated with that. So the basic structure would look something like:
```cpp
template <typename TestType>
class CHypothesisTest : public CSGObject {
    typedef TestType test_type;
    ....
    DataManager<test_type> data_manager;
};

class CTwoSampleTest : public CHypothesisTest<TwoSampleTest> {
    typedef TwoSampleTest test_type;
    ....
};

class CStreamingTwoSampleTest : public CHypothesisTest<StreamingTwoSampleTest> {
    typedef StreamingTwoSampleTest test_type;
    ....
};

class CKernelTwoSampleTest : public CTwoSampleTest {
    typedef CTwoSampleTest::test_type test_type;
    ....
    KernelManager<test_type> kernel_manager;
};

// and so on...
```
This way the internal details are completely hidden from the wrappers, and the rest is left up to the DataManager and KernelManager. In the updated implementation of the DataManager I have:
```cpp
template <typename TestType>
struct DataManager {
    template <typename Features>
    void push_back(Features* feats);
    ....
    vector<CFeatures*> samples;
};
```
The idea is that, based on the feature type, we set the fetcher policy and the permutation policy; for that we rely on the fact that push_back is called from CHypothesisTest::set_p() or CHypothesisTest::set_q() with the actual type. We therefore get a heterogeneous container of features, each with its corresponding fetching and permutation policy set and ready to be used. Fetching policies can fetch all the data at once (non-streaming) or block-wise (streaming); which policy to use is decided by the TestType - TwoSampleTest or StreamingTwoSampleTest (and likewise for the independence tests).
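A minimal self-contained sketch of that idea (hypothetical policy and type names; the real DataManager would also store permutation policies and honor the TestType):

```cpp
#include <vector>

struct CFeatures { virtual ~CFeatures() {} };
struct CDenseFeatures : CFeatures {};
struct CStringFeatures : CFeatures {};

// One fetch policy per concrete feature type; the real one would know how
// to fetch whole samples or blocks and how to permute them.
template <class Features>
struct FetchPolicy
{
    static CFeatures* fetch(CFeatures* feats) { return feats; }
};

struct DataManagerSketch
{
    // Called from set_p()/set_q() with the actual type, so the right
    // policy is bound at compile time while storage stays heterogeneous.
    template <class Features>
    void push_back(Features* feats)
    {
        samples.push_back(feats);
        fetchers.push_back(&FetchPolicy<Features>::fetch);
    }

    std::vector<CFeatures*> samples;
    std::vector<CFeatures* (*)(CFeatures*)> fetchers;
};
```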
A couple of other ideas:
- We can bypass CFeatures completely for hypothesis tests by using a container of Any instead (the one @lisitsyn is using in aer). This would let someone specify some other feature/kernel type, write a few policies, and have the hypothesis test ready for that type - e.g. if someone uses Eigen::SparseMatrix for a sparse feature class, they'd be able to make the whole thing work with minimal effort (no change to existing code required - just additions).
- We need a clean way to recover the actual type from a CFeatures* pointer. Using the enums for features and kernels it's easy, but there may be a better way using some de-mangling, so that things don't break even when a feature type doesn't have those enums. What I am looking for is some sort of REGISTER_FEATURE or REGISTER_KERNEL facility which stores the valid names in some static string array, and while casting we rely on the object's de-mangled name (rough sketch below). @lisitsyn I need your input on this.

I haven't chalked up the kernel manager yet, but I suppose it can work similarly to the data manager. There are some linking issues with the current POC code that I'll try to get rid of before our next meeting. Again, sorry for the long post - I'm just noting down the things related to the design decisions so that I don't forget.
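A very rough sketch of the REGISTER_FEATURE idea mentioned above (my assumption: keying the registry on typeid names; real code would demangle them, e.g. with abi::__cxa_demangle on gcc/clang):

```cpp
#include <string>
#include <typeinfo>
#include <unordered_set>

inline std::unordered_set<std::string>& feature_registry()
{
    static std::unordered_set<std::string> names;
    return names;
}

// Records the type's (mangled) name in a static registry at startup.
#define REGISTER_FEATURE(T) \
    static const bool registered_##T = \
        feature_registry().insert(typeid(T).name()).second;

struct CFeatures { virtual ~CFeatures() {} };
struct CDenseFeatures : CFeatures {};
REGISTER_FEATURE(CDenseFeatures)

// Check the dynamic type against the registry before casting; unknown
// feature types simply fail the check instead of breaking.
bool is_registered(const CFeatures& f)
{
    return feature_registry().count(typeid(f).name()) > 0;
}
```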
Hope to talk to you soon!
Hi @lambday sorry for the delayed response. I was on holiday :) I will read this shortly and try to give some feedback. We still have to finish this paper!
@karlnapf Hi, hope you had a nice holiday. I updated a working example in the POC code here. To try it: build inside src/ and add src/ to LD_LIBRARY_PATH (for libflash.so). Some more testing is needed. Before our next meeting I will try to code up the KernelManager - it will be good for the discussion if we have a working example ready. Would next Monday be fine then?
Hi, just cloned this and tried to compile it, but I think you might have forgotten to commit the flash/statistics/StreamingIndependenceTest.h file.
@karlnapf just updated. Please check.
I think the benefit of the DataManager interface is mainly the ability to mix and match different feature classes and types without the wrapper having to worry about it. I'll try to do something for the string feature class; all it requires is knowing how to permute the features. Same goes for sparse.
yep, I agree, that is nice. Checking now.
BTW, if this turns out nicely, I suggest we approach the CFeatures base class in the same way. It seems way cleaner than the mess we currently have. Let me know your initial thoughts (maybe via mail rather than this thread, or another GitHub issue, or even the wiki).
@karlnapf yeah, that would be good. Also, putting the computation part into policies can help us refactor kernels in general - plug in different policies for computing a kernel on different features, and we have a dot kernel ready for dense, sparse and other feature types. So even where we rely on a hierarchical class structure for our wrappers (for exposing things to the modular interface), we can exploit the similarities between the branches. I'll prepare a draft for these ideas.
Done in feature/bigtest. Closing.
This follows from a discussion with @karlnapf and @sejdino. Currently, kernel two-sample tests and kernel independence tests live in two different branches of the class tree, with many things in common. In addition, in the near future we'd want to perform independence tests with streaming data as well, which at present is only possible for two-sample tests (CStreamingMMD). In order to fit everything together nicely and reuse the components that are already there, we think it's better to separate the data-fetching components from the computation components.

Here are some initial thoughts on this (needs further discussion):
A CTestData base class (not abstract). This class will be responsible for providing the interface for fetching data, either as a whole or blockwise, either merged or unmerged. So it will provide all four methods:

- get_samples(): returns a CList of feature objects with all num_samples for all the features
- get_merged_samples(): returns a CFeatures instance of the merged features (as a whole), created by CFeatures::create_merged_copy()
- get_blocks(index_t num_blocks_current_burst): returns a CList of feature objects fetched num_blocks at a time blockwise (for dense features we can maybe use linalg::blocks, which just creates a wrapper over the data without allocating separate memory for it)
- get_merged_blocks(index_t num_blocks_current_burst): returns a merged CFeatures instance for num_blocks from all the underlying distributions

This class will also be able to handle more than two distributions (since we may soon move into three-variable interaction or higher, this keeps it flexible). So we'll maintain an m_num_distributions inside CTestData, which can be set by the user. We'll then register the samples from each distribution using a register_samples(index_t idx, CFeatures* samples) method, which internally keeps them in an array of size m_num_distributions, keeping track of num_samples for each via the CFeatures->get_num_vectors() interface, idx being the index in that array (for example, with two distributions p and q, p will be set at arr[0], and so on). For the case where we want to process the samples blockwise, a set_blocksize(index_t idx) method will be available for the user to set the blocksize per distribution. So the B_x, B_y things are taken care of here.

Also, we can add the shuffling code in CTestData itself; the computation components shouldn't bother about data at all. Inside CTestData we can have methods to shuffle the samples from just one distribution, or to merge all of them, shuffle, and redistribute in the same proportions for sampling the null. StreamingTestData will just be a subclass of this class which uses streaming features to provide the same interface to the computation components.
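A declaration-only sketch of the interface just described (index_t, CFeatures and CList stand in for Shogun's types; the blocksize parameter of set_blocksize and the names of the shuffling methods are my assumptions):

```cpp
typedef int index_t; // stand-in for Shogun's index_t
class CFeatures;
class CList;

class CTestData
{
public:
    // samples from distribution idx; internally arr[0] = p, arr[1] = q, ...
    void register_samples(index_t idx, CFeatures* samples);
    // per-distribution blocksize for blockwise processing (B_x, B_y, ...)
    void set_blocksize(index_t idx, index_t blocksize);

    CList* get_samples();            // all samples, unmerged
    CFeatures* get_merged_samples(); // all samples, merged
    CList* get_blocks(index_t num_blocks_current_burst);
    CFeatures* get_merged_blocks(index_t num_blocks_current_burst);

    // shuffling lives here too, so computation never touches raw data
    void shuffle_samples(index_t idx);
    void shuffle_and_redistribute(); // for sampling the null

protected:
    index_t m_num_distributions;
};
```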
Now, for the computation hierarchy, we will have one CTestData* m_test_data as a component, plus a bunch of compute methods. The base class is still CHypothesisTest, which will still provide the interface for:

- compute_statistic()
- compute_blockwise_statistic() [maybe put this somewhere further down the hierarchy]
- compute_variance()
- compute_p_value()
- compute_threshold()
- perform_test()
- sample_null()
The rest of the hierarchy can stay the same. In the subclasses of CIndependenceTest we'll use m_test_data->get_samples() / m_test_data->get_blocks(index_t num_blocks_current_burst) inside the compute_statistic() / compute_blockwise_statistic() methods, and in the subclasses of CTwoSampleTest we'll use m_test_data->get_merged_samples() / m_test_data->get_merged_blocks(index_t num_blocks_current_burst) accordingly.

The way it will work is: first the user creates an instance of CTestData (for CDenseFeatures) or CStreamingTestData (for CStreamingFeatures). Then they feed the test data to some CHypothesisTest instance and use the rest just as before. This will work for independence tests/conditional-independence tests and for two-sample tests/three-variable interaction tests. It will also work for the different feature types required by the independence test framework. And if someone wants to perform a B-test on non-streaming samples, they can (it may not be that useful, but we no longer have to create a CStreamingFeatures instance from CFeatures every time someone wants to do that).

Also (thanks to @karlnapf for the suggestion): "the code to compute all these statistics is the same (all current tests in some way stream through the data and compute statistics, and it is really just the way the statistic is computed that changes from, say, HSIC to MMD), so it would be great to generalise this too, so that the difference is really just plugging in different statistic computation methods."
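In that spirit, a toy sketch of what the generalised loop could look like (everything here is hypothetical, just to illustrate "plugging in different statistic computation methods"):

```cpp
#include <cstddef>
#include <vector>

// A block source that hands out blocks until exhausted; stands in for the
// streaming/non-streaming fetchers discussed above.
struct VectorBlockSource
{
    std::vector<std::vector<double>> blocks;
    std::size_t pos = 0;

    bool next_block(std::vector<double>& out)
    {
        if (pos == blocks.size())
            return false;
        out = blocks[pos++];
        return true;
    }
};

// One generic streaming loop; only `statistic` changes between, say,
// an MMD-style and an HSIC-style test.
template <class BlockSource, class Statistic>
double streaming_average(BlockSource& source, Statistic statistic)
{
    double sum = 0;
    std::size_t count = 0;
    std::vector<double> block;
    while (source.next_block(block))
    {
        sum += statistic(block); // the only test-specific step
        ++count;
    }
    return count ? sum / count : 0.0;
}
```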