matsen closed this issue 12 years ago.
Bootstrapping should be implemented as follows.
Define a function fill_boot_row that takes an n_words-long vector and repeats the following step floor(n_words/word_length) times: uniformly select an entry and add 1.0 to it.

Then initialize boot_matrix, which should be a 100 (rows) by n_words (columns) float matrix initialized to all zeros. Then slice out every row and apply fill_boot_row to it once. The result of this will be that boot_matrix has random rows, each of which sums to floor(n_words/word_length).
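The two steps above can be sketched in Python (the real implementation is OCaml/C; this is only an illustration, and the helper name make_boot_matrix is hypothetical):

```python
import random

def fill_boot_row(row, word_length):
    """Repeat floor(n_words / word_length) times: uniformly select an
    entry of the row and add 1.0 to it (in place)."""
    n_words = len(row)
    for _ in range(n_words // word_length):
        row[random.randrange(n_words)] += 1.0

def make_boot_matrix(n_words, word_length, n_rows=100):
    """A 100-by-n_words float matrix of zeros, with fill_boot_row
    applied once to every row."""
    boot_matrix = [[0.0] * n_words for _ in range(n_rows)]
    for row in boot_matrix:
        fill_boot_row(row, word_length)
    return boot_matrix
```

Each row then sums to floor(n_words/word_length), as required.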
Now, assume that we are given a vector seq_word_counts of word counts for a given sequence. To calculate the bootstrap confidence value for its classification C, repeat the following for every row boot_row of boot_matrix: put the pairwise product (see below) of boot_row and seq_word_counts into a booted_word_counts vector, then classify booted_word_counts as before.

The different classifications obtained, along with their normalized frequencies, should then be stored analogously to the pplacer classifications with their likelihoods.
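A rough Python sketch of that loop, where classify stands in for the existing classifier (not shown here):

```python
from collections import Counter

def bootstrap_confidence(seq_word_counts, boot_matrix, classify):
    """For each bootstrap row, reweight the sequence's word counts by
    the pairwise product and re-classify; return each classification
    observed together with its normalized frequency."""
    tallies = Counter()
    for boot_row in boot_matrix:
        booted_word_counts = [b * c
                              for b, c in zip(boot_row, seq_word_counts)]
        tallies[classify(booted_word_counts)] += 1
    n_rows = len(boot_matrix)
    return {taxid: count / n_rows for taxid, count in tallies.items()}
```

The returned frequencies sum to one, which is what gets stored alongside the classifications.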
Here's the exceedingly simple vec_pairwise_prod_c. It's worth doing this in C. It needs an OCaml .mli entry as well.
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/bigarray.h>

/* dst[i] = x[i] * y[i] for double bigarrays of equal length. */
CAMLprim value vec_pairwise_prod_c(value dst_value, value x_value, value y_value)
{
  CAMLparam3(dst_value, x_value, y_value);
  double *dst = Data_bigarray_val(dst_value);
  double *x = Data_bigarray_val(x_value);
  double *y = Data_bigarray_val(y_value);
  int i;
  for(i = 0; i < Bigarray_val(x_value)->dim[0]; i++) {
    dst[i] = x[i] * y[i];
  }
  CAMLreturn(Val_unit);
}
Superseded by #250.
nbc = Naive Bayes Classifier.
The code here should be factored as usual, because eventually we are going to be doing an NBC + phylogenetic classification hybrid.
Note that there is some chance in the future we will want to change to a 32-bit int for freq_table, or even a float. I will use "vector" to mean Bigarray.Array1. There is some reasonable code to play with these things in Gsl_vector. Flags include:
Let n_words = 4^word_length. Assert that n_words < max_int. We will denote the corresponding variables with underscores in place of dashes.
Start by grouping the reference sequences at the given target rank, discarding those that do not have a classification at that rank. Say there are n_taxids taxids represented by at least one sequence at the target rank. Allocate an int Bigarray.Array2 freq_table of width n_words and height n_taxids with all zero entries.

We will need a subroutine word_to_int to turn an [ACGT]{word_length} into an integer. This will work by turning every A, C, G, and T into a pair of bits (e.g. A = 00, C = 01, G = 10, T = 11). Given a word, we concatenate these bits together into an integer. Please make an inverse function and then test round-tripping.
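A sketch of the encoding and its inverse in Python; the A = 00, C = 01, G = 10, T = 11 assignment is an assumption (any fixed pairing works, as long as the inverse uses the same one):

```python
BITS = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}  # assumed assignment
CHARS = 'ACGT'

def word_to_int(word):
    """Concatenate the 2-bit code of each base into one integer."""
    n = 0
    for base in word:
        n = (n << 2) | BITS[base]
    return n

def int_to_word(n, word_length):
    """Inverse of word_to_int for a fixed word length."""
    bases = []
    for _ in range(word_length):
        bases.append(CHARS[n & 0b11])
        n >>= 2
    return ''.join(reversed(bases))
```

Round-tripping int_to_word(word_to_int(w), len(w)) == w over all words is exactly the test requested above.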
Using this routine, build another add_counts_by_seq that takes a sequence and an int big-vector of length n_words and marches along the sequence's de-gapped version, finding words and adding to the corresponding entry. Specifically, for every word W of the ungapped sequence, increase the corresponding entry of the vector by one. For the time being, ignore words that are not exclusively composed of {A,C,G,T}.

Using this routine, fill out the entries in freq_table as follows: if a given sequence has taxonomic classification X, pull out the row for that taxid using slice and pass it along with the sequence to add_counts_by_seq.

Make a float vector prior_counts such that the entry for word i is (n(w_i) + 0.5) / (N + 1), where n(w_i) is the total number of observations of w_i and N is the total number of sequences.

Then build a float Array2 called taxid_word_counts that is n_taxids by n_words, where the entry at (x,i) is log(m(w_i) + prior_counts[i]) - denom; here m(w_i) is the number of sequences with word i for taxid x and denom = log(M + 1), where M is the number of sequences with taxid x.

Then, to classify a sequence, use add_counts_by_seq to get a vector of k-mer counts for that sequence. Multiply this vector by the taxid_word_counts matrix using Linear_utils.mat_vec_mul (into a pre-allocated vector) and find the index of the maximum value using Gsl_vector.max_index. The taxid corresponding to this index is our classification. This classification should then get output via the previous classification machinery.