matsen closed this issue 12 years ago.
Bootstrapping should be implemented as follows.
Define a function fill_boot_row that takes an n_words-long vector and repeats the following step floor(n_words/word_length) times: uniformly select an entry and add 1.0 to it.

Then initialize boot_matrix, which should be a 100 (rows) by n_words (columns) float matrix initialized to all zeros. Then slice out every row and apply fill_boot_row to it once. The result of this will be that boot_matrix has random rows, each of which sums to floor(n_words/word_length).
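The two steps above can be sketched in Python (the real implementation is OCaml/C; this is only an illustration, and the helper name make_boot_matrix is hypothetical):

```python
import random

def fill_boot_row(row, word_length):
    """Repeat floor(n_words / word_length) times: uniformly select an
    entry of the row and add 1.0 to it (in place)."""
    n_words = len(row)
    for _ in range(n_words // word_length):
        row[random.randrange(n_words)] += 1.0

def make_boot_matrix(n_words, word_length, n_rows=100):
    """A 100-by-n_words float matrix of zeros, with fill_boot_row
    applied once to every row."""
    boot_matrix = [[0.0] * n_words for _ in range(n_rows)]
    for row in boot_matrix:
        fill_boot_row(row, word_length)
    return boot_matrix
```

Each row then sums to floor(n_words/word_length), as required.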
Now, assume that we are given a vector seq_word_counts of word counts for a given sequence. To calculate the bootstrap confidence value for its classification C, repeat the following for every row boot_row of boot_matrix: put the pairwise product (see below) of boot_row and seq_word_counts into a booted_word_counts vector, then classify booted_word_counts as before.

The different classifications obtained, along with their normalized frequencies, should then be stored analogously to the pplacer classifications with their likelihoods.
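A rough Python sketch of that loop, where classify stands in for the existing classifier (not shown here):

```python
from collections import Counter

def bootstrap_confidence(seq_word_counts, boot_matrix, classify):
    """For each bootstrap row, reweight the sequence's word counts by
    the pairwise product and re-classify; return each classification
    observed together with its normalized frequency."""
    tallies = Counter()
    for boot_row in boot_matrix:
        booted_word_counts = [b * c
                              for b, c in zip(boot_row, seq_word_counts)]
        tallies[classify(booted_word_counts)] += 1
    n_rows = len(boot_matrix)
    return {taxid: count / n_rows for taxid, count in tallies.items()}
```

The returned frequencies sum to one, which is what gets stored alongside the classifications.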
Here's the exceedingly simple vec_pairwise_prod_c. It's worth doing this in C. It needs an OCaml .mli entry as well.
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/bigarray.h>

/* dst[i] = x[i] * y[i] for double bigarrays of equal length. */
CAMLprim value vec_pairwise_prod_c(value dst_value, value x_value, value y_value)
{
  CAMLparam3(dst_value, x_value, y_value);
  double *dst = Data_bigarray_val(dst_value);
  double *x = Data_bigarray_val(x_value);
  double *y = Data_bigarray_val(y_value);
  int i;
  for(i = 0; i < Bigarray_val(x_value)->dim[0]; i++) {
    dst[i] = x[i] * y[i];
  }
  CAMLreturn(Val_unit);
}
Superseded by #250.
nbc = Naive Bayes Classifier.
The code here should be factored as usual, because eventually we are going to be doing an NBC + phylogenetic classification hybrid.
Note that there is some chance in the future we will want to change to a 32-bit int for freq_table, or even a float. I will use "vector" to mean Bigarray.Array1. There is some reasonable code to play with these things in Gsl_vector. Flags include:
Let n_words = 4^word_length. Assert that n_words < max_int. We will denote the corresponding variables with underscores in place of dashes.
Start by grouping the reference sequences at the given target rank, discarding those that do not have a classification at that rank. Say there are n_taxids taxids represented by at least one sequence at the target rank. Allocate an int Bigarray.Array2 freq_table of width n_words and height n_taxids with all zero entries.

We will need a subroutine word_to_int to turn an [ACGT]{word_length} into an integer. This will work by turning every A, C, G, and T into a pair of bits (e.g. A = 00, C = 01, G = 10, T = 11). Given a word, we concatenate these bits together into an integer. Please make an inverse function and then test round-tripping.
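A sketch of the encoding and its inverse in Python; the A = 00, C = 01, G = 10, T = 11 assignment is an assumption (any fixed pairing works, as long as the inverse uses the same one):

```python
BITS = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}  # assumed assignment
CHARS = 'ACGT'

def word_to_int(word):
    """Concatenate the 2-bit code of each base into one integer."""
    n = 0
    for base in word:
        n = (n << 2) | BITS[base]
    return n

def int_to_word(n, word_length):
    """Inverse of word_to_int for a fixed word length."""
    bases = []
    for _ in range(word_length):
        bases.append(CHARS[n & 0b11])
        n >>= 2
    return ''.join(reversed(bases))
```

Round-tripping int_to_word(word_to_int(w), len(w)) == w over all words is exactly the test requested above.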
Using this routine, build another add_counts_by_seq that takes a sequence and an int big-vector of length n_words and marches along the sequence's de-gapped version, finding words and adding to the corresponding entry. Specifically, for every word W of the ungapped sequence, increase the corresponding entry of the vector by one. For the time being, ignore words that are not exclusively composed of {A,C,G,T}.

Using this routine, fill out the entries in freq_table as follows: if a given sequence has taxonomic classification X, pull out the row for that taxid using slice and pass it along with the sequence to add_counts_by_seq.

Make a float vector prior_counts such that the entry for word i is (n(w_i) + 0.5) / (N + 1), where n(w_i) is the total number of observations of w_i and N is the total number of sequences.

Then build a float Array2 called taxid_word_counts that is n_taxids by n_words, where the entry at (x,i) is log(m(w_i) + prior_counts[i]) - denom; here m(w_i) is the number of sequences with word i for taxid x and denom = log(M + 1), where M is the number of sequences with taxid x.

Then, to classify a sequence, use add_counts_by_seq to get a vector of k-mer counts for that sequence. Multiply this vector by the taxid_word_counts matrix using Linear_utils.mat_vec_mul (into a pre-allocated vector) and find the index of the maximum value using Gsl_vector.max_index. The taxid corresponding to this index is our classification. This classification should then get output via the previous classification machinery.