matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0

guppy nbc #233

Closed · matsen closed this issue 12 years ago

matsen commented 12 years ago

nbc = Naive Bayes Classifier.

The code here should be factored as usual, because eventually we are going to be doing an NBC + phylogenetic classification hybrid.

Note that there is some chance that, in the future, we will want to change freq_table to a 32-bit int, or even a float. I will use "vector" to mean Bigarray.Array1. There is some reasonable code for playing with these things in Gsl_vector.

Flags include:

-c              Ref package
--word-length   The length of the word used to do the NBC classification (Default 8)
--target-rank   The desired most specific rank for classification (Default genus)

Let n_words = 4^word_length. Assert that n_words < max_int.

We will denote the corresponding variables with underscores in place of dashes.

Start by grouping the reference sequences at the given target rank, discarding those that do not have a classification at that rank. Say there are n_taxids taxids represented by at least one sequence at the target rank.

Allocate an int Bigarray.Array2 freq_table of width n_words and height n_taxids, with all entries zero.

We will need a subroutine word_to_int to turn an [ACGT]{word_length} word into an integer. This will work by turning each of A, C, G, and T into a pair of bits:

A -> 0 0
C -> 0 1
G -> 1 0
T -> 1 1

Given a word, we concatenate these bits together into an integer. Please make an inverse function and then test round-tripping.
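A minimal sketch of the packing and its inverse, written here in C for concreteness (the real routines would live in the OCaml tree; all names are illustrative):

```c
#include <assert.h>
#include <string.h>

/* Map one nucleotide to its 2-bit code: A->00, C->01, G->10, T->11. */
static int base_to_bits(char c)
{
  switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;  /* not in {A,C,G,T} */
  }
}

/* word_to_int: pack a word into an int, two bits per base. */
static int word_to_int(const char *word, int word_length)
{
  int acc = 0;
  for (int i = 0; i < word_length; i++) {
    int b = base_to_bits(word[i]);
    assert(b >= 0);
    acc = (acc << 2) | b;
  }
  return acc;
}

/* Inverse of word_to_int, for round-trip testing. */
static void int_to_word(int code, int word_length, char *out)
{
  static const char bases[4] = { 'A', 'C', 'G', 'T' };
  for (int i = word_length - 1; i >= 0; i--) {
    out[i] = bases[code & 3];
    code >>= 2;
  }
  out[word_length] = '\0';
}
```

Round-tripping: int_to_word applied to word_to_int of any word over {A,C,G,T} should give back the original word.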

Using this routine, build another routine, add_counts_by_seq, that takes a sequence and an int big-vector of length n_words, and marches along the de-gapped version of the sequence, finding words and adding to the corresponding entries. Specifically, for every word W of the ungapped sequence, increase the corresponding entry of the vector by one. For the time being, ignore words that are not composed exclusively of {A,C,G,T}.
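A C sketch of add_counts_by_seq under the same illustrative encoding (the real version would take a Bigarray big-vector; the gap characters '-' and '.' are an assumption):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A->00, C->01, G->10, T->11; anything else is rejected. */
static int base_to_bits(char c)
{
  switch (c) {
    case 'A': return 0;  case 'C': return 1;
    case 'G': return 2;  case 'T': return 3;
    default:  return -1;
  }
}

/* add_counts_by_seq: march along the de-gapped sequence and, for every
   word composed exclusively of {A,C,G,T}, bump the corresponding entry
   of counts (an array of length n_words = 4^word_length). */
static void add_counts_by_seq(const char *seq, int word_length, int *counts)
{
  size_t n = strlen(seq), m = 0;
  char *degapped = malloc(n + 1);
  for (size_t i = 0; i < n; i++)             /* drop gap characters */
    if (seq[i] != '-' && seq[i] != '.')
      degapped[m++] = seq[i];

  for (size_t i = 0; i + (size_t)word_length <= m; i++) {
    int code = 0, ok = 1;
    for (int j = 0; j < word_length; j++) {  /* pack the word */
      int b = base_to_bits(degapped[i + j]);
      if (b < 0) { ok = 0; break; }          /* non-ACGT: skip this word */
      code = (code << 2) | b;
    }
    if (ok) counts[code]++;
  }
  free(degapped);
}
```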

Using this routine, fill out the entries of freq_table as follows: if a given sequence has taxonomic classification X, pull out the row for that taxid using slice and pass it, along with the sequence, to add_counts_by_seq.

Make a float vector prior_counts such that the entry for word w_i is (n(w_i) + 0.5) / (N + 1), where n(w_i) is the total number of observations of w_i and N is the total number of sequences.

Then build a float Array2 called taxid_word_counts that is n_taxids by n_words, where the entry at (x, i) is log(m_x(w_i) + prior_counts[i]) - denom; here m_x(w_i) is the number of sequences for taxid x containing word w_i, and denom = log(M_x + 1), where M_x is the number of sequences with taxid x.

Then, to classify a sequence, use add_counts_by_seq to get a vector of k-mer counts for that sequence. Multiply this vector by the taxid_word_counts matrix using Linear_utils.mat_vec_mul (into a pre-allocated vector) and find the index of the maximum value using Gsl_vector.max_index. The taxid corresponding to this index is our classification.
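The classification step is a matrix-vector product followed by an argmax; here is a plain-C stand-in for the Linear_utils.mat_vec_mul plus Gsl_vector.max_index pair (like GSL's max_index, ties go to the lowest index):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* score[x] = sum_i taxid_word_counts[x][i] * seq_word_counts[i];
   return the first row index achieving the maximum score. */
static int classify(const double *taxid_word_counts, int n_taxids,
                    int n_words, const double *seq_word_counts)
{
  int best = 0;
  double best_score = -INFINITY;
  for (int x = 0; x < n_taxids; x++) {
    double score = 0.0;
    for (int i = 0; i < n_words; i++)
      score += taxid_word_counts[(size_t)x * n_words + i] * seq_word_counts[i];
    if (score > best_score) { best_score = score; best = x; }
  }
  return best;
}
```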

This classification should then get output via the previous classification machinery.

matsen commented 12 years ago

Bootstrapping should be implemented as follows.

Define a function fill_boot_row that takes an n_words-long vector and repeats the following step floor(n_words/word_length) times: uniformly select an entry and add 1.0 to it.

Then initialize boot_matrix, a 100 (rows) by n_words (columns) float matrix, initialized to all zeros. Then slice out every row and apply fill_boot_row to it once. As a result, boot_matrix will have random rows, each of which sums to floor(n_words/word_length).
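A sketch of these two setup steps (plain C arrays standing in for the Bigarray matrix; rand() is used for uniform selection, with its modulo bias ignored for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* fill_boot_row: floor(n_words / word_length) times, uniformly select
   an entry and add 1.0 to it.  The row then sums to exactly that count. */
static void fill_boot_row(double *row, int n_words, int word_length)
{
  int draws = n_words / word_length;
  for (int k = 0; k < draws; k++)
    row[rand() % n_words] += 1.0;
}

/* Apply fill_boot_row once to every row of an all-zero matrix
   (n_rows x n_words, row-major). */
static void make_boot_matrix(double *boot_matrix, int n_rows,
                             int n_words, int word_length)
{
  for (int r = 0; r < n_rows; r++)
    fill_boot_row(boot_matrix + (size_t)r * n_words, n_words, word_length);
}
```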

Now, assume that we are given a vector seq_word_counts of word counts for a given sequence. To calculate the bootstrap confidence value for its classification C, repeat the following for every row boot_row of boot_matrix: put the pairwise product (see below) of boot_row and seq_word_counts into a booted_word_counts vector. Then classify booted_word_counts as before.
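Putting the bootstrap loop together, a sketch (classify is the argmax scorer from the classification step, repeated here so the block is self-contained, and the inner loop plays the role of vec_pairwise_prod_c; boot_confidence is a hypothetical name):

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <stddef.h>

/* Argmax of taxid_word_counts * counts, as in the classification step. */
static int classify(const double *taxid_word_counts, int n_taxids,
                    int n_words, const double *counts)
{
  int best = 0;
  double best_score = -INFINITY;
  for (int x = 0; x < n_taxids; x++) {
    double score = 0.0;
    for (int i = 0; i < n_words; i++)
      score += taxid_word_counts[(size_t)x * n_words + i] * counts[i];
    if (score > best_score) { best_score = score; best = x; }
  }
  return best;
}

/* For each boot row, reweight seq_word_counts by the pairwise product
   and reclassify; return the fraction of rows agreeing with c. */
static double boot_confidence(const double *boot_matrix, int n_boot,
                              const double *taxid_word_counts, int n_taxids,
                              int n_words, const double *seq_word_counts,
                              int c)
{
  double *booted = malloc(n_words * sizeof *booted);
  int hits = 0;
  for (int r = 0; r < n_boot; r++) {
    const double *boot_row = boot_matrix + (size_t)r * n_words;
    for (int i = 0; i < n_words; i++)      /* the vec_pairwise_prod step */
      booted[i] = boot_row[i] * seq_word_counts[i];
    if (classify(taxid_word_counts, n_taxids, n_words, booted) == c)
      hits++;
  }
  free(booted);
  return (double)hits / n_boot;
}
```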

The different classifications obtained, along with their normalized frequencies, should then be stored analogously to the pplacer classifications with their likelihoods.

Here's the exceedingly simple vec_pairwise_prod_c. It's worth doing this in C. It needs to have an OCaml .mli entry as well.

/* dst[i] = x[i] * y[i], elementwise, over double bigarrays of equal length. */
CAMLprim value vec_pairwise_prod_c(value dst_value, value x_value, value y_value)
{
  CAMLparam3(dst_value, x_value, y_value);
  double *dst = Data_bigarray_val(dst_value);
  double *x = Data_bigarray_val(x_value);
  double *y = Data_bigarray_val(y_value);
  int i;
  for (i = 0; i < Bigarray_val(x_value)->dim[0]; i++) {
    dst[i] = x[i] * y[i];
  }
  CAMLreturn(Val_unit);
}
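A plausible sketch of the corresponding .mli entry (untested; the exact Bigarray kind annotations are assumptions, with dst first as in the C stub):

```ocaml
(* Hypothetical signature; arguments are dst, x, y as in vec_pairwise_prod_c. *)
external vec_pairwise_prod :
  (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t ->
  (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t ->
  (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t ->
  unit
  = "vec_pairwise_prod_c"
```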
habnabit commented 12 years ago

Superseded by #250.