classif_table.py undercounts read tallies after reduplication

In the examples below, I created a data set that either underwent "real" deduplication (using deduplicate_sequences.py) or "fake" deduplication using this script:

#!/usr/bin/env python

"""
Usage seqmagick extract-ids seqs.fasta | pretend_to_deduplicate.py > dedup_info.csv
"""

import sys
import csv

def main():
    writer = csv.writer(sys.stdout)
    for name in sys.stdin:
        name = name.strip()
        writer.writerow([name, name, 1])

main()

In the examples below, the output using "real" deduplication is in output-dedup, "fake" in output.

% grep -c '>' data/seqs.fasta
12279
% src/pplacer/scripts/classif_table.py output/placements.db -m data/seq_info.csv by_taxon.species.csv
% R -q --no-save -e 'sum(read.csv("by_taxon.species.csv")$tally)'
> sum(read.csv("by_taxon.species.csv")$tally)
[1] 12279
% src/pplacer/scripts/classif_table.py output-dedup/placements.db -m data/seq_info.csv by_taxon.species.csv
% R -q --no-save -e 'sum(read.csv("by_taxon.species.csv")$tally)'
> sum(read.csv("by_taxon.species.csv")$tally)
[1] 10849

This manifests in an actual analysis as under-counting for each specimen, for example:

   specimen raw_tally tally yield
1   j14tr10      6560  3800  57.9
2   j14tr11      5719  4787  83.7
3   j14tr12      6201  5925  95.5
4   j14tr17      6674  3124  46.8
5   j14tr18      8023  3349  41.7
6   j14tr19      8206  3101  37.8
7   j14tr22      6349  2512  39.6
8   j14tr23      7627  3040  39.9
9   j14tr24      7509  2516  33.5

As far as I can tell, the behavior is the same in classif_rect.py (the original script from which classif_table.py was derived).

For the time being, the fix appears to be not to deduplicate before aligning and running pplacer.

matsen / pplacer

classif_table.py undercounts read tallies after reduplication #334