In the examples below, I created a data set that either underwent "real" deduplication (using deduplicate_sequences.py) or "fake" deduplication using this script:
#!/usr/bin/env python
"""
Usage: seqmagick extract-ids seqs.fasta | pretend_to_deduplicate.py > dedup_info.csv
"""
import sys
import csv
def main():
    writer = csv.writer(sys.stdout)
    for name in sys.stdin:
        name = name.strip()
        writer.writerow([name, name, 1])

if __name__ == '__main__':
    main()
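For illustration, here is a sketch of the mapping the "fake" deduplication emits, using hypothetical sequence IDs (seq1, seq2 are made up, not from the actual data set): every sequence becomes its own representative with a count of 1.

```python
import csv
import io

# Hypothetical sequence IDs for illustration only.
names = ["seq1", "seq2"]

# Same row shape as pretend_to_deduplicate.py: (kept id, original id, count).
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
for name in names:
    writer.writerow([name, name, 1])

print(buf.getvalue(), end="")
# seq1,seq1,1
# seq2,seq2,1
```

So dedup_info.csv from the "fake" run is an identity mapping: nothing is collapsed and every count is 1.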
In the examples below, the output from "real" deduplication is in output-dedup and the output from "fake" deduplication is in output.
In an actual analysis, this manifests as under-counting of reads for each specimen, for example:
As far as I can tell, the behavior is the same in classif_rect.py (the original script from which classif_table.py was derived).
For the time being, the workaround appears to be to skip deduplication before aligning and running pplacer.
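To make the under-count concrete, here is a minimal sketch with made-up reads (not actual pipeline output): if identical reads collapse to one representative and the per-representative multiplicities recorded in dedup_info.csv are ignored downstream, each specimen's total shrinks to the number of distinct sequences.

```python
# Hypothetical example: one specimen contributed three identical reads.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT"]

true_count = len(reads)    # reads the specimen actually contributed
deduped = set(reads)       # representatives left after deduplication
reported = len(deduped)    # count seen if dedup multiplicities are dropped

print(true_count, reported)
# 3 1
```

This is why the "fake" (identity) deduplication above sidesteps the problem: with counts of 1 per sequence and nothing collapsed, the per-specimen totals come out right.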