nhoffman / bioy

Tools for NGS sequence analysis and bacterial classification
GNU General Public License v3.0

classifier --split-condensed-assignments double-counts centroids #58

Closed: tyleraland closed 8 years ago

tyleraland commented 8 years ago

I noticed this in a report in my working directory (/mnt/disk2/molmicro/working/tland9/2016-02-26_swarm-n-more/report/142_13.html). The file is subject to change, but you can take a peek to get an idea of what I mean.

In particular, look at the rows for MAD with assignment_id=10 and MAD with assignment_id=20. The assignment was split, but every qseqid is shared between the two classifications. This indicates that all of the details rows for the first classification are also attached to the second, even though each set of details rows has different assignment_keys. (In that report, we use the classification to key into details to get the qseqid, which in turn keys into centroids.)

To resolve this, I suppose that once classifications have been split by qseqid, the details need to be split by qseqid too.
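To make the symptom concrete, here is a minimal sketch of how shared details rows inflate the centroid totals. The column names come from the report described above; the row values, the read counts, and the helper names `reads_per_assignment` and `shared_qseqids` are made up for illustration and are not part of bioy.

```python
from collections import defaultdict

# Hypothetical, simplified rows: the values are invented for illustration.
classifications = [
    {"specimen": "MAD", "assignment_id": 10},
    {"specimen": "MAD", "assignment_id": 20},
]

# Buggy state: every qseqid is listed under both assignment_ids.
details = [
    {"assignment_id": aid, "qseqid": q}
    for aid in (10, 20)
    for q in ("c1", "c2", "c3")
]

# qseqid -> number of reads represented by that centroid (invented numbers).
centroids = {"c1": 5, "c2": 3, "c3": 2}


def reads_per_assignment(classifications, details, centroids):
    """Follow classification -> details -> centroids and sum reads."""
    qseqids = defaultdict(set)
    for row in details:
        qseqids[row["assignment_id"]].add(row["qseqid"])
    return {
        c["assignment_id"]: sum(centroids[q] for q in qseqids[c["assignment_id"]])
        for c in classifications
    }


def shared_qseqids(details):
    """Return the qseqids that appear under more than one assignment_id."""
    seen = defaultdict(set)
    for row in details:
        seen[row["qseqid"]].add(row["assignment_id"])
    return {q: sorted(ids) for q, ids in seen.items() if len(ids) > 1}


counts = reads_per_assignment(classifications, details, centroids)
overlap = shared_qseqids(details)
```

With these invented numbers the centroids hold 10 reads in total, but the two assignments each claim all 10, so the report would show 20. A check like `shared_qseqids` is one way to spot the overlap in the details output.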

crosenth commented 8 years ago

The bug was in mapping the 'assignment_id' values from the classifications output to the details output: the 'assignment_hash' had been dropped in an earlier step, so it was no longer available for the mapping. The fix was to retain the 'assignment_hash' until after the 'assignment_id' values had been added to the details output.
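The difference can be sketched roughly as follows. This is not the actual bioy code: the helpers `add_assignment_ids` and `drop_column`, the row values, and the assumption that the mapping fell back to matching on specimen alone once the hash was gone are all hypothetical, chosen only to show why the order of the two steps matters.

```python
# Hypothetical rows; only the column names are taken from this thread.
classifications = [
    {"specimen": "MAD", "assignment_hash": "h1", "assignment_id": 10},
    {"specimen": "MAD", "assignment_hash": "h2", "assignment_id": 20},
]
details = [
    {"specimen": "MAD", "assignment_hash": "h1", "qseqid": "c1"},
    {"specimen": "MAD", "assignment_hash": "h1", "qseqid": "c2"},
    {"specimen": "MAD", "assignment_hash": "h2", "qseqid": "c3"},
]


def drop_column(rows, name):
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def add_assignment_ids(details, classifications, keys):
    """Attach assignment_id to each details row that matches on all keys."""
    out = []
    for d in details:
        for c in classifications:
            if all(d.get(k) == c.get(k) for k in keys):
                out.append(dict(d, assignment_id=c["assignment_id"]))
    return out


# Assumed buggy order: the hash is dropped first, so only specimen is left
# to match on and every details row picks up both assignment_ids.
buggy = add_assignment_ids(
    drop_column(details, "assignment_hash"), classifications, keys=("specimen",)
)

# Fixed order: map on specimen and assignment_hash, then drop the hash.
fixed = drop_column(
    add_assignment_ids(
        details, classifications, keys=("specimen", "assignment_hash")
    ),
    "assignment_hash",
)
```

In this sketch the buggy order yields six details rows (each qseqid under both ids), while the fixed order yields three, with c1 and c2 under assignment_id 10 and c3 under 20.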