nhoffman / bioy

Tools for NGS sequence analysis and bacterial classification
GNU General Public License v3.0

limit to one qseqid per classification in --details #47

Open nhoffman opened 8 years ago

nhoffman commented 8 years ago

We need to have consistent criteria for including a single qseqid for each assignment_id in the output.

crosenth commented 8 years ago

Agreed, and I assume you are referring to the single largest-cluster qseqid in --details-summary.

For some history, we output up to two qseqids per --details-summary. The latest issue about the --details-summary output is from 2014-11-26: https://bitbucket.org/uwlabmed/markergene_pipeline/issues/19/starred-genus-name-without-apparent-100

Summary: Because --max-group-size combines species level hits with genus level hits, --details-summary will output the largest assignment cluster or the two largest assignment clusters of the combined rank assignments. For example:

1) hits:
   qseqid 1 - assignment: species 1/2*/3/4
   qseqid 2 - assignment: genus 1

2) --min-group-size 3 bumps qseqid 1 - assignment: genus 1
   qseqid 1 - genus 1
   qseqid 2 - genus 1

3) assignment becomes genus 1*

4) --details-summary outputs qseqids 1 + 2

The reason is that sometimes the largest cluster is qseqid 2, so with only one qseqid the details would show a single genus-level assignment_tax_name. That is confusing because the assignment is starred and should therefore contain at least one 100% hit to species 2.
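To make the example concrete, here is a rough Python sketch of the four steps. It is not the bioy implementation: the `Cluster` class, the `collapse` and `summarize` helpers, the cluster sizes, and the "more than N species" threshold rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    qseqid: str
    size: int        # reads in the cluster (made-up numbers below)
    species: tuple   # species-level names hit; empty if genus-level only
    starred: bool    # True if the cluster has a 100% hit


def collapse(cluster, genus, group_size):
    """Assumed rule: bump to genus when more than group_size species are named."""
    if not cluster.species or len(cluster.species) > group_size:
        return genus
    return "/".join(cluster.species)


def summarize(clusters, genus, group_size, keep):
    """Group clusters by collapsed name and keep the `keep` largest per group."""
    groups = {}
    for c in clusters:
        groups.setdefault(collapse(c, genus, group_size), []).append(c)
    out = {}
    for name, members in groups.items():
        # The combined assignment is starred if any member is starred.
        star = "*" if any(m.starred for m in members) else ""
        largest = sorted(members, key=lambda m: (-m.size, m.qseqid))[:keep]
        out[name + star] = [m.qseqid for m in largest]
    return out


clusters = [
    Cluster("qseqid_1", 40,
            ("species 1", "species 2", "species 3", "species 4"), True),
    Cluster("qseqid_2", 60, (), False),
]

two = summarize(clusters, "genus 1", group_size=3, keep=2)
one = summarize(clusters, "genus 1", group_size=3, keep=1)
print(two)
print(one)
```

With `keep=2` both qseqids are reported under the starred genus. With `keep=1` only the larger, genus-only cluster survives, so the star has no visible 100% hit behind it.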

Yes, we should simplify or redesign the --details-summary criteria.
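One possible criterion, only as a starting point for discussion: pick a single qseqid per assignment_id, preferring starred clusters, then larger clusters, then the qseqid string as a tie-breaker so the choice is deterministic. The row layout and the `starred` and `cluster_size` field names below are hypothetical, not the actual --details columns.

```python
def pick_representative(rows):
    """Return {assignment_id: qseqid}, one qseqid per assignment_id.

    Preference order (an assumed rule, not current behavior):
    starred first, then largest cluster, then lowest qseqid.
    """
    best = {}
    for row in rows:
        key = (not row["starred"], -row["cluster_size"], row["qseqid"])
        current = best.get(row["assignment_id"])
        if current is None or key < current[0]:
            best[row["assignment_id"]] = (key, row["qseqid"])
    return {aid: qseqid for aid, (_, qseqid) in best.items()}


rows = [
    {"assignment_id": 0, "qseqid": "qseqid_1", "cluster_size": 40, "starred": True},
    {"assignment_id": 0, "qseqid": "qseqid_2", "cluster_size": 60, "starred": False},
]

picked = pick_representative(rows)
print(picked)
```

Under this rule the starred cluster wins even though it is smaller, so the single reported qseqid always backs up the star on the assignment.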