stenglein-lab / tick_surveillance

Bioinformatics pipeline for analysis of amplicon sequencing of tick-associated microbes
Apache License 2.0

Surveillance target behavior with respect to species-level assignments #93

Open stenglein-lab opened 5 months ago

stenglein-lab commented 5 months ago

CDC colleagues reported somewhat unexpected behavior for positive/negative calls in some surveillance results:

[image: surveillance results for sample D-18]

Here, you can see that this sample (D-18 in this run) was called positive for surveillance target Borrelia_sp but there are no names in the Borrelia_sp_other column, which would normally list the names of the Borrelia species that contributed to this positive call.

So there is a mismatch between the pos/neg Borrelia_sp call (Positive) and the Borrelia_sp_other names column (nothing listed).

Looking at the all_data tab of the output, you can see these abundances for sample D-18:

[image: all_data tab abundances for sample D-18]

There are 2 Borrelia ref seqs that map to the Borrelia_sp and Borrelia_sp_other reporting columns: Bor_burgdorferi_CP017201 and Bor_SCGT_10_AF264895, which have 48 and 30 reads in this dataset.

Here is the part of targets.tsv (v107) that maps those refseqs to those reporting columns (some rows hidden):

[image: excerpt of targets.tsv (v107)]

The issue is that the sum of those counts (48 + 30 = 78) is >50, the minimum count for making a positive call, but neither count is individually >50, so neither name shows up in the Borrelia_sp_other names column.

The way these columns are populated can be seen in these code snippets. First, for count-type columns in the surveillance table:

[image: code snippet for count-type columns]

And for name-type columns:

[image: code snippet for name-type columns]

So for name-type columns, an individual target has to be called positive for its species name to appear. But for count-type columns, the sum of the counts for all targets that contribute to that column is used for the pos/neg call.
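To make the mismatch concrete, here is a minimal Python sketch of the two rules as described above. It is not the pipeline's actual code: the function names, the `MIN_READS` constant, and the use of `>=` for the cutoff comparison are assumptions for illustration. The read counts are the ones reported for sample D-18.

```python
# Illustrative sketch only; names and the >= comparison are assumptions,
# not taken from the pipeline's code.
MIN_READS = 50  # minimum count for a positive call, per the discussion above

# Read counts for sample D-18 as reported in this issue.
reads = {
    "Bor_burgdorferi_CP017201": 48,
    "Bor_SCGT_10_AF264895": 30,
}

def count_column_call(reads, min_reads=MIN_READS):
    """Count-type column: the call uses the summed reads of all contributing targets."""
    return "Positive" if sum(reads.values()) >= min_reads else "Negative"

def name_column_value(reads, min_reads=MIN_READS):
    """Name-type column: only targets that are individually positive are listed."""
    names = [name for name, n in reads.items() if n >= min_reads]
    return "; ".join(names)

print(count_column_call(reads))        # Positive (48 + 30 = 78)
print(repr(name_column_value(reads)))  # '' (neither target reaches the cutoff alone)
```

The two functions apply the same cutoff to different quantities (a sum versus individual counts), which is why the pos/neg column and the names column can disagree.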

In our discussions, the idea was brought up to use the summed read counts for each species to decide whether to include a name or not in a name-type column.
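One possible reading of that idea, sketched in Python under stated assumptions: reads are first summed per species, and a species name is listed if its summed count meets the cutoff. The refseq IDs, species labels, and counts below are hypothetical placeholders, as are the function name and the `>=` comparison; none of this is taken from targets.tsv or the pipeline code.

```python
from collections import defaultdict

MIN_READS = 50  # assumed cutoff, matching the minimum count discussed above

# Hypothetical (refseq, species, reads) rows for one sample and one name-type column.
rows = [
    ("refseq_A1", "Species_A", 30),
    ("refseq_A2", "Species_A", 25),
    ("refseq_B1", "Species_B", 20),
]

def names_by_species_sum(rows, min_reads=MIN_READS):
    """List species whose reads, summed over all of their refseqs, meet the cutoff."""
    totals = defaultdict(int)
    for _refseq, species, n in rows:
        totals[species] += n
    return sorted(s for s, total in totals.items() if total >= min_reads)

print(names_by_species_sum(rows))  # ['Species_A'] (30 + 25 = 55; Species_B has only 20)
```

Note that under this rule a mismatch could still occur when the targets contributing to a column belong to different species: the column-level sum can cross the cutoff while no single species does, leaving the names column empty.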

Some notes about this:

Need to discuss more with CDC folks.