Post-Stutter Filtering Question

michaeldonaldson commented 4 years ago

Hello,

Thank you for this excellent tool! It is working very nicely on HiSeq/MiSeq data generated from amplicon sequencing of a non-model organism.

I wonder, after stutter filtering, is there an easy way to extract genotypes at loci with > 95% overall confidence and perhaps a full read threshold > 30 or > 50? I would like to apply some very conservative filtering for the retained profiles that are from di-nucleotide repeat motifs. I am working with > 250 samples and ~ 20 loci so this is quite a bit of work to do manually. Any help would be greatly appreciated.

Sincerely, Mike

marcelTBI commented 4 years ago

Hi,

Thank you for using our tool. We are planning to expand the tool in the following months, so every suggestion and/or issue report is more than welcome.

I am not entirely sure I understand what exactly you are trying to do, but a few output files that could help you are hidden (and undocumented - TODO) in the motif folder. For example, if you have a motif called DMPK, the final report directory will contain a folder called DMPK, which holds the figures used in the final report.html as well as files with counts and partial results (the number after the underscore is the index of the variable part, since Dante allows for complex motifs with multiple variable parts):

allcall_1.txt -- the prediction of the alleles with confidence (maybe the answer to the first part of your question?)
annotations_1.txt -- reads and their annotations, with mutations/insertions/deletions visualized; a line such as 00000000000011111111112222222222222222 shows which part of the motif (flanking region or variable part) each position is annotated to
repetitions_1.txt -- counts of reads that are annotated for the region (full coverage only - blue in the final figure). The first column is the read count and the remaining columns are the numbers of repeats (1 for flanking regions), so the line "14 1 12 1" means that we have 14 occurrences (annotated reads) where the motif is annotated as 1 left flanking region, 12 repetitions of the variable region, and 1 right flanking region.
repetitions_grey_1.txt -- same as above, but for reads that are not fully covered (grey in the figures). The line "14 0 6 1" means that we have 14 occurrences where the motif is annotated with at least 6 repetitions of the variable region and 1 right flanking region (no left flanking region, since the read starts in the variable part).
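
To make the repetitions_1.txt layout concrete, here is a minimal Python sketch of how such lines could be parsed and summed per repeat number. It assumes the file is whitespace-separated with no header line, and that the variable part sits in the second repeat column, as in the "14 1 12 1" example; the function names are hypothetical and not part of Dante.

```python
from collections import Counter


def parse_repetitions(text):
    """Parse lines like "14 1 12 1" into (read_count, [repeats per module])."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        count, *modules = (int(f) for f in fields)
        rows.append((count, modules))
    return rows


def reads_per_repeat_number(rows, variable_index=1):
    """Sum read counts for each repeat number of the variable part.

    variable_index is the 0-based position of the variable part among the
    repeat columns (an assumption based on the "1 12 1" example).
    """
    totals = Counter()
    for count, modules in rows:
        totals[modules[variable_index]] += count
    return totals


# Toy input in the documented format, not real data.
example = "14 1 12 1\n9 1 11 1\n3 1 12 1\n"
rows = parse_repetitions(example)
totals = reads_per_repeat_number(rows)
full_reads = sum(totals.values())
print(dict(totals), full_reads)
```

The total of fully covered reads (`full_reads` here) is one candidate for a read-depth threshold such as the > 30 or > 50 mentioned in the question, though whether it matches the threshold used in the report is an assumption.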

Unfortunately, there is currently no way to filter the motifs by the criteria you mentioned, other than reading the files above, parsing them, extracting the information, and filtering accordingly. I will keep this in mind and try to implement some motif filtering in the future, but this will take time.

I hope this answers your question; feel free to elaborate if not. I promise to answer more promptly.

Best, Marcel

michaeldonaldson commented 4 years ago

Hi Marcel,

Thank you for the comprehensive response. I'm working with a non-model organism and a mixture of sequence capture and amplicon sequencing data from HiSeq and MiSeq runs, respectively. Your tool works very well in extracting the microsatellites and calling the genotypes from these two different datasets!

We were hoping to generate genotypes to help inform population genetic structure so the end goal was to generate a STRUCTURE formatted file with the output from dante. However, I did want to filter the data in a conservative manner. To do so, I worked from the table.tsv files generated for each individual because it has all the information I needed; basically, I used grep to combine all the non-header lines from ~250 samples to get them into a single tsv file. Next, I was able to print the genotype from dante if the overall confidence for the locus was >95%, otherwise the genotype was changed to "-9". I also could have filtered for the full read threshold here but chose not to (but I suggest this is an important consideration as well). These new genotypes were used for STRUCTURE analysis.
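
The confidence filter described above could look roughly like the following Python sketch. The column names (`sample`, `motif`, `allele1`, `allele2`, `confidence`) are hypothetical stand-ins, since the actual table.tsv header is not shown in this thread, and the header row is added here only for readability (the combined file described above had its header lines stripped). The confidence is assumed to be a percentage, optionally with a trailing "%".

```python
import csv
import io

MISSING = "-9"  # missing-data code used for the filtered genotypes


def filter_genotypes(tsv_text, min_confidence=95.0):
    """Keep a genotype only if its confidence exceeds min_confidence."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        # Hypothetical column names; adjust to the real table.tsv header.
        confidence = float(row["confidence"].rstrip("%"))
        if confidence > min_confidence:
            alleles = (row["allele1"], row["allele2"])
        else:
            alleles = (MISSING, MISSING)
        kept.append((row["sample"], row["motif"], alleles))
    return kept


# Toy rows for illustration only, not real Dante output.
toy = (
    "sample\tmotif\tallele1\tallele2\tconfidence\n"
    "S1\tlocA\t12\t14\t98.2%\n"
    "S1\tlocB\t9\t9\t71.0%\n"
)
filtered = filter_genotypes(toy)
for record in filtered:
    print(record)
```

A read-depth threshold could be added as a second condition in the same `if`, provided the table carries a suitable read-count column.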

I thought you'd like to know "how" your tool is being used because it's quite useful but there are some situations where users might want to further filter the data, or get it into a universal format for downstream analysis (STRUCTURE is pretty common). Oh, and it's worth noting that the resulting STRUCTURE plots using the dante genotypes are consistent with our predictions.
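
For anyone wanting to do the same conversion, a minimal sketch of writing filtered genotypes in a STRUCTURE-style layout is below. It assumes diploid data, the common two-rows-per-individual layout with a first row of locus names, and -9 as the missing-data code; the function name and the input dictionary shape are hypothetical, and STRUCTURE's own parameter settings would need to match whatever layout is chosen.

```python
MISSING = -9  # missing-data code


def to_structure(genotypes, loci):
    """Render {sample: {locus: (allele1, allele2)}} as STRUCTURE-style text.

    Each individual gets two rows (one per allele copy); loci without a
    call are written as the missing-data code.
    """
    lines = ["\t".join(loci)]  # header row of locus names
    for sample in sorted(genotypes):
        calls = genotypes[sample]
        for copy in (0, 1):
            alleles = [
                str(calls.get(locus, (MISSING, MISSING))[copy]) for locus in loci
            ]
            lines.append("\t".join([sample] + alleles))
    return "\n".join(lines) + "\n"


# Toy genotypes (repeat numbers) for illustration only.
toy_genotypes = {
    "ind1": {"locA": (12, 14), "locB": (-9, -9)},
    "ind2": {"locA": (11, 11)},
}
structure_text = to_structure(toy_genotypes, ["locA", "locB"])
print(structure_text)
```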

Sincerely, Mike

marcelTBI commented 4 years ago

Yeah, getting data into a universal format is quite important for researchers, and it is not done at all in Dante (we were focusing on the overall html report for clinicians). I will leave this issue open and try to resolve it in future versions of Dante.

Btw: thanks for letting us know how Dante is being used and for pointing out the STRUCTURE analysis tool and file format - I had not heard of it before.

jbudis / dante

Post-Stutter Filtering Question #6