Open dhoogest opened 2 years ago
Example accession NZ_CP056776_2309782_2311321, which a user saw in NGS16S validation as lacking 'closest type' info:
dhoogest@naga:$ grep NZ_CP056776_2309782_2311321 /molmicro/common/ncbi/16s/output/20211004/dedup/1200bp/named/seqs.fasta | wc -l
0
There are no records for this allele in the 'named' set; it's a duplicate of another allele in the NZ_CP056776 genome.
Examining the 'trusted' set confirms that the record is present in seqs.fasta (which would also be the target for the BLAST db used in the pipeline output where the bug was detected).
dhoogest@naga:$ grep NZ_CP056776_2309782_2311321 /molmicro/common/ncbi/16s/output/20211004/dedup/1200bp/named/filtered/trusted/seqs.fasta | wc -l
1
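To see how many records are affected beyond this one accession, the two grep checks above could be generalized to list every sequence ID that is in the 'trusted' set but missing from the 'named' set. This is only a rough sketch: `fasta_ids_only_in` is a hypothetical helper (not part of ya16sdb), and it assumes the sequence ID is the first whitespace-delimited token on each FASTA header line.

```shell
# Hypothetical helper, not part of ya16sdb.
# Usage: fasta_ids_only_in A.fasta B.fasta
# Prints each sequence ID found in A.fasta that does not appear in B.fasta.
# Assumes the ID is the first whitespace-delimited token after '>'.
fasta_ids_only_in() {
    awk '
        /^>/ {
            id = substr($1, 2)              # strip the leading ">"
            if (FILENAME == ARGV[1]) {
                seen[id] = 1                # first file given to awk: B
            } else if (!(id in seen)) {
                print id                    # in A but not in B
                seen[id] = 1                # report each ID only once
            }
        }
    ' "$2" "$1"
}

# Example call, using the paths from this issue:
# base=/molmicro/common/ncbi/16s/output/20211004/dedup/1200bp/named
# fasta_ids_only_in "$base/filtered/trusted/seqs.fasta" "$base/seqs.fasta" | wc -l
```

Any ID it prints is in the same situation as NZ_CP056776_2309782_2311321: present in the trusted set but absent from the named set.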
I believe this is fixed, right?
Looks like there's a bit of a circular issue emerging from the interplay between the definition of the 'named' .fasta set (which has duplicate records within a genome dropped, https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L496) and the logic which adds all type strain records back to the 'trusted' .fasta output (and BLAST db). The outcome of this is that the trusted BLAST db contains dropped duplicate alleles for some seqs within is_type genomes, and these records lack info about the nearest type strain, since the 'named' fa is used as a target in https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L737
Possible solutions:
The third option seems easiest implementation-wise.