nhoffman / ya16sdb

A curated subset of 16S rRNA sequences from NCBI

Records dropped as duplicates but then 'added back' as types are absent from named_type_hits output #45

Open dhoogest opened 2 years ago

dhoogest commented 2 years ago

It looks like a circular issue is emerging from the interplay between the definition of the named .fasta set (in which duplicate records within a genome are dropped, see https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L496) and the logic that adds all type strain records back to the 'trusted' .fasta output (and BLAST db). As a result, the trusted BLAST db contains dropped duplicate alleles for some seqs within is_type genomes, and these records lack info about the nearest type strain, since the named fa is used as the target in https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L737
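
To make the interplay concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not the actual SConstruct logic: the record IDs, field names, and the `drop_duplicates_within_genome` helper are all hypothetical, and the real 'trusted' set is also filtered, which this ignores.

```python
# Toy records; every ID and field name here is made up for illustration.
records = [
    {"seqname": "G1_a", "genome": "G1", "seq": "ACGT", "is_type": True},
    {"seqname": "G1_b", "genome": "G1", "seq": "ACGT", "is_type": True},  # duplicate allele of G1_a
    {"seqname": "G2_a", "genome": "G2", "seq": "ACGA", "is_type": False},
]


def drop_duplicates_within_genome(recs):
    """Keep the first record for each (genome, sequence) pair."""
    seen = set()
    kept = []
    for rec in recs:
        key = (rec["genome"], rec["seq"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# 'named' set: duplicates within a genome are dropped.
named_ids = {rec["seqname"] for rec in drop_duplicates_within_genome(records)}

# Type hits are computed only for records in the named set (placeholder values).
type_hits = {name: "nearest-type placeholder" for name in named_ids}

# 'trusted' set: the named records plus ALL type strain records added back.
trusted_ids = named_ids | {rec["seqname"] for rec in records if rec["is_type"]}

# Records that end up trusted but have no type hit information.
missing_type_info = sorted(trusted_ids - set(type_hits))
print(missing_type_info)
```

In this toy case the duplicate allele `G1_b` is dropped from the named set, re-enters through the type strain step, and so has no entry in `type_hits`.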

Possible solutions:

The third option seems easiest implementation-wise.

dhoogest commented 2 years ago

Example: accession NZ_CP056776_2309782_2311321, which a user observed in NGS16S validation to be lacking 'closest type' info:

dhoogest@naga:$ grep NZ_CP056776_2309782_2311321 /molmicro/common/ncbi/16s/output/20211004/dedup/1200bp/named/seqs.fasta | wc -l 
0

No records for this allele in the 'named' set; it's a duplicate of another allele in the NZ_CP056776 genome.

Examining the 'trusted' set confirms that the record is present in its seqs.fasta (which would also be the target for the BLAST db used in the pipeline output where the bug was detected).

dhoogest@naga:$ grep NZ_CP056776_2309782_2311321 /molmicro/common/ncbi/16s/output/20211004/dedup/1200bp/named/filtered/trusted/seqs.fasta | wc -l
1
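
The same check can be run for every record at once rather than one accession at a time. The sketch below is only an illustration: `fasta_ids` and `added_back` are hypothetical helper names, and it assumes the sequence ID is the first whitespace-delimited token on each FASTA header line.

```python
def fasta_ids(path):
    """Return the set of sequence IDs found on the '>' header lines of a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip a bare '>' line with no ID
                    ids.add(fields[0])
    return ids


def added_back(named_fasta, trusted_fasta):
    """List IDs present in the trusted FASTA but absent from the named FASTA."""
    return sorted(fasta_ids(trusted_fasta) - fasta_ids(named_fasta))
```

Calling `added_back` with the 'named' and 'trusted' seqs.fasta paths from the grep commands above should list every record in the same situation as NZ_CP056776_2309782_2311321.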

crosenth commented 1 year ago

I believe this is fixed, right?