erelior commented 4 years ago

Hi! I created a custom bacteria database and imported some FASTA files into it, which seemed to work:

referenceseeker_db init --db DB
referenceseeker_db import --db DB --genome "$file" --status complete --organism "$organism" -t $taxid

(I got a "Successfully imported genome" message for all of them.)

But running referenceseeker on files (including ones contained in the DB) came back empty:

referenceseeker -v DB e_coli.fna

returned only the header:

ID Mash Distance ANI Con. DNA Taxonomy ID Assembly Status Organism

Setting the ANI and conserved DNA thresholds to 0 returned a Mash distance, but the ANI and conserved DNA values were both 0:

referenceseeker -v -a 0 -c 0 DB e_coli.fna

ID           Mash Distance  ANI   Con. DNA  Taxonomy ID  Assembly Status  Organism
NC_002695.2  0.01899        0.00  0.00      83334        complete         Escerichia coli

(Changing the crg value gave the same result.)

I can't figure out why the ANI and conserved DNA values are 0 for FASTA files that I know are identical or similar to reference genomes in the DB.

I am using a Bioconda installation, version 1.6.

I attached one of the FASTA files I imported into the DB, along with the DB files: referenceseeker_issue.zip

Thanks

oschwengers commented 4 years ago

Hi @erelior, thanks for reaching out and providing the files. I had a short look into it and apparently nucmer returns an empty output. So I took a deep dive into your data and noticed that the genomic content in your NC_000117.1.fna file is duplicated:

grep '>' NC_000117.1.fna 
>NC_000117.1 Chlamydia trachomatis D/UW-3/CX chromosome, complete genome
>NC_000117.1 Chlamydia trachomatis D/UW-3/CX chromosome, complete genome

After cleaning your genome, nucmer is able to align its sequence as expected.

Could you please check all your DB FASTA files for these duplicated records, fix them, and rebuild the database? Please let me know if this helps.
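For many files, the check could be scripted. This is only a minimal sketch and not part of ReferenceSeeker: the function names `find_duplicate_fasta_ids` and `dedup_fasta` are hypothetical, and the cleanup assumes that records sharing an ID are true duplicates, so it keeps the first one without comparing sequences.

```shell
# Hypothetical helpers, not part of ReferenceSeeker.

# Print "<file><TAB><id>" for every sequence ID that occurs more than once
# in the header lines of the given FASTA files.
find_duplicate_fasta_ids() {
  for fasta in "$@"; do
    grep '^>' "$fasta" | awk '{ print substr($1, 2) }' | sort | uniq -d |
      while read -r id; do
        printf '%s\t%s\n' "$fasta" "$id"
      done
  done
}

# Write a copy of a FASTA file to stdout, keeping only the first record
# for each ID. Assumption: later records with the same ID are duplicates.
dedup_fasta() {
  awk '/^>/ { keep = !seen[$1]++ } keep' "$1"
}
```

For example, `find_duplicate_fasta_ids *.fna` lists the affected files, and `dedup_fasta NC_000117.1.fna > NC_000117.1.clean.fna` writes a cleaned copy (the output file name is just an example). The database still has to be rebuilt from the cleaned files afterwards.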

erelior commented 4 years ago

Worked like a charm! @oschwengers thank you so much!

oschwengers commented 4 years ago

Thanks for the feedback. You're welcome!

oschwengers / referenceseeker

Empty results for a custom built DB #9
