o-william-white / skim2mito

A snakemake pipeline for the batch assembly, annotation, and phylogenetic analysis of mitochondrial genomes from genome skims
MIT License

Which genbank database is needed? #8

Closed rvosa closed 1 year ago

rvosa commented 1 year ago

We have a GenBank nr release, timestamped 2022-03-16, that we use as a BLAST database. Will that do? Or should we have a reference dataset, maybe on Zenodo?

o-william-white commented 1 year ago

That will be fine in the meantime; it's just to get a rough idea of the taxonomy of each sequence so we can identify obvious contaminants. Ultimately, yes, a reference database would be a good idea. The NCBI nt database is large and not all users will have access to it.
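To illustrate the "rough idea of taxonomy" step, here is a minimal Python sketch that keeps the top-scoring hit per contig from BLAST tabular output. It assumes the 12 standard columns of `-outfmt 6`; the function name `best_hits` and the example rows are made up for illustration and are not taken from the pipeline.

```python
import csv
import io

# The 12 standard columns of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text):
    """Return {query_id: row}, keeping the highest-bitscore hit per query."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Made-up rows for illustration only, not real pipeline output.
example = (
    "contig1\tsubjectA\t98.5\t1500\t22\t1\t1\t1500\t10\t1509\t0.0\t2650\n"
    "contig1\tsubjectB\t91.0\t900\t80\t2\t1\t900\t5\t904\t1e-120\t1100\n"
    "contig2\tsubjectC\t99.9\t800\t1\t0\t1\t800\t1\t800\t0.0\t1470\n"
)
hits = best_hits(example)
for contig, row in hits.items():
    print(contig, row["sseqid"], row["pident"])
```

The subject of each best hit can then be looked up taxonomically; a contig whose best hit falls well outside the expected group is a candidate contaminant.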

o-william-white commented 1 year ago

> nr release

As it stands, the script requires an nt database. I think nr is protein. Do you have an nt database available?

If not, we could edit your Snakefile to run a blastx search instead.

rvosa commented 1 year ago

Sorry, nr here just means non-redundant. It's nucleotide, so we can run blastn on it as in the Snakefile. I think we'll just try it out; it might teach us something about reference data management.
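For anyone else unsure whether a local database is nucleotide or protein, one quick check is the file extensions next to the database prefix: BLAST nucleotide databases use files such as `.nsq` (or a `.nal` alias), and protein databases use `.psq` (or `.pal`). The sketch below assumes that layout; `guess_blast_db_type` and `pick_program` are hypothetical helper names, not part of the Snakefile.

```python
import glob
import os
import tempfile


def _has_files(prefix, letter):
    """True if sequence or alias files of the given type exist for prefix."""
    base = glob.escape(prefix)
    patterns = (
        f"{base}.{letter}sq",         # single-volume database
        f"{base}.{letter}al",         # alias file for a multi-volume database
        f"{base}.[0-9]*.{letter}sq",  # numbered volumes, e.g. nt.00.nsq
    )
    return any(glob.glob(pattern) for pattern in patterns)


def guess_blast_db_type(prefix):
    """Return "nucl", "prot", or None if no database files are found."""
    if _has_files(prefix, "n"):
        return "nucl"
    if _has_files(prefix, "p"):
        return "prot"
    return None


def pick_program(db_type):
    """Contigs are nucleotide queries: blastn searches a nucleotide
    database, blastx translates the query and searches a protein one."""
    return {"nucl": "blastn", "prot": "blastx"}[db_type]


# Demo with empty placeholder files standing in for a real database.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "nr")
    for name in ("nr.nal", "nr.00.nsq"):
        open(os.path.join(tmp, name), "w").close()
    demo_type = guess_blast_db_type(prefix)
    demo_program = pick_program(demo_type)
    print(demo_type, demo_program)
```

In the demo, a database named `nr` is still reported as nucleotide because the check looks at file types, not at the name.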

o-william-white commented 1 year ago

Perfect, let me know how it goes! Thanks for all the feedback so far!

o-william-white commented 1 year ago

Hi @rvosa, I updated the pipeline so it now downloads smaller BLAST databases specific to mitochondrial and ribosomal sequences (https://zenodo.org/records/8424777). Hopefully this helps users who do not already have access to the nt or nr databases. Hope you are well.
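On the reference data management side, a downloaded archive can be checked against the checksum published with the record before it is used. This is only a sketch and not how the pipeline itself does the download: it assumes the record metadata is a dictionary whose `files` entries carry `key`, `links.self`, and a checksum string of the form `md5:<hex digest>`. The record below is fabricated for the example, with a placeholder file name and URL.

```python
import hashlib


def file_links(record):
    """Map file name -> (download URL, checksum) from a record dictionary.

    The dictionary layout is an assumption about the record metadata.
    """
    return {
        entry["key"]: (entry["links"]["self"], entry["checksum"])
        for entry in record.get("files", [])
    }


def checksum_matches(data, checksum):
    """Check bytes against a checksum string such as "md5:<hex digest>"."""
    algorithm, _, expected = checksum.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected


# Fabricated record and payload for illustration only.
payload = b"placeholder database archive"
fake_record = {
    "files": [
        {
            "key": "example_db.tar.gz",
            "links": {"self": "https://example.org/example_db.tar.gz"},
            "checksum": "md5:" + hashlib.md5(payload).hexdigest(),
        }
    ]
}

url, checksum = file_links(fake_record)["example_db.tar.gz"]
ok = checksum_matches(payload, checksum)
print(url, ok)
```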