Closed rvosa closed 1 year ago
That will be fine in the meantime; it's just to get a rough idea of the taxonomy of the sequence to identify obvious contaminants. Ultimately, yes, a reference database would be a good idea. The NCBI nt database is large and not all users will have access to it.
nr release
As it stands, the script requires an nt database. I think nr is protein. Do you have an nt database available?
If not, we could edit your Snakefile to run a blastx search instead?
nr here refers to non-redundant, sorry. It's nucleotide, so we can run blastn on it like in the Snakefile. I think we'll just try it out; it might teach us something about reference data management.
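For anyone trying the same thing, here is a rough sketch of how a blastn call against a local nucleotide database could be assembled in Python. It is not taken from the Snakefile: the function name, file paths, e-value cutoff, and hit limit are all placeholder assumptions, and the `staxids` column is only populated if the database was built with taxonomy information.

```python
import shlex


def build_blastn_command(query_fasta, db_path, out_tsv,
                         evalue=1e-5, max_target_seqs=5):
    """Return a blastn argument list for a quick taxonomy/contaminant screen.

    All defaults are illustrative assumptions, not the pipeline's settings.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db_path,  # prefix of a local nucleotide BLAST database
        # Tabular output; staxids needs taxonomy data in the database.
        "-outfmt", "6 qseqid sseqid pident length evalue bitscore staxids",
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        "-out", out_tsv,
    ]


if __name__ == "__main__":
    # Hypothetical paths; print the command rather than running it.
    cmd = build_blastn_command("contigs.fasta", "/data/blast/nt", "hits.tsv")
    print(shlex.join(cmd))
```

Printing the command first makes it easy to check the database path before passing the list to `subprocess.run`.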
Perfect, let me know how it goes! Thanks for all the feedback so far!
Hi @rvosa, I updated the pipeline so it now downloads smaller BLAST databases specific to mitochondrial and ribosomal sequences: https://zenodo.org/records/8424777. Hopefully this helps if users do not already have access to nt or nr databases. Hope you are well.
We have a GenBank nr release that we timestamped 2022-03-16 and that we use as a BLAST database. Will that do? Or should we have a reference data set, maybe on Zenodo?