oschwengers / referenceseeker

Rapid determination of appropriate reference genomes.
https://doi.org/10.21105/joss.01994
GNU General Public License v3.0

Pre-built databases for other RefSeq sections #18

Closed Benjamin-Lee closed 3 years ago

Benjamin-Lee commented 3 years ago

For example, I'm interested in analyzing 1KP transcriptome data. It would be really nice to find the nearest reference genome for each plant transcriptome using this tool.

It doesn't appear that all of the RefSeq sections are represented among the pre-built databases available for download. Is this something that can be done? What about for all of the genomes in RefSeq? On a related note, is there a straightforward way to enter multiple genomes in a single FASTA file at once?

oschwengers commented 3 years ago

Dear @Benjamin-Lee, thanks for your question. We designed this tool to find suitable reference genomes for microbial genomes because, for some species, there are hundreds or even thousands of good references in the public databases, and selecting the most appropriate one can be a non-trivial task.

So far, I have no experience with either plant genomes or transcriptome data, so I cannot offer more than a general thought and an educated guess. Having said that, in principle it should not be a problem to apply this tool to eukaryotic genomes. The workflow of the tool is based solely on DNA sequences and relies on nothing unique to microbial genomes. Also, the build scripts could easily be adapted to download and build, for instance, a RefSeq-based plant DB.

However, I currently cannot see an advantage as there are not so many plant genomes stored in the current RefSeq release. Shouldn't the plant reference genome be known in advance? But as I said, I have no experience with that sort of data and I might be missing something important here. Could you elaborate a little bit more on this use case or point to an example data set?

Regarding your second question: currently, no such feature is implemented, so the tool must be executed once per query genome, each as a distinct analysis. v1.7 will be the last feature release, as we're already working on a v2.0 that will cover this feature. However, it might still take quite a while to release this new version.
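Until then, one possible workaround is to split the multi-FASTA file into one file per record and run the tool once per file. The following is only a minimal sketch, not part of the tool. It assumes that every FASTA record is one complete query genome (which does not hold for multi-contig assemblies) and that the command line takes the form `referenceseeker <database> <genome>`; please verify that with `referenceseeker --help`. The function names, the file naming scheme, and the `plant-db` path are hypothetical.

```python
import tempfile
from pathlib import Path


def split_multi_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file and return the paths.

    Assumption: one record equals one query genome.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    with open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                # Name the file after the first word of the header (hypothetical scheme).
                fields = line[1:].split()
                record_id = fields[0] if fields else "record"
                safe_id = "".join(c if c.isalnum() or c in "._-" else "_" for c in record_id)
                path = out_dir / f"{len(paths) + 1:04d}_{safe_id}.fasta"
                handle = open(path, "w")
                paths.append(path)
            # Lines before the first header are skipped.
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return paths


def build_commands(database_dir, genome_paths):
    """Build one command per genome. Assumed CLI shape: referenceseeker <database> <genome>."""
    return [["referenceseeker", str(database_dir), str(p)] for p in genome_paths]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        multi = Path(tmp) / "queries.fasta"
        multi.write_text(">genome_A first\nACGT\nACGT\n>genome_B second\nTTGA\n")
        genomes = split_multi_fasta(multi, Path(tmp) / "split")
        for cmd in build_commands("plant-db", genomes):
            # Each command could be passed to subprocess.run(cmd, check=True).
            print(" ".join(cmd))
```

Each resulting command is an independent analysis, so the runs can also be distributed over several cores or cluster jobs if needed.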

oschwengers commented 3 years ago

@Benjamin-Lee, gentle ping: is this still active? Otherwise, I'd close this for now.