pirovc / genome_updater

Bash script to download/update snapshots of files from NCBI genomes repository (refseq/genbank) with track of changes and without redundancy
MIT License

Feature request: allow wildcard filtering based on assembly name #75

Open jdwinkler-lanzatech opened 2 years ago

jdwinkler-lanzatech commented 2 years ago

Hi,

I was wondering if it would be possible to provide a filtering option based on assembly (species/assigned) name? I often want to pull a group of microbes with a shared metabolic capability (say, methanogenesis), but currently I have to pick out the TaxIDs manually to do so. Not a major problem, but the feature might be useful for other people too!

pirovc commented 2 years ago

Hi, thanks for the suggestion. genome_updater selects and filters data based on the assembly_summary.txt file provided by NCBI (more info: https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt). Besides the filter parameters, the -F option allows custom filtering for data selection. However, I'm not sure the information you refer to is contained in that file.

jdwinkler-lanzatech commented 2 years ago

Column 8 (organism_name) would be the target, I think. I believe the -F option currently requires an exact match though, so I am thinking of another flag that basically uses grep behind the scenes to implement the matching. I'd basically want to grab all the assemblies with an organism name matching "methano*", if that makes sense. It obviously would not be perfect, but it could be handy if you have a specific enough search string.

pirovc commented 2 years ago

Partial matching should be doable; I will mark it as an enhancement. For now, one can download the full assembly_summary.txt from genbank or refseq, apply the filter/grep manually, and use the resulting file as an external assembly_summary.txt (param. -e).
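
A minimal sketch of that workaround, assuming the organism name sits in column 8 of the tab-separated assembly_summary.txt as discussed above. The function name `filter_by_organism` and the file names are hypothetical and not part of genome_updater. The pattern is treated as a case-insensitive regular expression, so the glob-like "methano*" from the request becomes `^methano`. Header lines starting with `#` are kept so the output still resembles the original file; whether genome_updater needs them in an external file is not confirmed here.

```shell
#!/usr/bin/env bash

# filter_by_organism PATTERN FILE
# Print the '#' header lines of FILE plus every row whose column 8
# (organism_name) matches PATTERN, compared case-insensitively.
# Hypothetical helper, not part of genome_updater.
filter_by_organism() {
  local pattern="$1" file="$2"
  awk -F '\t' -v pat="$pattern" '
    /^#/ { print; next }                  # keep header/comment lines
    tolower($8) ~ tolower(pat) { print }  # partial match on organism name
  ' "$file"
}

# Example usage (commented out because it needs network access):
#   curl -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
#   filter_by_organism '^methano' assembly_summary_refseq.txt > methano_summary.txt
#   # then pass methano_summary.txt via -e, together with your usual options
```

Because the match is purely textual, it picks up any organism whose name starts with "methano" and misses methanogens named otherwise, so the resulting file is worth a quick check before downloading.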

jdwinkler-lanzatech commented 2 years ago

Great, thanks! I figure it is a logical addition to the custom filtering offered by -F already.