merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Adding GTDB v214.1 #2105

Closed ivagljiva closed 1 year ago

ivagljiva commented 1 year ago

GTDB v214 was requested by one of our users on Discord, so I've added it to our available GTDB versions (and made it the default). Since none of the SCG names appear to have changed on the GTDB end of things, this was fairly straightforward.

The one annoying caveat is that GTDB slightly changed the structure of their representative sequence archives so that the FASTA files are now contained inside an inner folder called 'individual'. To make it compatible with the code for previous versions, I moved the FASTAs one directory level up:

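if self.ctx.target_database_release == 'v214.1':
    # v214.1 archives nest the FASTA files in an inner 'individual' folder
    inner_path = os.path.join(self.ctx.msa_individual_genes_dir_path, 'individual')
    for file in glob.glob(inner_path + '/*.faa'):
        # move each FASTA one directory level up, where older releases keep them
        shutil.move(file, self.ctx.msa_individual_genes_dir_path)
    os.rmdir(inner_path)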

In hindsight, it would perhaps have been much simpler to append the inner directory to the self.ctx.msa_individual_genes_dir_path variable and move on. 🤔 OH WELL. I will happily change it if even one person says 'that sounds better' to me. :)
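For reference, that simpler alternative could look roughly like the sketch below. `resolve_fasta_dir` is a hypothetical helper, not an existing anvi'o function, and its `genes_dir` and `release` arguments stand in for `self.ctx.msa_individual_genes_dir_path` and `self.ctx.target_database_release`:

```python
import os

def resolve_fasta_dir(genes_dir, release):
    """Return the directory expected to hold the per-gene FASTA files.

    Hypothetical helper for illustration only; not part of anvi'o.
    """
    if release == 'v214.1':
        # v214.1 archives nest the FASTAs in an inner 'individual' folder
        return os.path.join(genes_dir, 'individual')
    # older releases keep the FASTAs directly in genes_dir
    return genes_dir
```

Nothing is moved on disk with this approach; downstream code would simply read from the returned path.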

Regardless, it will be a bit annoying if the next release of GTDB also has this new archive structure, because then we will have to remember to update our if statement. Formatting inconsistencies are the mosquitoes of data download: typically harmless, but they make your life just a little bit worse. 🙃
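One way to avoid touching the if statement for future releases might be to check for the inner folder itself instead of the release name. This is only a sketch under assumptions: `flatten_individual_dir` is a hypothetical function (not in anvi'o), and it assumes the inner folder is always called 'individual' and contains nothing but `.faa` files.

```python
import glob
import os
import shutil

def flatten_individual_dir(genes_dir):
    """Move FASTAs out of an inner 'individual' folder, if one exists.

    Hypothetical helper for illustration only; not part of anvi'o.
    Returns the number of files moved (0 for the older, flat layout).
    """
    inner_path = os.path.join(genes_dir, 'individual')
    if not os.path.isdir(inner_path):
        return 0  # older archive layout, nothing to do

    fasta_paths = glob.glob(os.path.join(inner_path, '*.faa'))
    for fasta_path in fasta_paths:
        # moving into an existing directory keeps the original file name
        shutil.move(fasta_path, genes_dir)

    # os.rmdir only removes empty directories, so this raises OSError
    # if the folder held anything besides the moved FASTAs
    os.rmdir(inner_path)
    return len(fasta_paths)
```

Because the check is on the directory rather than the version string, the same call would work unchanged for old archives and for any later release that keeps the nested layout.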

meren commented 1 year ago

thank you so much for this, @ivagljiva 🎉