mattb112885 / clusterDbAnalysis

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes

Feature request: Ability to run convertGenbank2table.py on multiple genomes #77

Status: Closed (ninjatacoshell closed this issue 8 years ago)

ninjatacoshell commented 8 years ago

Currently you must run convertGenbank2table.py on each genome, one at a time:

convertGenbank2table.py -g 11111.1.gbk -v 1

This is time-consuming, especially if you want to run ITEP on hundreds of genomes. When I try

convertGenbank2table.py -g *.1.gbk -v 1

it runs on only one of the .gbk files and ignores the rest. If convertGenbank2table.py recognized the * wildcard, it would save a lot of time. This needn't cause problems with genomes that share a TaxID: they simply need different version numbers, and each version can then be converted in its own batch:

convertGenbank2table.py -g *.2.gbk -v 2

Full disclosure: I'm running the VM version of ITEP and I had to reinstall Biopython to even get convertGenbank2table.py to run.

mattb112885 commented 8 years ago

You could try writing a for loop in Bash to do it for you. I don't have a Linux box in front of me, but it would be something like this:

for file in *.1.gbk; do convertGenbank2Table.py -g "$file" -v 1; done

Matt

ninjatacoshell commented 8 years ago

Thanks for your response. I tried the for loop you suggested. It returned the following error:

IOError: [Errno 2] No such file or directory: '*.1.gbk'

I'm a novice when it comes to BASH, so any other suggestions would be much appreciated. And automating it, rather than making the user write a for loop, would still be preferred.

mattb112885 commented 8 years ago

Hello,

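You would have to run it from the directory containing your GenBank files. If nothing in the current directory matches *.1.gbk, the shell passes the pattern through as literal text, which is what produces that error. You may also have to fully qualify the path to the Python script, i.e. instead of convertGenbank2Table.py use "python [path to your convertGenbank2Table.py file]".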
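The loop can also be wrapped in a small shell function that takes the directory as an argument and reports when nothing matches. This is only a sketch, not part of ITEP: the function name convert_all and the CONVERTER variable are made up for illustration, and it assumes nothing about the script beyond the -g and -v options already used in this thread.

```shell
#!/bin/sh
# Hypothetical helper, not part of ITEP.
# CONVERTER is an assumption: override it if the script is not on your PATH,
# e.g. CONVERTER="python /path/to/convertGenbank2Table.py"
CONVERTER=${CONVERTER:-convertGenbank2Table.py}

# convert_all DIR VERSION
# Runs the converter once for every DIR/*.VERSION.gbk file.
convert_all() {
    dir=$1
    version=$2
    found=0
    for file in "$dir"/*."$version".gbk; do
        # With no match the shell leaves the pattern as literal text,
        # so skip anything that is not an existing file.
        [ -e "$file" ] || continue
        found=1
        # $CONVERTER is deliberately unquoted so that a value such as
        # "python /path/to/script.py" splits into command and argument.
        $CONVERTER -g "$file" -v "$version" || return 1
    done
    if [ "$found" -eq 0 ]; then
        echo "No *.$version.gbk files found in $dir" >&2
        return 1
    fi
}
```

For example, `convert_all /path/to/genbank_files 1` would process every *.1.gbk file in that directory and stop at the first failure. Because CONVERTER is split on whitespace, this sketch will not work if the script path itself contains spaces.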

To be honest, if you aren't comfortable with Bash, you may have difficulty using ITEP, since it is designed to interact with Bash and other Linux tools to build workflows (and it may also blow up your computer with a few hundred genomes). Have you tried using LS-BSR or BPGA? I haven't used them, but I do know they're faster.

Best

Matt

ninjatacoshell commented 8 years ago

That did it. (I actually figured this out on my own but forgot to come back and mention it.)

As for my use of the VM, running setup_step1.sh on twenty-three 5-Mb genomes took several days, so I definitely won't be trying to scale up on the VM. Besides the BLAST steps, are there any other steps that could 'blow up' my computer? If BLAST (and maybe MCL) are the only concerns, how difficult would it be to run setup_step1.sh and setup_step2.sh on a supercomputing cluster and then transfer the database to the VM on my laptop for the remainder of the analyses?