Command line tool for adding promoter annotations to genbank? - Githubissues

nikolaichik / SigmoID

A Xojo/python tool for identification and annotation of transcription factor binding sites in bacterial genomes

GNU General Public License v3.0

Command line tool for adding promoter annotations to genbank? #20

Open Buuntu opened 5 years ago

Buuntu commented 5 years ago

Sorry if I missed something, but is there a command line tool for updating an existing annotation with promoter sequences? Say I have a Genbank annotation through NCBI's annotation pipeline and I want to add promoter regions.

I looked in the Python folder and didn't find a master script; it looks like those are just meant to convert the output of MAST, NHMMER, etc. to Genbank? Would I have to run each of these separately? Is there not a script that will run all of them together?

nikolaichik commented 5 years ago

Sorry, I somehow missed the notification about your comment. The script you need is HmmGen.py. Please note that it requires an nhmmer search result as input. Running it with -h describes the options.

Buuntu commented 5 years ago

@nikolaichik Is the nhmmer search something that has to be run separately, or is there a script for that? And does that leave out the other tools that SigmoID uses, such as MAST?

nikolaichik commented 5 years ago

Yes, nhmmer is run separately like this: nhmmer --dna --max --nonull2 --cut_ga --tblout nhmmer.table RegR.hmm genome.gb

The results in table format are then fed into the script like this: HmmGen.py nhmmer.table genome.gb genome_annotated.gb -d -S 10.5 -i -b 50 -L 20 -p -n -f protein_bind -q bound_moiety#RegR inference#profile:nhmmer:3.2.1

A similar script, MastGen.py, is used with MAST output. We rarely use MAST because of an elusive bug: occasional, seemingly random line breaks in MAST output that hamper its proper processing.
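
For anyone scripting this, the two steps above can be chained from Python. This is only a sketch: the option lists are copied verbatim from the commands in this thread, the helper names (build_nhmmer_cmd, build_hmmgen_cmd, annotate) are made up for illustration, and it assumes both nhmmer and HmmGen.py are executable and on the PATH. Check HmmGen.py -h for what each option does before relying on it.

```python
import subprocess


def build_nhmmer_cmd(hmm, genome, table="nhmmer.table"):
    """Return the nhmmer command from the thread as an argument list."""
    return ["nhmmer", "--dna", "--max", "--nonull2", "--cut_ga",
            "--tblout", table, hmm, genome]


def build_hmmgen_cmd(table, genome, annotated, score="10.5", moiety="RegR"):
    """Return the HmmGen.py command from the thread as an argument list.

    The options are reproduced as posted; only the score and the
    bound_moiety value are exposed as (hypothetical) parameters.
    """
    return ["HmmGen.py", table, genome, annotated,
            "-d", "-S", score, "-i", "-b", "50", "-L", "20", "-p", "-n",
            "-f", "protein_bind",
            "-q", f"bound_moiety#{moiety}", "inference#profile:nhmmer:3.2.1"]


def annotate(hmm, genome, annotated, table="nhmmer.table"):
    """Run nhmmer, then HmmGen.py; stop at the first failing step."""
    for cmd in (build_nhmmer_cmd(hmm, genome, table),
                build_hmmgen_cmd(table, genome, annotated)):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here; call annotate(...) to actually run them.
    print(" ".join(build_nhmmer_cmd("RegR.hmm", "genome.gb")))
    print(" ".join(build_hmmgen_cmd("nhmmer.table", "genome.gb",
                                    "genome_annotated.gb")))
```

Passing argument lists to subprocess.run (rather than one shell string) avoids the shell treating the # in the qualifier values as the start of a comment.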

Buuntu commented 5 years ago

@nikolaichik sorry for another dumb question but how do I generate the RegR.hmm file? Is that just an HMM for a specific gene?

nikolaichik commented 5 years ago

Yes, a calibrated hmm with properly set cutoffs (RegR is a fictitious name). Non-calibrated models (or just a fasta file with aligned TFBSs) could be used, in which case you drop --cut_ga but have to think about setting a cutoff via another switch (-T). Of course, the same cutoff should be used with HmmGen.py (-S option).
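
One way to keep the two cutoffs in sync is to build both commands from a single value. A rough sketch, assuming -T simply takes the place of --cut_ga in the nhmmer call as described above; the function name, the file names and the cutoff of 8.0 are placeholders, not recommendations, and the remaining HmmGen.py options are copied unchanged from the earlier command.

```python
def commands_with_shared_cutoff(model, genome, annotated, cutoff,
                                table="nhmmer.table"):
    """Build nhmmer and HmmGen.py argument lists that share one score cutoff.

    `model` may be a non-calibrated profile; --cut_ga is therefore dropped
    and the same threshold is passed to nhmmer (-T) and HmmGen.py (-S).
    """
    score = str(cutoff)
    nhmmer_cmd = ["nhmmer", "--dna", "--max", "--nonull2",
                  "-T", score, "--tblout", table, model, genome]
    hmmgen_cmd = ["HmmGen.py", table, genome, annotated,
                  "-d", "-S", score, "-i", "-b", "50", "-L", "20", "-p", "-n",
                  "-f", "protein_bind",
                  "-q", "bound_moiety#RegR", "inference#profile:nhmmer:3.2.1"]
    return nhmmer_cmd, hmmgen_cmd


if __name__ == "__main__":
    # Placeholder inputs: an uncalibrated model and an arbitrary cutoff.
    for cmd in commands_with_shared_cutoff("RegR.hmm", "genome.gb",
                                           "genome_annotated.gb", 8.0):
        print(" ".join(cmd))
```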

Buuntu commented 5 years ago

@nikolaichik Okay so there is not a list of TFBS sequences in calibrated HMMs and/or FASTA/TFBS alignment files that can be found somewhere for bacteria? I guess I can do the alignment with a list of TFBS sequences, but I have to generate the TFBS list myself?

nikolaichik commented 5 years ago

The largest collection is RegPrecise (http://regprecise.lbl.gov/RegPrecise/), with over a thousand inferred TFBSs. PRODORIC2 and CollecTF are much smaller and less convenient collections, but with experimentally characterised TFBSs. RegulonDB is the best for E. coli. (SigmoID has interfaces to RegPrecise, RegulonDB and CollecTF that let you import the data from these DBs and perform searches straight away, if you don't mind using the GUI.)

Buuntu commented 5 years ago

@nikolaichik Okay so I found the taxonomy I want to use and list of transcription factors from RegPrecise. Does it matter how I generate the alignment? ClustalW? I couldn't find an alignment option inside of HMMER so it looks like this has to be done separately as well.

When I run it through the GUI, does it automatically look at all of the RegPrecise HMMs? All I have to do is load the genome and run "scan genome"? I didn't see an option to specify which taxonomy of HMMs to use.

With the GUI, I get an error because I have a genbank with many features already. I also tried to run the scan on the fasta file equivalent but got an error about it being malformed (with two accessions in the fasta file).

nikolaichik commented 5 years ago

RegPrecise has aligned TFBS sequences for every TF (in fasta format).

"Scan genome" scans the genome only with the profiles that are currently selected in the preferences. There is another way to do what you probably want. It requires a pre-release version of SigmoID (try 2.0dr2 from the "Releases" page). First, you need to export RegPrecise data. This can be done via the menu Regulon --> RegPrecise TF Families, then the "Export Selected as .sig..." button. This results in a folder with (very roughly) automatically calibrated profiles. You can then select this folder in the preferences (Settings button in the toolbar of the main menu). The folder selected here is the one used by "scan genome".

SigmoID was designed to work with genomes in GenBank format as you can't have genome annotation in fasta format. Scanning fasta files with nhmmer will work, but using the HmmGen script afterwards is meaningless, as it processes and modifies GenBank files only. Hence the error you are seeing.
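
A quick check before calling HmmGen.py can catch this kind of mix-up early. The sketch below is not part of SigmoID; the function names are invented for illustration, and it relies only on the fact that GenBank flat files begin with a LOCUS line while fasta records begin with ">".

```python
def sniff_lines(lines):
    """Guess the format from the first non-blank line of an iterable of lines."""
    for line in lines:
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith(">"):
            return "fasta"
        return "unknown"
    return "unknown"  # empty input


def sniff_file(path):
    """Open a file and guess whether it is GenBank or fasta."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sniff_lines(handle)


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        kind = sniff_file(name)
        note = "" if kind == "genbank" else " (HmmGen.py expects GenBank input)"
        print(f"{name}: {kind}{note}")
```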