Open Buuntu opened 5 years ago
Sorry, I somehow missed the notification about your comment. The script you need is HmmGen.py. Please note that it requires an nhmmer search result as input; run it with -h to see the options.
@nikolaichik Is the nhmmer search something that has to be done separately, or is there a script for that? And does that leave out the other tools that SigmoID uses, such as MAST?
yes, nhmmer is run separately like this: nhmmer --dna --max --nonull2 --cut_ga --tblout nhmmer.table RegR.hmm genome.gb
results in the table format are then fed into the script like this: HmmGen.py nhmmer.table genome.gb genome_annotated.gb -d -S 10.5 -i -b 50 -L 20 -p -n -f protein_bind -q bound_moiety#RegR inference#profile:nhmmer:3.2.1
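The two steps above can be chained from a small wrapper. This is only a minimal sketch and not part of SigmoID: the function names are hypothetical, the file names are the placeholders from the commands above, the flags are copied verbatim from those commands, and it assumes that both nhmmer and HmmGen.py are executable and on the PATH.

```python
import shlex
import subprocess


def nhmmer_cmd(hmm, genome, table="nhmmer.table"):
    """Build the nhmmer command quoted above (flags copied verbatim)."""
    return ["nhmmer", "--dna", "--max", "--nonull2", "--cut_ga",
            "--tblout", table, hmm, genome]


def hmmgen_cmd(table, genome, annotated, cutoff="10.5", moiety="RegR"):
    """Build the HmmGen.py command quoted above (flags copied verbatim)."""
    return ["HmmGen.py", table, genome, annotated,
            "-d", "-S", cutoff, "-i", "-b", "50", "-L", "20", "-p", "-n",
            "-f", "protein_bind",
            "-q", f"bound_moiety#{moiety}", "inference#profile:nhmmer:3.2.1"]


def annotate(hmm, genome, annotated, table="nhmmer.table"):
    """Run the search, then the annotation step; stop if either one fails."""
    subprocess.run(nhmmer_cmd(hmm, genome, table), check=True)
    subprocess.run(hmmgen_cmd(table, genome, annotated), check=True)


# Show the commands without running them.
print(shlex.join(nhmmer_cmd("RegR.hmm", "genome.gb")))
print(shlex.join(hmmgen_cmd("nhmmer.table", "genome.gb", "genome_annotated.gb")))
```

Calling annotate("RegR.hmm", "genome.gb", "genome_annotated.gb") would run both steps in order; looping over several .hmm files is left out because how repeated annotation passes should be combined is not described here.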
A similar script, MastGen.py, is used with MAST output. We rarely use MAST because of an elusive bug: occasional, seemingly random line breaks in MAST output hamper its proper processing.
@nikolaichik sorry for another dumb question but how do I generate the RegR.hmm file? Is that just an HMM for a specific gene?
Yes, a calibrated HMM with properly set cutoffs (RegR is a fictitious name). Non-calibrated models (or just a FASTA file with aligned TFBSs) can be used too, in which case you drop --cut_ga but have to think about setting a cutoff via another switch (-T). Of course, the same cutoff should be used with HmmGen.py (the -S option).
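For the non-calibrated case, the point is that one score cutoff goes to both tools. A minimal sketch, assuming a hypothetical helper name and showing only the cutoff-related HmmGen.py option (the other HmmGen.py switches from the earlier command are omitted here, and the cutoff value itself has to be chosen by the user):

```python
def uncalibrated_cmds(model, genome, annotated, cutoff, table="nhmmer.table"):
    """Build both commands for a model without GA cutoffs.

    --cut_ga is dropped and the same score is passed to nhmmer (-T)
    and to HmmGen.py (-S), so the two steps cannot drift apart.
    """
    score = str(cutoff)
    nhmmer = ["nhmmer", "--dna", "--max", "--nonull2", "-T", score,
              "--tblout", table, model, genome]
    hmmgen = ["HmmGen.py", table, genome, annotated, "-S", score]
    return nhmmer, hmmgen


# 8.0 is an arbitrary illustrative value, not a recommended cutoff.
nhmmer, hmmgen = uncalibrated_cmds("RegR.hmm", "genome.gb",
                                   "genome_annotated.gb", 8.0)
print(" ".join(nhmmer))
print(" ".join(hmmgen))
```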
@nikolaichik Okay so there is not a list of TFBS sequences in calibrated HMMs and/or FASTA/TFBS alignment files that can be found somewhere for bacteria? I guess I can do the alignment with a list of TFBS sequences, but I have to generate the TFBS list myself?
The largest collection is RegPrecise (http://regprecise.lbl.gov/RegPrecise/), with over a thousand inferred TFBSs. PRODORIC2 and CollecTF are much smaller and less convenient collections, but they contain experimentally characterised TFBSs. RegulonDB is the best for E. coli. (SigmoID has interfaces to RegPrecise, RegulonDB and CollecTF that allow you to import data from these DBs and run searches straight away, if you don't mind using the GUI.)
@nikolaichik Okay so I found the taxonomy I want to use and list of transcription factors from RegPrecise. Does it matter how I generate the alignment? ClustalW? I couldn't find an alignment option inside of HMMER so it looks like this has to be done separately as well.
When I run it through the GUI, does it automatically look at all of the RegPrecise HMMs? All I have to do is load the genome and run "scan genome"? I didn't see an option to specify which taxonomy of HMMs to use.
With the GUI, I get an error because I have a genbank with many features already. I also tried to run the scan on the fasta file equivalent but got an error about it being malformed (with two accessions in the fasta file).
RegPrecise has aligned TFBS seqs for every TF (in FASTA format).
"Scan genome" scans the genome only with the profiles that are currently selected in the preferences. There is another way to do what you probably want. It requires a pre-release version of SigmoID (try 2.0dr2 from the "Releases" page). First, you need to export the RegPrecise data. This can be done via the menu Regulon --> RegPrecise TF Families, then the "Export Selected as .sig..." button. This results in a folder with (very roughly) automatically calibrated profiles. You can then select this folder in the preferences (Settings button in the toolbar of the main menu). The folder selected there is the one used by "scan genome".
SigmoID was designed to work with genomes in GenBank format as you can't have genome annotation in fasta format. Scanning fasta files with nhmmer will work, but using the HmmGen script afterwards is meaningless, as it processes and modifies GenBank files only. Hence the error you are seeing.
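If only hit coordinates are needed for a FASTA-only genome, the nhmmer table can be read directly, without HmmGen.py. A minimal sketch, assuming the 16-column --tblout layout documented for nhmmer in HMMER 3.1 and later (target, accession, query, accession, hmmfrom, hmm to, alifrom, ali to, envfrom, env to, sq len, strand, E-value, score, bias, description); the function name and the sample rows are made up for illustration:

```python
def parse_nhmmer_tblout(lines, min_score=None):
    """Return hits from nhmmer --tblout lines, optionally filtered by score."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split(maxsplit=15)  # the last column (description) may contain spaces
        hit = {
            "target": f[0],
            "query": f[2],
            "start": int(f[6]),   # alifrom
            "end": int(f[7]),     # ali to (smaller than start on the minus strand)
            "strand": f[11],
            "evalue": float(f[12]),
            "score": float(f[13]),
        }
        if min_score is None or hit["score"] >= min_score:
            hits.append(hit)
    return hits


# Made-up rows in the assumed layout, for illustration only.
sample = [
    "# target name  accession  query name ...",
    "seq1 - RegR - 1 20 1500 1519 1495 1524 5000000 + 1.2e-05 15.3 0.1 -",
    "seq1 - RegR - 1 20 3019 3000 3024 2995 5000000 - 0.003 8.0 0.0 -",
]
for hit in parse_nhmmer_tblout(sample, min_score=10.5):
    print(hit["target"], hit["start"], hit["end"], hit["strand"], hit["score"])
```

With a real table, the lines would come from open("nhmmer.table").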
Sorry if I missed something, but is there a command line tool for updating an existing annotation with promoter sequences? Say I have a Genbank annotation through NCBI's annotation pipeline and I want to add promoter regions.
I looked in the Python folder and didn't find a master script. It looks like those are just meant to convert the output of MAST, NHMMER, etc. to GenBank? Would I have to run each of these separately? Is there not a script that will run all of them together?