Open nick-youngblut opened 1 year ago
Hi Nick, I had the same initial problems. But the page where you download the databases (https://gutsmash.bioinformatics.nl/download.html) lists [prodigal](https://github.com/hyattpd/Prodigal) (version 2.6.1 tested). Did you try that?
I am using version 2.6.3 for prodigal, but there's not much of a difference between 2.6.1 and 2.6.3.
@mmpust do you pre-generate the genbank files (with prodigal or otherwise) and provide those to gutsmash, or just provide genome fasta files?
I would rather use genbank files, so that there is no redundant gene calling among various steps in my pipeline (e.g., using prodigal output for many downstream analyses).
I agree. Using prodigal 2.6.1 or 2.6.3 should not make a difference. I am providing FASTA files now.
The GBK files generated by gutSMASH have sequences, see:
LOCUS 886e493278 1827 bp DNA UNK 01-JAN-1980
DEFINITION 886e493278.
ACCESSION 886e493278
VERSION 886e493278
KEYWORDS .
SOURCE
ORGANISM .
.
COMMENT ##gutSMASH-Data-START##
Version :: 1.0.0-1555cd7(changed)
Run date :: 2023-05-10 19:25:05
##gutSMASH-Data-END##
##gutSMASH-Data-START##
Version :: 1.0.0-1555cd7(changed)
Run date :: 2023-05-19 19:32:03
##gutSMASH-Data-END##
FEATURES Location/Qualifiers
CDS 1..1827
/locus_tag="ctg1_1"
/transl_table=11
/translation="MLRVYHSNRLDVLEALMEFIVERERLDDPFEPEMILVQSTGMAQW
LQMTLSQKFGIAANIDFPLPASFIWDMFVRVLPEIPKESAFNKQSMSWKLMTLLPQLLE
REDFTLLRHYLTDDSDKRKLFQLSSKAADLFDQYLVYRPDWLAQWETGHLVEGLGEAQA
WQAPLWKALVEYTDELGQPRWHRANLYQRFIETLESATTCPPGLPSRVFICGISALPPV
YLQALQALGKHIEIHLLFTNPCRYYWGDIKDPAYLAKLLTRQRRHSFEDRELPLFRDSE
NAGQLFNSDGEQDVGNPLLASWGKLGRDYIYLLSDLESSQELDAFVDVTPDNLLHNIQS
DILELENRAVAGVNIEEFSRSDNKRPLDPLDSSITFHVCHSPQREVEVLHDRLLAMLEE
DPTLTPRDIIVMVADIDSYSPFIQAVFGSAPADRYLPYAISDRRARQSHPVLEAFISLL
SLPDSRFVSEDVLALLDVPVLAARFDITEEGLRYLRQWVNESGIRWGIDDDNVRELELP
ATGQHTWRFGLTRMLLGYAMESAQGEWQSVLPYDESSGLIAELVGHLASLLMQLNIWRR
GLRRSVRWKSGCRFVAICSTPFSCRMRKPKRR"
ORIGIN
1 atgttaaggg tctaccattc caatcgtctg gacgtgctgg aagcgttgat ggagtttatt
61 gtcgaacgcg aacgactgga cgatcctttc gaaccagaga tgattctggt gcaaagtact
121 ggtatggcac agtggctgca aatgaccctg tcgcaaaagt ttggtattgc ggcaaacatt
181 gattttccgc tgccagcgag ctttatctgg gatatgttcg tccgggtatt accggagatc
241 cccaaagaga gcgcctttaa caaacagagc atgagctgga aactgatgac tctgctgccg
301 caactgttgg agcgcgaaga ctttaccctg ttgcggcatt atctgactga cgatagtgac
361 aagcgaaaac tgttccagct ttcttcaaaa gcggcggacc tgtttgacca gtatctggtc
421 tatcgtccgg actggctggc acagtgggaa acaggacatc tggtagaagg gttgggagaa
481 gcacaggcct ggcaagcgcc gttgtggaag gcgttggtgg aatataccga cgaacttggg
541 caaccgcgct ggcaccgcgc caatctctat cagcgcttta tcgaaacgct ggagtccgcg
601 acgacctgcc cgccggggtt accttcgcgc gtctttatat gcggtatttc cgcgttaccg
661 cctgtttatc tccaggcgct acaggcgctg ggtaaacata ttgaaatcca tctcctgttt
721 accaacccct gccgttatta ctggggcgac attaaagatc cagcttatct ggcgaaacta
781 ctgactcgcc agcgccgaca cagttttgaa gatcgcgaat taccgctatt tcgcgacagc
841 gaaaatgccg ggcagctctt taacagcgat ggtgaacagg atgtcggcaa cccgctgctg
901 gcttcatggg gcaagcttgg gcgcgactac atttatctcc tttctgacct ggagagcagc
961 caggagctgg acgcttttgt cgatgtgacg ccagataacc tgctgcataa tattcagtct
1021 gacattctgg aactggaaaa ccgcgccgtt gctggtgtga acatcgaaga gttttcccgt
1081 agcgataaca aacgcccgct tgatccactg gatagcagta tcaccttcca cgtttgccat
1141 agcccgcagc gtgaagttga agttttacac gatcgcctgc tggcgatgct ggaggaagac
1201 ccgacactta ctccgcgcga catcatcgtg atggtggctg atatcgacag ctacagtccg
1261 tttattcagg ctgtgtttgg tagcgcacct gcggatcgtt acctgcctta cgccatttcc
1321 gaccgtcggg cgcggcagtc gcatcctgta cttgaagcgt ttatcagcct gttatcgctg
1381 ccagacagcc gctttgtgtc ggaagacgtg ctggcattac tggatgtgcc ggtgctggca
1441 gcgcggtttg acatcaccga agaagggctg cgttatttac gtcagtgggt caacgaatcc
1501 ggaattcgtt gggggataga tgacgacaac gttcgcgagc tggaacttcc cgctaccggt
1561 caacacacct ggcggtttgg cctgacgcgc atgttgctgg gctacgcgat ggagagcgcg
1621 cagggcgagt ggcaatcggt tctaccttat gatgaatcga gcggcttaat tgcagaactg
1681 gtggggcatc tggcttcact gctaatgcag ctaaatatct ggcgtcgcgg gctgcgcagg
1741 agcgtccgct ggaagagtgg ttgccggttt gtcgcgatat gctcaacgcc tttttcctgc
1801 cggatgcgga aaccgaagcg gcgatga
Do you specify prodigal's output format directly with one of these options?
-f, --output_format: Specify output format.
gbk: Genbank-like format (Default)
gff: GFF format
sqn: Sequin feature table format
sco: Simple coordinate output
I would assume that the default is already correct to get sequences. However, at first glance, in antismash/detection/genefinding/run_prodigal.py (line 36) they seem to choose the simple coordinate output by default, but maybe it's overridden later on?
# run prodigal
prodigal = [path.join(basedir, 'prodigal')]
prodigal.extend(['-i', fasta_file, '-f', 'sco', '-o', result_file])
if options.genefinding_tool == "prodigal-m" or len(record.seq) < 20000:
    prodigal.extend(['-p', 'meta'])
> they seem to choose the simple coordinate output
I saw that too. It appears that the sco output is actually used by gutsmash, so I don't see why that isn't an acceptable input -- besides the fact that gutsmash just follows the cli of antismash.
Yes! But then, the gutSMASH output has full gbk format, not sco (see my preview above). That's why I thought the prodigal output format may be re-specified from sco to gbk somewhere in the downstream process but I did not check.
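One possible explanation, offered as an assumption and not checked against the gutSMASH source: the sco coordinates are only used to place CDS features on the input record, and the final GBK is written from that record together with the original FASTA sequence. A minimal sketch of the first half of that idea, parsing sco gene lines of the form `>index_start_end_strand` into coordinates. The `Gene` type, the `parse_sco` helper, and the example text are hypothetical and for illustration only:

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    index: int
    start: int   # 1-based start coordinate as printed in the sco line
    end: int     # 1-based end coordinate as printed in the sco line
    strand: int  # +1 for '+', -1 for '-'


def parse_sco(text: str) -> List[Gene]:
    """Parse simple coordinate output (sco) text into gene coordinates.

    Assumes gene lines look like '>1_337_2799_+' and that every other
    line (for example '#' header lines) can be skipped.
    """
    genes = []
    for line in text.splitlines():
        if not line.startswith(">"):
            continue
        index, start, end, strand = line[1:].strip().split("_")
        genes.append(
            Gene(int(index), int(start), int(end), 1 if strand == "+" else -1)
        )
    return genes


# Made-up sco text for illustration; the gene matches the CDS 1..1827
# shown in the GBK preview above.
example = """# Sequence Data: seqnum=1;seqlen=1827
>1_1_1827_+
"""
genes = parse_sco(example)
print(genes)
```

Coordinates parsed this way could then be attached to a record that still holds the contig sequence, which would explain why the final output is full GBK even though prodigal itself was run with `-f sco`.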
It appears that gutSMASH requires a genbank that includes sequence data, given that if one provides a genbank lacking sequences (the default output from prodigal), one gets the error:
Which versions of prodigal actually work with gutSMASH? I'm assuming one can use at least some version of prodigal to generate the input, given that prodigal is a dependency of gutSMASH.
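A quick way to check whether a GenBank file carries sequence data before handing it to gutSMASH is to look for nucleotide lines under an `ORIGIN` section. The sketch below uses plain text scanning and makes no claim about how gutSMASH itself validates its input; `genbank_has_sequence` and both example strings are hypothetical:

```python
def genbank_has_sequence(text: str) -> bool:
    """Return True if any record has nucleotide lines under ORIGIN."""
    in_origin = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORIGIN"):
            in_origin = True
            continue
        if stripped == "//":
            # '//' terminates a record, so any ORIGIN section ends here.
            in_origin = False
            continue
        if in_origin and stripped:
            parts = stripped.split()
            # Sequence lines are a position number followed by base blocks.
            if parts[0].isdigit() and len(parts) > 1:
                return True
    return False


# Illustrative snippets, not real tool output.
with_seq = "LOCUS x\nORIGIN\n        1 atgttaaggg tctaccattc\n//\n"
without_seq = (
    "DEFINITION  seqnum=1;seqlen=1827\n"
    "FEATURES             Location/Qualifiers\n"
    "     CDS             1..1827\n"
    "//\n"
)
print(genbank_has_sequence(with_seq), genbank_has_sequence(without_seq))
```

If the check returns False for a prodigal-generated file, the features would need to be merged with the genome FASTA into a complete GenBank record first, or the FASTA can be passed directly as mentioned above.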