victoriapascal / gutsmash

gutSMASH
GNU Affero General Public License v3.0

no valid records found in file genome.gbk #12

Open nick-youngblut opened 1 year ago

nick-youngblut commented 1 year ago

It appears that gutSMASH requires a GenBank file that includes sequence data: if one provides a GenBank file lacking sequences (the default output from prodigal), one gets the error:

INFO     23/05 00:07:20   diamond using executable: /opt/conda/bin/diamond (0.9.19)
  INFO     23/05 00:07:20   hmmpfam2 using executable: /opt/conda/bin/hmmpfam2 (2.3.2)
  INFO     23/05 00:07:20   fasttree using executable: /opt/conda/bin/fasttree
  INFO     23/05 00:07:20   hmmsearch using executable: /opt/conda/bin/hmmsearch (3.3.2)
  INFO     23/05 00:07:20   hmmpress using executable: /opt/conda/bin/hmmpress (3.3.2)
  INFO     23/05 00:07:20   hmmscan using executable: /opt/conda/bin/hmmscan (3.3.2)
  INFO     23/05 00:07:20   glimmerhmm using executable: /opt/conda/bin/glimmerhmm
  INFO     23/05 00:07:20   prodigal using executable: /opt/conda/bin/prodigal (V2.6.3)
  INFO     23/05 00:07:20   muscle using executable: /opt/conda/bin/muscle (v3.8.1551)
  INFO     23/05 00:07:20   blastp using executable: /opt/conda/bin/blastp (2.13.0+)
  INFO     23/05 00:07:20   makeblastdb using executable: /opt/conda/bin/makeblastdb (2.13.0+)
  INFO     23/05 00:07:20   /tmp/gutsmash/antismash/detection/gut_hmm_detection/data/bgc_seeds.hmm components missing or obsolete, re-pressing database
  INFO     23/05 00:07:21   Parsing input sequence 'genome.gbk'
  ERROR    23/05 00:07:21   no valid records found in file genome.gbk

Which versions of prodigal actually work with gutSMASH? I'm assuming one can use at least some version of prodigal to generate the input, given that prodigal is a dependency of gutSMASH.
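A quick way to test the hypothesis above is to check whether each record in the file has a LOCUS line and any bases under ORIGIN. The following is a minimal sketch using only the standard library; `summarize_genbank` is a hypothetical helper, not part of gutSMASH or antiSMASH, and it assumes that the missing sequence data is what triggers the "no valid records" error.

```python
import re

def summarize_genbank(text):
    """Return one dict per '//'-terminated record in GenBank-like text.

    Hypothetical helper: reports whether a LOCUS line is present and
    how many bases are listed under ORIGIN.
    """
    summaries = []
    for chunk in re.split(r"^//[ \t]*$", text, flags=re.MULTILINE):
        if not chunk.strip():
            continue  # ignore whitespace after the last terminator
        has_locus = re.search(r"^LOCUS\s", chunk, flags=re.MULTILINE) is not None
        origin = re.search(r"^ORIGIN.*$", chunk, flags=re.MULTILINE)
        bases = 0
        if origin:
            for line in chunk[origin.end():].splitlines():
                # Sequence lines look like "       61 gtcgaacgcg aacgactgga"
                match = re.match(r"\s*\d+\s+([A-Za-z ]+)$", line)
                if match:
                    bases += len(match.group(1).replace(" ", ""))
        summaries.append({"has_locus": has_locus, "bases": bases})
    return summaries
```

A record reporting `bases == 0` has annotations but no sequence, which matches the situation described here; running the helper on `open("genome.gbk").read()` before starting gutSMASH would show whether that is the case.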

mmpust commented 1 year ago

Hi Nick, I had the same initial problems. But on the page where you download the databases, it says [prodigal](https://github.com/hyattpd/Prodigal) (version 2.6.1 tested) [https://gutsmash.bioinformatics.nl/download.html]. Did you try that?

nick-youngblut commented 1 year ago

I am using version 2.6.3 for prodigal, but there's not much of a difference between 2.6.1 and 2.6.3.

@mmpust do you pre-generate the genbank files (with prodigal or otherwise) and provide those to gutsmash, or just provide genome fasta files?

I would rather use genbank files, so that there is no redundant gene calling among various steps in my pipeline (e.g., using prodigal output for many downstream analyses).

mmpust commented 1 year ago

I agree. prodigal 2.6.1 or 2.6.3 should not make a difference. I am providing FASTA files now.
The GBK files generated by gutSMASH have sequences, see:

LOCUS       886e493278              1827 bp    DNA              UNK 01-JAN-1980
DEFINITION  886e493278.
ACCESSION   886e493278
VERSION     886e493278
KEYWORDS    .
SOURCE      
  ORGANISM  .
            .
COMMENT     ##gutSMASH-Data-START##
            Version  :: 1.0.0-1555cd7(changed)
            Run date :: 2023-05-10 19:25:05
            ##gutSMASH-Data-END##
            ##gutSMASH-Data-START##
            Version      :: 1.0.0-1555cd7(changed)
            Run date     :: 2023-05-19 19:32:03
            ##gutSMASH-Data-END##
FEATURES             Location/Qualifiers
     CDS             1..1827
                     /locus_tag="ctg1_1"
                     /transl_table=11
                     /translation="MLRVYHSNRLDVLEALMEFIVERERLDDPFEPEMILVQSTGMAQW
                     LQMTLSQKFGIAANIDFPLPASFIWDMFVRVLPEIPKESAFNKQSMSWKLMTLLPQLLE
                     REDFTLLRHYLTDDSDKRKLFQLSSKAADLFDQYLVYRPDWLAQWETGHLVEGLGEAQA
                     WQAPLWKALVEYTDELGQPRWHRANLYQRFIETLESATTCPPGLPSRVFICGISALPPV
                     YLQALQALGKHIEIHLLFTNPCRYYWGDIKDPAYLAKLLTRQRRHSFEDRELPLFRDSE
                     NAGQLFNSDGEQDVGNPLLASWGKLGRDYIYLLSDLESSQELDAFVDVTPDNLLHNIQS
                     DILELENRAVAGVNIEEFSRSDNKRPLDPLDSSITFHVCHSPQREVEVLHDRLLAMLEE
                     DPTLTPRDIIVMVADIDSYSPFIQAVFGSAPADRYLPYAISDRRARQSHPVLEAFISLL
                     SLPDSRFVSEDVLALLDVPVLAARFDITEEGLRYLRQWVNESGIRWGIDDDNVRELELP
                     ATGQHTWRFGLTRMLLGYAMESAQGEWQSVLPYDESSGLIAELVGHLASLLMQLNIWRR
                     GLRRSVRWKSGCRFVAICSTPFSCRMRKPKRR"
ORIGIN
        1 atgttaaggg tctaccattc caatcgtctg gacgtgctgg aagcgttgat ggagtttatt
       61 gtcgaacgcg aacgactgga cgatcctttc gaaccagaga tgattctggt gcaaagtact
      121 ggtatggcac agtggctgca aatgaccctg tcgcaaaagt ttggtattgc ggcaaacatt
      181 gattttccgc tgccagcgag ctttatctgg gatatgttcg tccgggtatt accggagatc
      241 cccaaagaga gcgcctttaa caaacagagc atgagctgga aactgatgac tctgctgccg
      301 caactgttgg agcgcgaaga ctttaccctg ttgcggcatt atctgactga cgatagtgac
      361 aagcgaaaac tgttccagct ttcttcaaaa gcggcggacc tgtttgacca gtatctggtc
      421 tatcgtccgg actggctggc acagtgggaa acaggacatc tggtagaagg gttgggagaa
      481 gcacaggcct ggcaagcgcc gttgtggaag gcgttggtgg aatataccga cgaacttggg
      541 caaccgcgct ggcaccgcgc caatctctat cagcgcttta tcgaaacgct ggagtccgcg
      601 acgacctgcc cgccggggtt accttcgcgc gtctttatat gcggtatttc cgcgttaccg
      661 cctgtttatc tccaggcgct acaggcgctg ggtaaacata ttgaaatcca tctcctgttt
      721 accaacccct gccgttatta ctggggcgac attaaagatc cagcttatct ggcgaaacta
      781 ctgactcgcc agcgccgaca cagttttgaa gatcgcgaat taccgctatt tcgcgacagc
      841 gaaaatgccg ggcagctctt taacagcgat ggtgaacagg atgtcggcaa cccgctgctg
      901 gcttcatggg gcaagcttgg gcgcgactac atttatctcc tttctgacct ggagagcagc
      961 caggagctgg acgcttttgt cgatgtgacg ccagataacc tgctgcataa tattcagtct
     1021 gacattctgg aactggaaaa ccgcgccgtt gctggtgtga acatcgaaga gttttcccgt
     1081 agcgataaca aacgcccgct tgatccactg gatagcagta tcaccttcca cgtttgccat
     1141 agcccgcagc gtgaagttga agttttacac gatcgcctgc tggcgatgct ggaggaagac
     1201 ccgacactta ctccgcgcga catcatcgtg atggtggctg atatcgacag ctacagtccg
     1261 tttattcagg ctgtgtttgg tagcgcacct gcggatcgtt acctgcctta cgccatttcc
     1321 gaccgtcggg cgcggcagtc gcatcctgta cttgaagcgt ttatcagcct gttatcgctg
     1381 ccagacagcc gctttgtgtc ggaagacgtg ctggcattac tggatgtgcc ggtgctggca
     1441 gcgcggtttg acatcaccga agaagggctg cgttatttac gtcagtgggt caacgaatcc
     1501 ggaattcgtt gggggataga tgacgacaac gttcgcgagc tggaacttcc cgctaccggt
     1561 caacacacct ggcggtttgg cctgacgcgc atgttgctgg gctacgcgat ggagagcgcg
     1621 cagggcgagt ggcaatcggt tctaccttat gatgaatcga gcggcttaat tgcagaactg
     1681 gtggggcatc tggcttcact gctaatgcag ctaaatatct ggcgtcgcgg gctgcgcagg
     1741 agcgtccgct ggaagagtgg ttgccggttt gtcgcgatat gctcaacgcc tttttcctgc
     1801 cggatgcgga aaccgaagcg gcgatga

Do you set prodigal's output format explicitly with one of these options:

-f, --output_format:  Specify output format.
                          gbk:  Genbank-like format (Default)
                          gff:  GFF format
                          sqn:  Sequin feature table format
                          sco:  Simple coordinate output

I would assume that the default is already correct to get sequences. However, at first glance, antismash/detection/genefinding/run_prodigal.py (line 36) seems to choose the simple coordinate output by default, but maybe it's overridden later on?


# run prodigal
prodigal = [path.join(basedir, 'prodigal')]
prodigal.extend(['-i', fasta_file, '-f', 'sco', '-o', result_file])
if options.genefinding_tool == "prodigal-m" or len(record.seq) < 20000:
    prodigal.extend(['-p', 'meta'])

nick-youngblut commented 1 year ago

> they seem to choose the simple coordinate output

I saw that too. It appears that the sco output is actually used by gutsmash, so I don't see why that isn't an acceptable input -- besides the fact that gutsmash just follows the cli of antismash.
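For context, the simple coordinate output is easy to read back into gene coordinates. The sketch below assumes the layout that Prodigal 2.6.x appears to write, namely `#` header lines followed by one `>index_start_end_strand` line per gene; that layout is an assumption, and `parse_sco` is a hypothetical helper rather than the parser gutSMASH uses internally.

```python
def parse_sco(text):
    """Parse Prodigal simple-coordinate (sco) text into gene tuples.

    Assumed layout (not verified against every Prodigal version):
    '#' header lines, then lines such as '>1_337_2799_+'.
    Returns a list of (index, start, end, strand) tuples.
    """
    genes = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(">"):
            continue  # skip '#' headers and blank lines
        index, start, end, strand = line[1:].split("_")
        genes.append((int(index), int(start), int(end), strand))
    return genes
```

Coordinates alone say nothing about the nucleotide sequence, which may be why a FASTA input is needed alongside them.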

mmpust commented 1 year ago

Yes! But then, the gutSMASH output has full gbk format, not sco (see my preview above). That's why I thought the prodigal output format may be re-specified from sco to gbk somewhere in the downstream process but I did not check.
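If the goal is to keep pre-computed gene calls and still supply a GenBank file with sequence, the piece missing from the Prodigal output is the ORIGIN block. Below is a minimal sketch that formats one from a nucleotide string, following the layout of the preview above (60 bases per line in blocks of 10, with a right-aligned start position). `format_origin` is a hypothetical helper, and whether gutSMASH accepts a file assembled this way is untested here.

```python
def format_origin(seq):
    """Format a nucleotide string as a GenBank ORIGIN block.

    Hypothetical helper: 60 bases per line in blocks of 10, with the
    1-based start position right-aligned in a 9-character column,
    followed by the '//' record terminator.
    """
    seq = seq.lower()
    lines = ["ORIGIN"]
    for offset in range(0, len(seq), 60):
        row = seq[offset:offset + 60]
        blocks = [row[i:i + 10] for i in range(0, len(row), 10)]
        lines.append(f"{offset + 1:>9} " + " ".join(blocks))
    lines.append("//")
    return "\n".join(lines) + "\n"
```

A complete record would also need a LOCUS header and the feature table, so in practice a GenBank writer from an established library is likely the safer route.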