ncbi / egapx

Eukaryotic Genome Annotation Pipeline-External caller scripts and documentation

egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype #52

Open bbista opened 2 days ago

bbista commented 2 days ago

Hello, I am getting an error in the run_gnomon_biotype process. The example data ran successfully, but the error below appears when I use my own genome assembly.

Best, B

Error executing process > 'egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype'

Caused by:
  Process `egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype` terminated with an error exit status (3)

Command executed:

  mkdir -p output
  mkdir -p ./asncache/
  prime_cache -cache ./asncache/ -ifmt asnb-seq-entry  -i swissprot.asnb -oseq-ids spids -split-sequences
  prime_cache -cache ./asncache/ -ifmt asnb-seq-entry  -i gnomon_wnode.out -oseq-ids gnids -split-sequences
  lds2_indexer -source genome/ -db LDS2 
  echo "hits.diamond.asn" > raw_blastp_hits.mft
  merge_blastp_hits -asn-cache ./asncache/ -nogenbank -lds2 LDS2 -input-manifest raw_blastp_hits.mft -o prot_hits.asn
  echo "gnomon_wnode.out" > models.mft
  echo "prot_hits.asn" > prot_hits.mft
  echo "" > splices.mft
  if [ -z "" ]
  then
    gnomon_biotype  -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2  -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_hits prot_hits.mft -prot_splices splices.mft  -reftrack-server 'NONE' -allow_lt631 true
  else
    gnomon_biotype  -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2  -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_denylist  -prot_hits prot_hits.mft -prot_splices splices.mft  -reftrack-server 'NONE' -allow_lt631 true
  fi

Command exit status:
  3

Command output:
  (empty)

Command error:
  Prefetching 3705 bioseqs
  Prefetching 4481 bioseqs
  Prefetching 4147 bioseqs
  Prefetching 4607 bioseqs
  Prefetching 5144 bioseqs
  Prefetching 3745 bioseqs
  Prefetching 1598 bioseqs
  Prefetching 875 bioseqs
  Prefetching 1187 bioseqs
  Prefetching 8936 bioseqs
  Prefetching 9302 bioseqs
  Prefetching 9602 bioseqs
  Prefetching 9390 bioseqs
  Prefetching 9332 bioseqs
  Prefetching 9471 bioseqs
  Prefetching 633 bioseqs
  Second-pass: computing bestness scores

  Starting.
  Fetching Gnomon model data.
  Loading GC-Assembly.
  Taxon is invertebrate or plant - will allow more coding models
  Loading protein hits
  Skipped 6230 protein hits without corresponding CDS features
  Processed 213495 hits; accepted 85835; 18278 are RBPH
  Loading protein data.
  Retrieving attributes for 35534 prots
  Fetching next batch of 10000
  Fetching next batch of 10000
  Fetching next batch of 10000
  Creating classifier.

  Classifier internal state for EGAPx Test Assembly:
    0:  43036/25=1721.44    43036/49=878.286
    1:  156227/209=747.498  156227/482=324.122
  M=[10 39; 15 338]; PPV=0.384615; NPV=0.89418; ACC=0.863524

  Allowing locusType-631 models: true
  Initialized 10 patterns for attr_rule=538.
  Initialized 36 patterns for attr_rule=489.
  Initialized 6 patterns for attr_rule=989.
  Initialized 11 patterns for attr_rule=986.
  Initialized 6 patterns for attr_rule=987.
  Initialized 5 patterns for attr_rule=988.
  Outputting.
  Initialized 70 patterns for attr_rule=869.
  BPH to proks: 19.3156%
  Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
  Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178)

Work dir:
  /project/meisel/users/bbista/Software/egapx/scratch/0f/e49e8c482f512a5dc8ed2638ca37af

Container:
  /project/meisel/users/bbista/Software/egapx/scratch/ncbi-egapx-0.3.0-alpha.img

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

murphyte commented 2 days ago

It looks like your run triggered a safeguard we have in the code to catch genomes with substantial bacterial contamination. The fraction of prokaryote hits reported in your log (BPH to proks: 19.3156%) suggests your genome has on the order of 10 Mbp of contamination. I'd recommend screening your genome for contamination with FCS-GX (https://github.com/ncbi/fcs), removing the contaminant sequences, and rerunning.
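
To compare this figure before and after cleaning the assembly, the "BPH to proks" line can be pulled out of the process log. The following is a minimal sketch, not part of EGAPx: `bph_to_proks_pct` is a hypothetical helper name, and it assumes the stderr shown above was saved by Nextflow as `.command.err` in the process work dir and that the line keeps the format seen in this log.

```shell
# Hypothetical helper (not an EGAPx tool): print the percentage from the
# last "BPH to proks: N%" line in a log file; prints nothing if absent.
bph_to_proks_pct() {
  # $1: path to a log file, e.g. the .command.err in the process work dir
  sed -n 's/^[[:space:]]*BPH to proks:[[:space:]]*\([0-9.]*\)%.*$/\1/p' "$1" | tail -n 1
}
```

Called on the log from this issue (for example `bph_to_proks_pct .command.err` from inside the work dir), it would print 19.3156. The cutoff that triggers the safeguard is not stated in this thread, so the sketch only reports the value and does not judge it.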