ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] PGAP analysis generates all files except .aa and .gbk #276

Closed Chahrazadt87 closed 5 months ago

Chahrazadt87 commented 8 months ago


I have been having an issue with a couple of genomes where the results do not contain all of the output files, specifically the annot.aa and .gbk files. I have looked at the log files, and the only error I can see is below. Does anyone have an idea of how to fix this, please?

Error: error processing job: (CFileException::eFileIO) Error opening checkm dombtblout: /pgap/output/debug/tmpdir/m8jqspdl/checkm.1296702415621568dOBsj4/fasta_by_scaffold/checkm.out
terminate called after throwing an instance of 'ncbi::CException'
what(): NCBI C++ Exception:
Error: LIB(CException::eUnknown) "/export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/internal/gpipe/gpexec/queue/lib/wn_app.cpp", line 411: ncbi::CGPX_WorkerApp::Run() --- 1 jobs failed
Stack trace:
  /panfs/ /export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/internal/gpipe/gpexec/queue/lib/wn_app.cpp:409 ncbi::CGPX_WorkerApp::Run() offset=0x0 addr=0x7f098f4f229f
  /panfs/ :0 offset=0x0 addr=0x41a430
  /panfs/ /export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/corelib/ncbiapp.cpp:711 ncbi::CNcbiApplicationAPI::x_TryMain(ncbi::EAppDiagStream, char const, int, bool) offset=0x0 addr=0x7f097674a132
  /panfs/ /export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/corelib/ncbiapp.cpp:1023 ncbi::CNcbiApplicationAPI::AppMain(int, char const const, char const const, ncbi::EAppDiagStream, char const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) offset=0x0 addr=0x7f097674d78c
  /panfs/ :0 offset=0x0 addr=0x40bc82
  /usr/lib64/ :0 offset=0x0 addr=0x7f0975176554
  /panfs/ :0 offset=0x0 addr=0x40be59

azat-badretdin commented 8 months ago

Thank you for your report, user @Chahrazadt87 !

Could you please post a larger portion of the log? We would like to see an excerpt that ends with the first permanentFail and starts with the last line before it that contains INFO...... <some path>$ <application_name> \
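For anyone following along, the requested excerpt can be cut out of a large log with a few lines of Python. This is only a minimal sketch: the function name `extract_failure_excerpt` is made up for illustration, and it assumes that matching the plain substrings "INFO" and "permanentFail" is enough to find the two boundary lines.

```python
def extract_failure_excerpt(log_text: str) -> str:
    """Return the lines from the last INFO line before the first
    permanentFail line through that permanentFail line (inclusive)."""
    lines = log_text.splitlines()

    # Locate the first line that mentions permanentFail.
    fail_idx = next(
        (i for i, line in enumerate(lines) if "permanentFail" in line), None
    )
    if fail_idx is None:
        return ""  # nothing failed, so there is no excerpt to extract

    # Walk backwards from the failure to the nearest INFO line.
    # If none is found, fall back to the start of the log.
    start_idx = 0
    for i in range(fail_idx - 1, -1, -1):
        if "INFO" in lines[i]:
            start_idx = i
            break

    return "\n".join(lines[start_idx:fail_idx + 1])


# Example usage (file names are placeholders):
# text = open("cwltool.log", errors="replace").read()
# open("excerpt.txt", "w").write(extract_failure_excerpt(text))
```

The excerpt is usually far smaller than the full log, which helps when the complete file exceeds an attachment limit.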


azat-badretdin commented 8 months ago

And if you can post the whole log file, that would be even better!

Chahrazadt87 commented 8 months ago

Hi Azat,

Thank you for your response. I have attached the partial log file (too large to attach all of it). Please be aware that I only get the missing output files once I add --debug to the command.

Many thanks,
Chahrazad

cwltool.log

azat-badretdin commented 8 months ago

Thanks. It looks like you were trying to upload the whole log, but the initial portion of the file is still missing.

Chahrazadt87 commented 8 months ago

The file is too large to upload, I'm afraid. My file is 30 MB and the limit is 25 MB.
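One possible workaround for the size limit is to compress the log before attaching it, since plain-text logs usually shrink a lot. The sketch below uses only the Python standard library; the helper name `gzip_log` is hypothetical, and it assumes the issue tracker accepts compressed attachments.

```python
import gzip
import shutil
from pathlib import Path


def gzip_log(src: str) -> Path:
    """Write a gzip-compressed copy of `src` next to it and return its path."""
    src_path = Path(src)
    dst_path = src_path.with_name(src_path.name + ".gz")
    # Stream the file in chunks so a large log is never held in memory at once.
    with src_path.open("rb") as fin, gzip.open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst_path


# Example usage (path is a placeholder):
# print(gzip_log("cwltool.log"))
```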

azat-badretdin commented 8 months ago

Could you please post the command line?

Chahrazadt87 commented 8 months ago

Sure: ./ -r -o Documents/Halorubrum_SS5_8/SS5_8_results Documents/Halorubrum_SS5_8/SS5_8.yaml

azat-badretdin commented 8 months ago

I see that you are using the "old school" method of supplying user information via YAML. Could you please post that YAML file as well?


Chahrazadt87 commented 8 months ago

Please be aware that the following works for all other strains. Only a handful fail.

topology: 'circular'
location: 'chromosome'
organism:
    genus_species: 'Halorubrum sp. SS5-8'
    strain: 'my_strain'
contact_info:
    last_name: 'Warnecke'
    first_name: 'Tobias'
    email: ''
    organization: 'MRC London Institute of Medical Sciences'
    department: 'Molecular Systems Group'
    street: 'Du Cane Rd'
    city: 'London'
    postal_code: 'W12 0NN'
    state: 'Greater London'
    country: 'United Kingdom'
authors:

azat-badretdin commented 8 months ago

Thanks, Chahrazad!

I got the genome species:

$ gettax -dates 'Halorubrum sp. SS5-8'

    scientific name: Halorubrum sp. SS5-8
             tax id: 1089755
      parent tax id: 2642239
             gb_div: Bacteria
               rank: species
            lineage: Archaea; Euryarchaeota; Stenosarchaea group; Halobacteria;
                     Haloferacales; Haloferacaceae; Halorubrum
              id_gc: 11
            name_gc: Bacterial, Archaeal and Plant Plastid
             id_mgc: 0
           name_mgc: Unspecified
           crt_date: 2011/09/26 14:32:50
           upd_date: 2011/10/23 17:33:10
           pub_date: 2011/10/22 18:00:26

So we can rule out taxonomic novelty as a factor in this checkm failure. Another suspicious factor is that the organism is archaeal, but that did not turn out to be the culprit either: the checkm data, however old it is (2015), has plenty of that taxonomic lineage in its database.

Upon closer examination of the cwltool.log file you posted, I stumbled upon an error message that I had missed previously:

Process SyncManager-1:
Traceback (most recent call last):
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 315, in _bootstrap
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 583, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 591, in __init__
PermissionError: [Errno 1] Operation not permitted
Traceback (most recent call last):
  File "/root/venv/bin/checkm", line 856, in <module>
  File "/root/venv/lib/python3.9/site-packages/checkm/", line 992, in parseOptions
  File "/root/venv/lib/python3.9/site-packages/checkm/", line 326, in analyze
    binIdToModels = mgf.find(binFiles,
  File "/root/venv/lib/python3.9/site-packages/checkm/", line 68, in find
    binIdToModels = mp.Manager().dict()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 57, in Manager
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 558, in start
    self._address = reader.recv()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/", line 383, in _recv
    raise EOFError

I think we have enough information now to try to reproduce this on our side. I suppose you are using the latest version of the PGAP package, correct?
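The traceback above fails while `mp.Manager()` sets up its listener socket. On Unix-like systems the default listener for a manager is a Unix-domain socket created under the temporary directory, so one way to narrow things down is to check whether such a socket can be created in a given directory at all. The sketch below is a diagnostic under that assumption, not a confirmed explanation of this failure; the function name `can_bind_unix_socket` is hypothetical, and the traceback does not show which socket call actually raised the PermissionError.

```python
import os
import socket
import tempfile


def can_bind_unix_socket(directory=None) -> bool:
    """Try to create and bind a Unix-domain socket inside `directory`.

    Defaults to the directory Python uses for temporary files.
    Works only on Unix-like systems where socket.AF_UNIX exists.
    """
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, f"probe-{os.getpid()}.sock")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        sock.listen(1)
        return True
    except OSError as exc:  # PermissionError is a subclass of OSError
        print(f"bind failed in {directory}: {exc!r}")
        return False
    finally:
        sock.close()
        # Remove the socket file so the probe leaves nothing behind.
        if os.path.exists(path):
            os.unlink(path)


# Example usage: probe the default temporary directory.
# print(can_bind_unix_socket())
```

Running the probe from inside the container against the directory used as the temporary directory for the failing run would show whether socket creation is permitted there.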

Chahrazadt87 commented 8 months ago

Hi Azat,

Yes I am. Everything is up to date software wise.

Many thanks,


azat-badretdin commented 8 months ago

We will try to look at this ASAP.

Chahrazadt87 commented 8 months ago

Good morning Azat,

Any news on this issue please?

Many thanks,


azat-badretdin commented 8 months ago

We started working on this, Chahrazad.

Could you please tarball and post the contents of the directory */tmp-outdir/usg1w98j? Thanks!

azat-badretdin commented 8 months ago

I was not able to reproduce your results with the same species, Chahrazad. As input data I used a similar species from the same genus, and the error did not occur.

azat-badretdin commented 8 months ago

Is your input by any chance a single plasmid?

azat-badretdin commented 8 months ago

I ran a plasmid I had as well, standalone, and I was not able to reproduce the results.

azat-badretdin commented 8 months ago

Looking more into the cwltool.log file you posted, I can see that this is a full-blown assembly.

Could you please post head -50 cwltool.log output?

Chahrazadt87 commented 8 months ago

Hi Azat,

It is a whole genome assembly with a few contigs. Could you please explain what you mean by "head -50 cwltool.log"? I am new to this, so bear with me :)

azat-badretdin commented 8 months ago

head is a Unix command that prints only the specified number of lines from the start of a text file.
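For readers without a Unix shell at hand, roughly the same result can be obtained in Python. This is a small sketch of the idea rather than a replacement for the real tool; the function name `head` is just a convenience here.

```python
from itertools import islice


def head(path: str, n: int = 50) -> list[str]:
    """Return the first `n` lines of a text file, like `head -n`."""
    # islice stops after n lines, so the rest of a large log is never read.
    with open(path, "r", errors="replace") as handle:
        return list(islice(handle, n))


# Example usage (file names are placeholders):
# with open("head.50.txt", "w") as out:
#     out.writelines(head("cwltool.log", 50))
```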

azat-badretdin commented 8 months ago

Chahrazad, would you be willing to post the input genome FASTA file?

Chahrazadt87 commented 8 months ago

Sure, can you please give me your email so that I can send it to you?

azat-badretdin commented 8 months ago

It's our official email. The data stays there strictly on a need-to-know basis.

Chahrazadt87 commented 8 months ago

I just emailed you the genome :)

Thanks again for looking into this.

azat-badretdin commented 8 months ago

Thanks! Running it now...

azat-badretdin commented 8 months ago

Nope. Could not reproduce with exactly your input either. So, how about that head -50 cwltool.log output?

head -50 cwltool.log > head.50.txt

and attach it here?

Chahrazadt87 commented 7 months ago

Hi Azat,

Apologies for my late response. I have attached all that you need now.

Many thanks,

Chahrazad

head.50.txt

azat-badretdin commented 7 months ago

Thanks for the files, Chahrazad!

I see that you are running this on a Mac. Just noting this here for indexing purposes.

Could you please post output of

find . -name annotation.fa | xargs /bin/ls -ltr


Chahrazadt87 commented 7 months ago

-rw-r--r--@ 1 ct1221 staff 1306409 20 Nov 13:25 ./SS5_8_results/debug/tmpdir/m8jqspdl/checkm.1296702415621568dOBsj4/fasta_by_scaffold/bins-prot/annotation.fa
-rw-r--r--@ 1 ct1221 staff 1306409 20 Nov 13:25 ./SS5_8_results/debug/tmpdir/m8jqspdl/checkm.1296702415621568dOBsj4/fasta_by_scaffold/annotation.fa

azat-badretdin commented 7 months ago

Thanks, Chahrazad!

So, this is different from the PGAP-8585 case.

azat-badretdin commented 7 months ago

While we are scratching our heads, let me at least pass back to you what we successfully calculated in-house.

I am going to find out what our SOP is on this.

azat-badretdin commented 7 months ago

let me at least pass back to you what we successfully calculated in-house

Hi, Chahrazad! Could you please confirm that you got the results?

Another question: can you try to run standalone checkm on your input file debug/tmpdir/*/checkm.*/fasta_by_scaffold/bins-prot/annotation.fa:

mkdir -p bins-prot/
cp debug/tmpdir/*/checkm.*/fasta_by_scaffold/bins-prot/annotation.fa bins-prot/
checkm taxonomy_wf -t 1 -g -x fa genus Halorubrum bins-prot/ taxonomy_wf-prot/


Chahrazadt87 commented 7 months ago

Hi Azat,

I got this message when I ran it:

[2023-12-13 13:26:58] INFO: CheckM data: /Users/ct1221/.checkm
[2023-12-13 13:26:58] INFO: [CheckM - taxon_set] Generate taxonomic-specific marker set.

Unexpected error: <class 'FileNotFoundError'>
Traceback (most recent call last):
  File "/Users/ct1221/opt/anaconda3/bin/checkm", line 856, in
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/", line 991, in parseOptions
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/", line 293, in taxonSet
    bValidSet = taxonParser.markerSet(
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/", line 82, in markerSet
    taxonMarkerSets = self.readMarkerSets()
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/", line 40, in readMarkerSets
    for line in open(DefaultValues.TAXON_MARKER_SETS):
FileNotFoundError: [Errno 2] No such file or directory: '/Users/ct1221/.checkm/taxon_marker_sets.tsv'
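The FileNotFoundError above says that the local CheckM data directory has no taxon_marker_sets.tsv. A quick way to confirm this before re-running is to check the directory directly. The sketch below is illustrative only: the helper name `missing_checkm_files` is hypothetical, and the only required file it knows about is the one named in the traceback, so a complete data directory may well need more than this.

```python
from pathlib import Path


def missing_checkm_files(data_dir, required=("taxon_marker_sets.tsv",)):
    """Return the names from `required` that are absent from `data_dir`."""
    root = Path(data_dir).expanduser()
    return [name for name in required if not (root / name).is_file()]


# Example usage, with the data directory reported in the log above:
# print(missing_checkm_files("~/.checkm"))
```

An empty list means the file named in the error is present; anything else points to an incomplete local CheckM data setup rather than to the input genome.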

Please note, that sometimes the run works when I shut down everything or clear the cache. It’s not very consistent though so I’m still confused as to the reason this happens.

Kind regards,



azat-badretdin commented 7 months ago

Thanks. How do you run it? It looks to me like you are running it directly on your Mac, not inside a virtual machine or container. I would recommend fixing your local installation and running it again. The error you are getting right now is not related to our problem; it is a problem with your installation. The checkm installation that PGAP uses is inside the Docker container, and its data is stored elsewhere as well.