ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] PGAP analysis generates all files except .aa and .gbk #276

Closed Chahrazadt87 closed 5 months ago

Chahrazadt87 commented 8 months ago

Hello,

I have been having an issue with a couple of genomes where the results do not contain all of the output files, in particular the annot.aa and .gbk files. I have looked at the log files, and the only error I can see is below. Does anyone have an idea of how to fix this, please?

Error: error processing job: (CFileException::eFileIO) Error opening checkm dombtblout: /pgap/output/debug/tmpdir/m8jqspdl/checkm.1296702415621568dOBsj4/fasta_by_scaffold/checkm.out
terminate called after throwing an instance of 'ncbi::CException'
  what(): NCBI C++ Exception:
    Error: LIB(CException::eUnknown) "/export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/internal/gpipe/gpexec/queue/lib/wn_app.cpp", line 411: ncbi::CGPX_WorkerApp::Run() --- 1 jobs failed
Stack trace:
  /panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/bacterial_pipeline/system/2023-10-03.build7061/arch/x86_64/lib/libgpxlib.so /export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/internal/gpipe/gpexec/queue/lib/wn_app.cpp:409 ncbi::CGPX_WorkerApp::Run() offset=0x0 addr=0x7f098f4f229f
  /panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/bacterial_pipeline/system/2023-10-03.build7061/arch/x86_64/bin/checkm_wnode :0 offset=0x0 addr=0x41a430
  /panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/bacterial_pipeline/system/2023-10-03.build7061/arch/x86_64/lib/libxncbi.so /export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/corelib/ncbiapp.cpp:711 ncbi::CNcbiApplicationAPI::x_TryMain(ncbi::EAppDiagStream, char const, int, bool) offset=0x0 addr=0x7f097674a132
  /panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/bacterial_pipeline/system/2023-10-03.build7061/arch/x86_64/lib/libxncbi.so /export/home/gpipe/TeamCity/Agent3/work/427aceaa834ecbb6/ncbi_cxx/src/corelib/ncbiapp.cpp:1023 ncbi::CNcbiApplicationAPI::AppMain(int, char const const, char const const, ncbi::EAppDiagStream, char const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) offset=0x0 addr=0x7f097674d78c
  /panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/bacterial_pipeline/system/2023-10-03.build7061/arch/x86_64/bin/checkm_wnode :0 offset=0x0 addr=0x40bc82
  /usr/lib64/libc-2.17.so :0 offset=0x0 addr=0x7f0975176554
  /panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/bacterial_pipeline/system/2023-10-03.build7061/arch/x86_64/bin/checkm_wnode :0 offset=0x0 addr=0x40be59

azat-badretdin commented 8 months ago

Thank you for your report, user @Chahrazadt87 !

Could you please post a larger portion of the log? We would like to see the part that starts with the last line containing INFO...... <some path>$ <application_name> \ and ends with the first permanentFail.

Thanks
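For readers whose log is too large to attach, the requested excerpt can be cut out with a short script. This is a minimal sketch, not part of PGAP: the function names are hypothetical, and it assumes that "the last INFO line" means the last line before the first permanentFail that contains INFO and ends with a backslash.

```python
def extract_failure_window(lines):
    """Return the lines from the last 'INFO ... \\' line up to the first permanentFail.

    Assumption: the command line of a job is logged on a line that contains
    'INFO' and ends with a backslash, as described in the comment above.
    """
    # Index of the first line that mentions permanentFail, if any.
    end = next((i for i, line in enumerate(lines) if "permanentFail" in line), None)
    if end is None:
        return []
    # Walk backwards to the closest INFO line that ends with a backslash.
    # If none is found, fall back to the start of the log.
    start = 0
    for i in range(end, -1, -1):
        text = lines[i].rstrip()
        if "INFO" in text and text.endswith("\\"):
            start = i
            break
    return lines[start:end + 1]


def save_excerpt(log_path, out_path):
    """Write the excerpt of log_path to out_path and return its line count."""
    with open(log_path, "r", encoding="utf-8", errors="replace") as handle:
        window = extract_failure_window(handle.readlines())
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.writelines(window)
    return len(window)
```

Calling save_excerpt("cwltool.log", "cwltool.excerpt.txt") would then produce a much smaller file that can be attached to the issue.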

azat-badretdin commented 8 months ago

And if you can post the whole log file, that would be even better!

Chahrazadt87 commented 8 months ago

Hi Azat,

Thank you for your response. I have attached the partial log file (too large to attach all of it). Please be aware that I only get the missing output files once I add --debug to the command.

Many thanks,

Chahrazad

cwltool.log

azat-badretdin commented 8 months ago

Thanks. It looks like you were trying to upload the whole log, but the initial portion of the file still seems to be missing.

Chahrazadt87 commented 8 months ago

The file is too large to upload, I'm afraid. My file is 30 MB and the limit is 25 MB.

azat-badretdin commented 8 months ago

Could you please post the command line?

Chahrazadt87 commented 8 months ago

Sure:

./pgap.py -r -o Documents/Halorubrum_SS5_8/SS5_8_results Documents/Halorubrum_SS5_8/SS5_8.yaml

azat-badretdin commented 8 months ago

I see that you are using the "old school" method of supplying user information via YAML. Could you please post that YAML file as well?

Thanks

Chahrazadt87 commented 8 months ago

Please be aware that the following works for all other strains. Only a handful fail.

topology: 'circular'
location: 'chromosome'
organism:
    genus_species: 'Halorubrum sp. SS5-8'
    strain: 'my_strain'
contact_info:
    last_name: 'Warnecke'
    first_name: 'Tobias'
    email: 't.w@lms.mrc.ac.uk'
    organization: 'MRC London Institute of Medical Sciences'
    department: 'Molecular Systems Group'
    street: 'Du Cane Rd'
    city: 'London'
    postal_code: 'W12 0NN'
    state: 'Greater London'
    country: 'United Kingdom'
authors:

azat-badretdin commented 8 months ago

Thanks, Chahrazad!

I got the genome species:

$ gettax -dates 'Halorubrum sp. SS5-8'

    scientific name: Halorubrum sp. SS5-8
             tax id: 1089755
      parent tax id: 2642239
             gb_div: Bacteria
               rank: species
            lineage: Archaea; Euryarchaeota; Stenosarchaea group; Halobacteria;
                     Haloferacales; Haloferacaceae; Halorubrum
              id_gc: 11
            name_gc: Bacterial, Archaeal and Plant Plastid
             id_mgc: 0
           name_mgc: Unspecified
           crt_date: 2011/09/26 14:32:50
           upd_date: 2011/10/23 17:33:10
           pub_date: 2011/10/22 18:00:26

So we can eliminate taxonomic novelty as a factor in the checkm failure. Another suspicious factor is that the organism is archaeal, but that did not turn out to be the culprit either: the checkm data, old as it is (2015), has plenty of that taxonomic lineage in its database.

Upon closer examination of the cwltool.log file you posted, I found an error message that I had missed previously:


Process SyncManager-1:
Traceback (most recent call last):
  File "/opt/python-3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
PermissionError: [Errno 1] Operation not permitted
Traceback (most recent call last):
  File "/root/venv/bin/checkm", line 856, in <module>
    checkmParser.parseOptions(args)
  File "/root/venv/lib/python3.9/site-packages/checkm/main.py", line 992, in parseOptions
    self.analyze(options)
  File "/root/venv/lib/python3.9/site-packages/checkm/main.py", line 326, in analyze
    binIdToModels = mgf.find(binFiles,
  File "/root/venv/lib/python3.9/site-packages/checkm/markerGeneFinder.py", line 68, in find
    binIdToModels = mp.Manager().dict()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/managers.py", line 558, in start
    self._address = reader.recv()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/python-3.9/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/python-3.9/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
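The first traceback fails inside multiprocessing's Listener at the socket bind call, which on Unix systems normally binds a Unix-domain socket in a temporary directory. A minimal sketch for checking whether such a bind is permitted in a given directory follows. The function name is hypothetical, it is not part of CheckM or PGAP, and it tests only the bind step, not the rest of what CheckM does.

```python
import os
import socket
import tempfile


def can_bind_unix_socket(directory=None):
    """Return True if a Unix-domain socket can be bound inside `directory`.

    `directory` defaults to the system temporary directory. A False result
    may also be caused by a path that is too long for a Unix socket address.
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "probe.sock")
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.bind(path)
            return True
        except OSError as exc:  # PermissionError is a subclass of OSError
            print(f"bind failed in {tmp}: {exc!r}")
            return False
        finally:
            sock.close()


if __name__ == "__main__":
    print("Unix socket bind permitted:", can_bind_unix_socket())
```

To be informative, the check would have to run in the same environment and against the same temporary directory as the failing job. Whether that directory is the culprit here is an assumption that the thread does not confirm.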

I think we have enough information now to try to reproduce this on our side. I suppose you are using the latest version of the PGAP package, correct?

Chahrazadt87 commented 8 months ago

Hi Azat,

Yes, I am. Everything is up to date software-wise.

Many thanks,

Chahrazad

azat-badretdin commented 8 months ago

We will try to look at this ASAP.

Chahrazadt87 commented 8 months ago

Good morning Azat,

Any news on this issue please?

Many thanks,

Chahrazad

azat-badretdin commented 8 months ago

We started working on this, Chahrazad.

Could you please tarball and post the contents of the directory */tmp-outdir/usg1w98j? Thanks!

azat-badretdin commented 8 months ago

I was not able to reproduce your results with the same species, Chahrazad. For input data I used a similar species from the same genus, and the failure did not occur.

azat-badretdin commented 8 months ago

Is your input by any chance a single plasmid?

azat-badretdin commented 8 months ago

I ran a plasmid I had as well, standalone, and I was not able to reproduce the results.

azat-badretdin commented 8 months ago

Looking more closely at the cwltool.log file you posted, I can see that this is a full-blown assembly.

Could you please post head -50 cwltool.log output?

Chahrazadt87 commented 8 months ago

Hi Azat,

It is a whole-genome assembly with a few contigs. Could you please explain what you mean by "head -50 cwltool.log"? I am new to this, so bear with me :)

azat-badretdin commented 8 months ago

head is a Unix command that prints only the specified number of lines from the start of a text file.

azat-badretdin commented 8 months ago

Chahrazad, would you be willing to post the input genome FASTA file?

Chahrazadt87 commented 8 months ago

Sure, can you please give me your email so that I can send it to you?

azat-badretdin commented 8 months ago

It's our official email: prokaryote-tools@ncbi.nlm.nih.gov. The data stays there strictly on a need-to-know basis.

Chahrazadt87 commented 8 months ago

I just emailed you the genome :)

Thanks again for looking into this.

azat-badretdin commented 8 months ago

Thanks! Running it now...

azat-badretdin commented 8 months ago

Nope. Could not reproduce with exactly your input either. So, how about that head -50 cwltool.log output?

head -50 cwltool.log > head.50.txt

and attach it here?
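For anyone without a Unix shell at hand, the same first-50-lines excerpt can be produced in Python. This is a minimal sketch with a hypothetical helper name. The file names come from the command above.

```python
from itertools import islice


def head(path, n=50):
    """Return the first n lines of a text file, like `head -n`."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # islice stops after n lines, so a large log is not read in full.
        return list(islice(handle, n))


def write_head(path, out_path, n=50):
    """Save the first n lines of `path` to `out_path`; return the line count."""
    lines = head(path, n)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.writelines(lines)
    return len(lines)
```

Calling write_head("cwltool.log", "head.50.txt") gives the same file as the shell redirect.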

Chahrazadt87 commented 7 months ago

"We started working on this, Chahrazad. Could you please tarball and post the contents of the directory */tmp-outdir/usg1w98j? Thanks!"

usg1w98j.zip

Chahrazadt87 commented 7 months ago

Hi Azat,

Apologies for my late response. I have attached all that you need now.

Many thanks,

Chahrazad

head.50.txt

azat-badretdin commented 7 months ago

Thanks for the files, Chahrazad!

I see that you are running this on a Mac. I am just noting this here for indexing purposes.

Could you please post output of

find . -name annotation.fa | xargs /bin/ls -ltr

Thanks!
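A rough Python counterpart of the find command above, for readers less familiar with the shell, is sketched below. It assumes that only the file sizes and modification times matter. The helper name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def find_by_name(root, name="annotation.fa"):
    """Return all files called `name` under `root`, oldest first (like ls -tr)."""
    matches = [p for p in Path(root).rglob(name) if p.is_file()]
    return sorted(matches, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Print size, modification time and path for each match under the current directory.
    for path in find_by_name("."):
        info = path.stat()
        stamp = datetime.fromtimestamp(info.st_mtime).strftime("%Y-%m-%d %H:%M")
        print(f"{info.st_size:>12} {stamp} {path}")
```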

Chahrazadt87 commented 7 months ago

-rw-r--r--@ 1 ct1221 staff 1306409 20 Nov 13:25 ./SS5_8_results/debug/tmpdir/m8jqspdl/checkm.1296702415621568dOBsj4/fasta_by_scaffold/bins-prot/annotation.fa
-rw-r--r--@ 1 ct1221 staff 1306409 20 Nov 13:25 ./SS5_8_results/debug/tmpdir/m8jqspdl/checkm.1296702415621568dOBsj4/fasta_by_scaffold/annotation.fa

azat-badretdin commented 7 months ago

Thanks, Chahrazad!

So, this is different from the PGAP-8585 case.

azat-badretdin commented 7 months ago

While we are scratching our heads, let me at least pass back to you what we successfully calculated in-house.

I am going to find out what our SOP is on this.

azat-badretdin commented 7 months ago

"let me at least pass back to you what we successfully calculated in-house"

Hi, Chahrazad! Could you please confirm that you got the results?

Another question: can you try to run standalone checkm on your input file debug/tmpdir/*/checkm.*/fasta_by_scaffold/bins-prot/annotation.fa:

mkdir -p bins-prot/
cp debug/tmpdir/*/checkm.*/fasta_by_scaffold/bins-prot/annotation.fa bins-prot/
checkm taxonomy_wf -t 1 -g -x fa genus Halorubrum bins-prot/ taxonomy_wf-prot/

Thanks!

Chahrazadt87 commented 7 months ago

Hi Azat,

I got this message when I ran it:

[2023-12-13 13:26:58] INFO: CheckM data: /Users/ct1221/.checkm
[2023-12-13 13:26:58] INFO: [CheckM - taxon_set] Generate taxonomic-specific marker set.

Unexpected error: <class 'FileNotFoundError'>
Traceback (most recent call last):
  File "/Users/ct1221/opt/anaconda3/bin/checkm", line 856, in <module>
    checkmParser.parseOptions(args)
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/main.py", line 991, in parseOptions
    self.taxonSet(options)
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/main.py", line 293, in taxonSet
    bValidSet = taxonParser.markerSet(
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/taxonParser.py", line 82, in markerSet
    taxonMarkerSets = self.readMarkerSets()
  File "/Users/ct1221/opt/anaconda3/lib/python3.9/site-packages/checkm/taxonParser.py", line 40, in readMarkerSets
    for line in open(DefaultValues.TAXON_MARKER_SETS):
FileNotFoundError: [Errno 2] No such file or directory: '/Users/ct1221/.checkm/taxon_marker_sets.tsv'

Please note that sometimes the run works when I shut down everything or clear the cache. It's not very consistent, though, so I'm still confused as to why this happens.

Kind regards,

Chahrazad

azat-badretdin commented 7 months ago

Thanks. How do you run it? It looks to me like you are running it directly on your Mac, not inside a virtual machine or container. I would recommend fixing your local installation and running it again. Right now, the error you are getting is not related to our problem; it is a problem with your installation. The installation of checkm that pgap.py uses is inside a Docker container, and its data is stored elsewhere as well.
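The FileNotFoundError above means that the local CheckM data directory lacks taxon_marker_sets.tsv. A minimal sketch for checking a data directory before rerunning the standalone command is given below. The helper name is hypothetical, and the only file it checks for is the one named in the traceback. A complete CheckM data set contains more than this.

```python
from pathlib import Path


def missing_checkm_files(data_dir, required=("taxon_marker_sets.tsv",)):
    """Return the names in `required` that are not present in `data_dir`.

    `required` lists only the file named in the traceback above; it is not
    a complete inventory of the CheckM reference data.
    """
    root = Path(data_dir).expanduser()
    return [name for name in required if not (root / name).is_file()]
```

For the installation in this thread, missing_checkm_files("~/.checkm") would be expected to return ["taxon_marker_sets.tsv"] until the reference data has been installed there.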