ncbi / egapx

Eukaryotic Genome Annotation Pipeline-External caller scripts and documentation

Error with convert proteins step #3

Closed — CEPHAS-01 closed this issue 6 months ago

CEPHAS-01 commented 7 months ago

Hi,

Happy to finally see this tool released, thanks.

I initiated the test run with the local configuration but encountered the error below at the protein conversion step:

python3 Tools/software/egapx/ui/egapx.py Tools/software/egapx/examples/input_D_farinae_small.yaml -e local -w egapAnnotation/testrun -o testRunOutput

ERROR ~ Error executing process > 'egapx:setup_proteins:convert_proteins'

Caused by: Process egapx:setup_proteins:convert_proteins terminated with an error exit status (127)

Command executed:

mkdir -p asn
mkdir -p fasta
if [[ true == true ]]; then
    zcat src/6954.faa.gz | sed 's/>([^ |]+)( .)\?$/>lcl|\1\2/' > fasta/6954.faa
else
    sed 's/>([^ |]+)( .)\?$/>lcl|\1\2/' src/6954.faa.gz > fasta/6954.faa
fi
multireader -flags ParseRawID -out-format asn_text -input fasta/6954.faa -output asn/6954.asn

Command exit status: 127

Command output: (empty)

Command error: .command.sh: line 9: multireader: command not found

How do I fix this, please?

Also, if I want to use the singularity mode, where do I download the singularity image file?

Thanks in advance.

TLag

pstrope commented 7 months ago

Hi, -e local is currently for running internally at NCBI. Please try the same command with -e docker or -e singularity, whichever applies to you.
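For example, the test command you posted would become the following (a sketch; only the executor flag changes):

python3 Tools/software/egapx/ui/egapx.py Tools/software/egapx/examples/input_D_farinae_small.yaml \
    -e singularity -w egapAnnotation/testrun -o testRunOutput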

pstrope commented 7 months ago

where do I download the singularity image file?

When you use -e singularity, the workflow automatically pulls the Docker image and converts it to a Singularity image.
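If you want to control where that image is cached, or pull it yourself ahead of time, a minimal sketch is below; the cache path is just an example:

# Tell Nextflow where to store and reuse the converted Singularity image
export NXF_SINGULARITY_CACHEDIR=/path/to/singularity_cache
# Optionally pull and convert the image ahead of time (otherwise the workflow does this automatically)
singularity pull ncbi-egapx-latest.img docker://ncbi/egapx:latest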

CEPHAS-01 commented 7 months ago

Hi @pstrope, thanks so much for your feedback. I have updated the command as you suggested and it is running so far. I will provide updates later.

TLag

CEPHAS-01 commented 7 months ago

Slurm terminated the job, apparently due to an issue with an HMM process:

SLURM: slurmstepd: error: JOB 4728352 STEPD TERMINATED ON n049 AT 2024-04-10T08:57:39 DUE TO JOB NOT ENDING WITH SIGNALS

Log from stdout:

Wed Apr 10 07:57:29 PDT 2024
N E X T F L O W ~ version 23.10.1
Launching Tools/software/egapx/ui/../nf/ui.nf [trusting_cori] DSL2 - revision: c134f40af5
in egapx block

Pulling Singularity image docker://ncbi/egapx:latest [cache egapAnnotation/testrun/singularity/ncbi-egapx-latest.img]
WARN: Singularity cache directory has not been defined -- Remote image will be stored in the path: egapAnnotation/testrun/singularity -- Use the environment variable NXF_SINGULARITY_CACHEDIR to specify a different location

executor > local (4)
[a4/ab36b3] process > egapx:setup_genome:get_geno...  [  0%] 0 of 1
[ff/6202e7] process > egapx:setup_proteins:conver...  [  0%] 0 of 1
[84/f10b05] process > egapx:get_hmm_params:run_ge...  [  0%] 0 of 1
[92/659f32] process > egapx:annot_builder:annot_b...  [  0%] 0 of 1
(all other egapx processes pending)

ERROR ~ Error executing process > 'egapx:get_hmm_params:run_get_hmm'

Caused by: Process egapx:get_hmm_params:run_get_hmm terminated with an error exit status (255)

Command executed:

#!/usr/bin/env python3

import json
from urllib.request import urlopen

def get_closest_hmm(taxid):
  taxon_str = str(taxid)
  if not taxon_str:
      return ""
  dataset_taxonomy_url = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/taxonomy/taxon/"

  taxids_file = urlopen("https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/gnomon/hmm_parameters/taxid.list")
  taxids_list = []
  lineages = []
  for line in taxids_file:
      parts = line.decode("utf-8").strip().split('    ')
      if len(parts) > 0:
          t = parts[0]
          taxids_list.append(t)
          if len(parts) > 1:
              l = map(lambda x: int(x) if x[-1] != ';' else int(x[:-1]), parts[1].split())
              lineages.append((int(t), list(l)+[int(t)]))

  if len(lineages) < len(taxids_list):
      taxonomy_json_file = urlopen(dataset_taxonomy_url+','.join(taxids_list))
      taxonomy = json.load(taxonomy_json_file)["taxonomy_nodes"]
      lineages = [ (t["taxonomy"]["tax_id"], t["taxonomy"]["lineage"] + [t["taxonomy"]["tax_id"]]) for t in taxonomy ]

  taxon_json_file = urlopen(dataset_taxonomy_url+taxon_str)
  taxon = json.load(taxon_json_file)["taxonomy_nodes"][0]
  lineage = taxon["taxonomy"]["lineage"]
  lineage.append(taxon["taxonomy"]["tax_id"])
  # print(lineage)
  # print(taxon["taxonomy"]["organism_name"])

  best_lineage = None
  best_taxid = None
  best_score = 0
  for (t, l) in lineages:
      pos1 = 0
      last_match = 0
      for pos in range(len(lineage)):
          tax_id = lineage[pos]
          while tax_id != l[pos1]:
              if pos1 + 1 < len(l):
                  pos1 += 1
              else:
                  break
          if tax_id == l[pos1]:
              last_match = pos1
          else:
              break
      if last_match > best_score:
          best_score = last_match
          best_taxid = t
          best_lineage = l

  if best_score == 0:
      return ""
  # print(best_lineage)
  # print(best_taxid, best_score)
  return f'https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/gnomon/hmm_parameters/{best_taxid}.params'

print(get_closest_hmm(6954))

Command exit status: 255

Command output: (empty)

Command error:

INFO: Converting SIF file to temporary sandbox...
FATAL: stat /bin/bash: no such file or directory
INFO: Cleaning up image...

Work dir: egapAnnotation/testrun/84/f10b0550595d959ae71d9138ed2599

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

-- Check 'egapAnnotation/testrun/testRun_out/nextflow.log' file for details

WARN: Killing running tasks (3)

executor > local (4)
[84/f10b05] process > egapx:get_hmm_params:run_ge...  [100%] 1 of 1, failed: 1 ✘
(all other egapx processes pending)
ERROR ~ Error executing process > 'egapx:get_hmm_params:run_get_hmm'
(the detailed report for this error, identical to the one above, is repeated here)


!!WARNING!! This is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use.

None

To resume execution, run:

nextflow -C egapAnnotation/testrun/egapx_config/singularity.config,Tools/software/egapx/ui/assets/config/default.config,Tools/software/egapx/ui/assets/config/docker_image.config,Tools/software/egapx/ui/assets/config/process_resources.config -log egapAnnotation/testrun/testRun_out/nextflow.log run Tools/software/egapx/ui/../nf/ui.nf --output egapAnnotation/testrun/testRun_out -with-report egapAnnotation/testrun/testRun_out/run.report.html -with-timeline egapAnnotation/testrun/testRun_out/run.timeline.html -with-trace egapAnnotation/testrun/testRun_out/run.trace.txt -params-file egapAnnotation/testrun/testRun_out/run_params.yaml -resume

Don't forget to delete file(s) /tmp/tmp7_2je1cx

Wed Apr 10 08:56:39 PDT 2024

pstrope commented 7 months ago

Please post the full command that you ran. Thanks.

CEPHAS-01 commented 7 months ago

python3 Tools/software/egapx/ui/egapx.py Tools/software/egapx/input_ramb_2.yml -e singularity -w egapAnnotation -o ramb2_out

pstrope commented 7 months ago

Thanks. Did you try with the included example YAML file (input_D_farinae_small.yaml), and did it run to completion?

CEPHAS-01 commented 7 months ago

Yes, the run with the included input_D_farinae_small.yaml file is actually what gave me the error. I also tested on my own dataset; the command I posted is for that dataset, but I used the same command with the appropriate input_D_farinae_small.yaml file for the run with the test data.

The command for the test run:

python3 Tools/software/egapx/ui/egapx.py Tools/software/egapx/examples/input_D_farinae_small.yaml -e singularity -w egapAnnotation/testrun -o testRun_out

The only files in the testRun_out directory are: run_params.yaml, run.trace.txt, run.report.html, run.timeline.html, and nextflow.log.

The test run with the input_D_farinae_small.yaml is what produced the error I posted earlier.

pstrope commented 7 months ago

OK, we will look into it and get back to you. Thank you for testing and reporting.

Pooja

CEPHAS-01 commented 7 months ago

Okay thanks.

FWIW, I don't know if the complete Nextflow log would be useful, but at 721 lines it's quite long to post here. Is this something you would like to look at?

pstrope commented 7 months ago

Yes, you could attach the file here. That would be helpful.

CEPHAS-01 commented 7 months ago

Alright, here it is: nextflow.log

boukn commented 7 months ago

The stand-out error from that log is:

FATAL: stat /bin/bash: no such file or directory

Is that an actual quirk of the machine you are running on? Is /bin/bash there, but on an NFS-type mount that's flaky sometimes?
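A minimal check you could run on the node where the job landed (n049 in the Slurm message), plus the same check inside the pulled image (path taken from your log):

# On the compute node itself
ls -l /bin/bash
stat /bin/bash
# Inside the container image that Nextflow pulled
singularity exec egapAnnotation/testrun/singularity/ncbi-egapx-latest.img ls -l /bin/bash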

CEPHAS-01 commented 7 months ago

Yes, it is NFS on an HPC system. From my reading of that line, the .sif image is what appears to be "missing". One thing I could try is running this on my MacBook to see if it runs through. What is the estimated size of the .sif image file?

CEPHAS-01 commented 7 months ago

I can confirm that the image was successfully pulled; it is about 500 MB.

pstrope commented 7 months ago

Hi, it looks like opening the container might be the issue. Let's see if that is truly the case. Can you run the following and see if it works (i.e., the version is printed)?

singularity exec path_to_downloaded_image/singularity/ncbi-egapx-latest.img getfasta -version

CEPHAS-01 commented 7 months ago

The container opened, with the output: getfasta: 0.0.1670

murphyte commented 7 months ago

Hi Temitayo -- I just wanted to give you an update: we're working on some additional testing on a different HPC we have access to, to see if we can reproduce your issue. One other notable issue showing up in the log is:

Process egapx:get_hmm_params:run_get_hmm terminated with an error exit status (255)

We're seeing something along those lines in some other testing, which looks like an issue with web API access from within Singularity, and we are investigating how to solve it. We may also need to set up a Zoom call to help pick apart what's going on here, but there's enough similarity to an issue we can reproduce that we'll try to resolve that first and see if it helps you.
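In the meantime, one quick way to check whether those web endpoints are reachable from inside the container is a sketch like the one below (the image path is taken from your log; the URLs are the ones the run_get_hmm script uses):

IMG=egapAnnotation/testrun/singularity/ncbi-egapx-latest.img
# Query the NCBI Datasets taxonomy API for taxid 6954 from inside the container
singularity exec "$IMG" python3 -c "from urllib.request import urlopen; print(urlopen('https://api.ncbi.nlm.nih.gov/datasets/v2alpha/taxonomy/taxon/6954').status)"
# Fetch the HMM parameters taxid list from the NCBI FTP site from inside the container
singularity exec "$IMG" python3 -c "from urllib.request import urlopen; print(urlopen('https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/gnomon/hmm_parameters/taxid.list').status)"

Both commands should print 200 if the endpoints are reachable from inside Singularity.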

Stay tuned!

CEPHAS-01 commented 7 months ago

Hi @murphyte, thanks so much for your feedback. I have also run this on another HPC and got a different type of error, but I want to look into that properly first, and perhaps run on a third HPC I have access to, before reporting it here. I am okay with a Zoom call to help pick the issues apart; just let me know when works for you. Looking forward to seeing this resolved.

victzh commented 7 months ago

Hi, I'm a developer on the EGAPx team. Which HPC clusters do you use specifically, and do they have publicly available documentation? I'd like to look at them before the Zoom call. We recently adapted EGAPx to Biowulf, and there were some incompatibilities to solve.

Victor.

CEPHAS-01 commented 7 months ago

Hi Victor,

The cluster runs:

Linux version 3.10.0-1160.83.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC))
Slurm for job management
OpenMPI for message passing

Please share your email address and I will send you the URL for the available institutional documentation.

Thanks in advance.

Temitayo

victzh commented 6 months ago

That was an actual bug; it is fixed in the upcoming release, 0.1.1 alpha.