guigolab / geneidx

Nextflow pipeline for genome annotation of protein-coding genes
GNU General Public License v3.0

geneidx tries to write in location where it should only read & taxon id not found #4

Closed KatharinaHoff closed 1 year ago

KatharinaHoff commented 1 year ago

Hi,

I tried to run geneidx. With singularity, I failed to pull the image - but possibly that's an issue on my own server.

Error executing process > 'param_value_selection_workflow:getParamName (7240)'

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name ferriolcalvet-geneidx.img.pulling.1679713549962 docker://ferriolcalvet/geneidx > /dev/null
  status : 2
  message:
    Usage: singularity [options]

    singularity: error: no such option: --name

Here's the output of singularity --version (installed via apt-get on Ubuntu):

Singularity 1.00 (commit: 2ebc2f3f2059b96885416167363bde2e27ece106)
Running under Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
pygame 2.1.2 (SDL 2.0.20, Python 3.10.6)
Hello from the pygame community. https://www.pygame.org/contribute.html

With docker, I execute with root permissions. The image can be pulled. The problem is that apparently the container tries to write in places where the input data sits.

Here is my command:

sudo nextflow run main.nf -profile docker --genome /nas-hs/projs/data/Drosophila_melanogaster/data/genome.fasta.masked.gz --taxid 7240 --outdir .

Here is my error message:

Error executing process > 'UncompressFASTA (genome.fasta.masked.gz)'

Caused by:
  Process `UncompressFASTA (genome.fasta.masked.gz)` terminated with an error exit status (125)

Command executed:

  if [ ! -s  genome.fasta.masked ]; then
      echo "unzipping genome genome.fasta.masked.gz"
      gunzip -c genome.fasta.masked.gz > genome.fasta.masked;
  fi

Command exit status:
  125

Command output:
  (empty)

Command error:
  docker: Error response from daemon: error while creating mount source path '/nas-hs/projs/data/Drosophila_melanogaster/data': mkdir /nas-hs: read-only file system.
  time="2023-03-25T04:07:50+01:00" level=error msg="error waiting for container: context canceled"

Work dir:
  /home/katharina/git/geneidx/work/66/4a183e6f9fa512936ae52466a8b48b

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

It is rather clear how to fix this (don't pipe the unpacked genome anywhere but the output directory).
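The suggested fix can be sketched in stdlib Python: decompress into the task's own (writable) work directory and never touch the directory holding the input. This is an illustrative helper, not code from geneidx; `uncompress_to_workdir` and its paths are my own names.

```python
import gzip
import shutil
from pathlib import Path

def uncompress_to_workdir(gz_path, workdir="."):
    """Decompress <name>.gz into the writable work directory,
    leaving the (possibly read-only) input location untouched."""
    gz_path = Path(gz_path)
    out = Path(workdir) / gz_path.with_suffix("").name  # strip trailing .gz
    # Mirror the pipeline's guard: only unzip if the target is missing/empty
    if not out.exists() or out.stat().st_size == 0:
        with gzip.open(gz_path, "rb") as src, open(out, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out
```

Run inside the task work dir, this writes `genome.fasta.masked` next to the task's other intermediate files instead of next to the read-only input.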

Next, I copied the gzipped genome into a folder where root has write permissions and tried again. I am using reference species Drosophila simulans with taxon id 7240, which is indeed the taxon id in NCBI Taxonomy.

Here is my new call:

sudo nextflow run main.nf -profile docker --genome genome.fasta.masked.gz --taxid 7240 --outdir .

It fails to find the taxon for some obscure reason. There are most definitely D. simulans proteins at NCBI for this taxon; I have the protein set on my hard drive, too. I just don't know how to start the pipeline with a local protein set. Or maybe it finds the proteins, but something goes wrong when looking up geneid parameters for this taxon?
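Worth noting: exit status 127 conventionally means "command not found", which points at the container not providing `python3` (or the script never starting) rather than the taxid failing to resolve. Whether taxid 7240 resolves can be checked outside the container against the same ENA endpoint the pipeline script queries. The sketch below is stdlib-only and my own illustration (`ENA_URL` and `lineage_from_xml` are not names from the pipeline); it mirrors the lineage extraction done by the script's `get_organism()`.

```python
import urllib.request
import xml.etree.ElementTree as ET

ENA_URL = "https://www.ebi.ac.uk/ena/browser/api/xml/{}?download=false"

def lineage_from_xml(xml_bytes):
    """Return the taxid lineage from an ENA taxon XML document
    (the same structure the pipeline's get_organism() parses)."""
    root = ET.fromstring(xml_bytes)
    taxon = root[0]                    # first <taxon> element
    lineage = [taxon.attrib["taxId"]]  # include the queried taxon itself
    for child in taxon:
        if child.tag == "lineage":
            lineage += [n.attrib["taxId"] for n in child if "taxId" in n.attrib]
    return lineage

if __name__ == "__main__":
    try:
        with urllib.request.urlopen(ENA_URL.format(7240), timeout=30) as resp:
            print(lineage_from_xml(resp.read()))
    except OSError as exc:  # network or ENA unavailable
        print("ENA not reachable:", exc)
```

If this prints a lineage list, the taxid resolves fine and the failure lies in the container setup, not the taxonomy lookup.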

Error messages:

N E X T F L O W  ~  version 22.10.7
Launching `main.nf` [clever_bardeen] DSL2 - revision: bb4f07340a

GeneidX
=============================================
output          : /home/katharina/git/geneidx/output
genome          : genome.fasta.masked.gz
taxon           : 7240

WARN: A process with name 'getFASTA2' is defined more than once in module script: /home/katharina/git/geneidx/subworkflows/CDS_estimates.nf -- Make sure to not define the same function as process
executor >  local (4)
[fe/ef4540] process > UncompressFA... [  0%] 0 of 1
[-        ] process > fix_chr_names   -
[-        ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [  0%] 0 of 1
[-        ] process > prot_down_wo... -
[-        ] process > build_protei... -
[-        ] process > build_protei... -
[-        ] process > alignGenome_... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[88/8e185c] process > param_select... [  0%] 0 of 1
[-        ] process > param_select... -
[56/3eda8f] process > param_value_... [  0%] 0 of 1
[-        ] process > param_value_... -
[-        ] process > creatingPara... -
[-        ] process > geneid_WORKF... -
[-        ] process > geneid_WORKF... -
[-        ] process > prep_concat     -
[-        ] process > concatenate_... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff34portal     -
executor >  local (4)
[fe/ef4540] process > UncompressFA... [100%] 1 of 1, failed: 1 ✘
[-        ] process > fix_chr_names   -
[-        ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [100%] 1 of 1, failed: 1 ✘
[-        ] process > prot_down_wo... -
[-        ] process > build_protei... -
[-        ] process > build_protei... -
[-        ] process > alignGenome_... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[-        ] process > matchAssessm... -
[88/8e185c] process > param_select... [100%] 1 of 1, failed: 1 ✘
[-        ] process > param_select... -
[56/3eda8f] process > param_value_... [100%] 1 of 1, failed: 1 ✘
[-        ] process > param_value_... -
[-        ] process > creatingPara... -
[-        ] process > geneid_WORKF... -
[-        ] process > geneid_WORKF... -
[-        ] process > prep_concat     -
[-        ] process > concatenate_... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff3addInfo:... -
[-        ] process > gff34portal     -
Execution cancelled -- Finishing pending tasks before exit
Oops ...

Error executing process > 'param_selection_workflow:getParamName (7240)'

Caused by:
  Process `param_selection_workflow:getParamName (7240)` terminated with an error exit status (127)

Command executed:

  #!/usr/bin/env python3
  # coding: utf-8

  import os, sys
  import pandas as pd
  import requests
  from lxml import etree

  # Define an alternative in case everything fails
  selected_param = "Homo_sapiens.9606.param"

  # Define functions
  def choose_node(main_node, sp_name):
      for i in range(len(main_node)):
          if main_node[i].attrib["scientificName"] == sp_name:
              #print(main_node[i].attrib["rank"],
              #    main_node[i].attrib["taxId"],
              #    main_node[i].attrib["scientificName"])
              return main_node[i]
      return None

  # given a node labelled with the species, and with
  # lineage inside it returns the full path of the lineage
  def sp_to_lineage_clean ( sp_sel ):
      lineage = []

      if sp_sel is not None:
          lineage.append(sp_sel.attrib["taxId"])

      for taxon in sp_sel:
          #print(taxon.tag, taxon.attrib)
          if taxon.tag == 'lineage':
              lin_pos = 0
              for node in taxon:
                  if "rank" in node.attrib.keys():
                      lineage.append( node.attrib["taxId"] )
                  else:
                      lineage.append( node.attrib["taxId"] )
                  lin_pos += 1
      return(lineage)

  def get_organism(taxon_id):
      response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/{taxon_id}?download=false") ##
      if response.status_code == 200:
          root = etree.fromstring(response.content)
          species = root[0].attrib
          lineage = []
          for taxon in root[0]:
              if taxon.tag == 'lineage':
                  for node in taxon:
                      lineage.append(node.attrib["taxId"])
      return lineage

  if 0:
      ###
      # We want to update the lists as new parameters may have been added
      ###

      # List files in directory
      list_species_taxid_params = os.listdir('Parameter_files.taxid/*.param')
      list_species_taxid = [x.split('.')[:2] for x in list_species_taxid_params]

      # Put the list into a dataframe
      data_names = pd.DataFrame(list_species_taxid, columns = ["Species_name", "taxid"])

      # Generate the dataframe with the filename and lineage information
      list_repeats_taxids = []
      for species_no_space, taxid in zip(data_names.loc[:,"Species_name"], data_names.loc[:,"taxid"]):
          species = species_no_space.replace("_", " ")
          response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/textsearch?domain=taxon&query={species}")
          xml = response.content
          if xml is None or len(xml) == 0:
              continue

          root = etree.fromstring(xml)
      #     print(species)
          sp_sel = choose_node(root, species)
          if sp_sel is None:
              continue
      #     print(sp_sel.attrib.items())getParamName
          lineage_sp = sp_to_lineage_clean(sp_sel)

          param_species = f"{species_no_space}.{taxid}.param"
          list_repeats_taxids.append((species, taxid, param_species, lineage_sp))
          # print((ens_sp, species, link_species, lineage_sp))

      # Put the information into a dataframe
      data = pd.DataFrame(list_repeats_taxids, columns = ["species", "taxid", "parameter_file", "taxidlist"])

      data.to_csv("Parameter_files.taxid/params_df.tsv", sep = "", index = False)
      # print("New parameters saved")

  else:
      ###
      # We want to load the previously generated dataframe
      ###
      data = pd.read_csv("Parameter_files.taxid/params_df.tsv", sep = " ")

      def split_n_convert(x):
          return [int(i) for i in x.replace("'", "").strip("[]").split(", ")]
      data.loc[:,"taxidlist"] = data.loc[:,"taxidlist"].apply(split_n_convert)

  # Following either one or the other strategy we now have N parameters to choose.
  # print(data.shape[0], "parameters available to choose")

  ###
  # Separate the lineages into a single taxid per row
  ###
  exploded_df = data.explode("taxidlist")
  exploded_df.columns = ["species", "taxid_sp", "parameter_file", "taxid"]
  exploded_df.loc[:,"taxid"] = exploded_df.loc[:,"taxid"].astype(int)

  ###
  # Get the species of interest lineage
  ###
  query = pd.DataFrame(get_organism(int(7240)))
  query.columns = ["taxid"]
  query.loc[:,"taxid"] = query.loc[:,"taxid"].astype(int)
  # print(query)

  ###
  # Intersect the species lineage with the dataframe of taxids for parameters
  ###
  intersected_params = query.merge(exploded_df, on = "taxid")
  # print(intersected_params.shape)

  ###
  # If there is an intersection, select the parameter whose taxid appears
  #   less times, less frequency implies more closeness
  ###
  if intersected_params.shape[0] > 0:
      #print(intersected_params.loc[:,"taxid"].value_counts().sort_values())

      taxid_closest_param = intersected_params.loc[:,"taxid"].value_counts().sort_values().index[0]
      #print(taxid_closest_param)

      selected_param = intersected_params[intersected_params["taxid"] == taxid_closest_param].loc[:,"parameter_file"].iloc[0]
      print("/home/katharina/git/geneidx/data/Parameter_files.taxid/", selected_param, sep = "/", end = '')

Command exit status:
  127

Command output:
  (empty)

Command error:
  docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/etc/shadow" to rootfs at "/etc/shadow": mount /etc/shadow:/etc/shadow (via /proc/self/fd/6), flags: 0x5001: no such file or directory: unknown.
  time="2023-03-25T04:11:47+01:00" level=error msg="error waiting for container: context canceled"

Work dir:
  /home/katharina/git/geneidx/work/88/8e185cac6455345234538354fbf905

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Best wishes,

Katharina

FerriolCalvet commented 1 year ago

Hi Katharina,

First of all, thank you for trying geneidx and for reporting these issues. I will try to provide solutions; please let me know whether they work.

To test these container problems, I would recommend starting by running the pipeline with only the taxid, since a default genome and output directory are already provided.

nextflow run main.nf -profile docker --taxid 7240

If that does not work, I would be more convinced that the problem lies in the interaction with the containers. If it works, add the genome and your preferred output directory and try again.

Thank you again for trying geneidx. Let me know if this works or if you find other problems.

Ferriol

KatharinaHoff commented 1 year ago

The documentation link to the singularity installation is broken, but I know the website. I have a more up-to-date singularity, installed according to exactly these instructions, on a different machine (hopefully an unbroken link for future readers: https://docs.sylabs.io/guides/3.0/user-guide/installation.html). I moved to that machine (singularity version 3.6.3) and tried again with the --bind option. Now I get a different error message:

Call:

nextflow run main.nf -profile singularity --genome /nas-hs/projs/data/Drosophila_melanogaster/data/genome.fasta.masked.gz --taxid 7240 --outdir . --bind

Output:

Error executing process > 'matchAssessment:Index_fai (genome.fasta.clean.fa)'

Caused by:
  Process `matchAssessment:Index_fai (genome.fasta.clean.fa)` terminated with an error exit status (1)

Command executed:

  if [ ! -s  genome.fasta.clean.fa.fai ]; then
      echo "indexing genome genome.fasta.clean.fa"
      samtools faidx -f genome.fasta.clean.fa
  fi

Command exit status:
  1

Command output:
  indexing genome genome.fasta.clean.fa

Command error:
  indexing genome genome.fasta.clean.fa
  [faidx] Could not build fai index genome.fasta.clean.fa.fai

Work dir:
  /home/katharina/git/geneidx/work/ea/39925d57993479add5f31774fb5b4b

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
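A `[faidx] Could not build fai index` error often traces back to a malformed or empty FASTA, e.g. when an upstream step produced a truncated `genome.fasta.clean.fa`. A quick stdlib check for the most common problems (this is an illustrative helper of mine, not part of geneidx, and it does not cover every condition faidx enforces):

```python
from pathlib import Path

def fasta_index_problems(path):
    """Report common reasons `samtools faidx` cannot build an index:
    an empty file, data before the first '>' header, or blank lines
    inside the file."""
    lines = Path(path).read_text().splitlines()
    if not lines:
        return ["file is empty"]
    problems = []
    if not lines[0].startswith(">"):
        problems.append("line 1: sequence data before the first '>' header")
    for no, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {no}: blank line inside the file")
    return problems

if __name__ == "__main__":
    # Run from the failing task's work dir
    print(fasta_index_problems("genome.fasta.clean.fa"))
```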

So I tried to run your minimal input example (in Singularity). Call:

nextflow run main.nf -profile singularity --taxid 7240

That worked fine.

I tried it in Docker, too. Call:

sudo nextflow run main.nf -profile docker --taxid 7240

That died again:

[sudo] password for katharina: 
N E X T F L O W  ~  version 22.10.7
Launching `main.nf` [distracted_miescher] DSL2 - revision: bb4f07340a

GeneidX
=============================================
output          : /home/katharina/git/geneidx/output
genome          : /home/katharina/git/geneidx/data/SampleGenomeSmall.fa.gz
taxon           : 7240

WARN: A process with name 'getFASTA2' is defined more than once in module script: /home/katharina/git/geneidx/subworkflows/CDS_estimates.nf -- Make sure to not define the same function as process
[-        ] process > UncompressFASTA                                              -
[-        ] process > fix_chr_names                                                -
[-        ] process > compress_n_indexFASTA                                        -
[-        ] process > prot_down_workflow:getProtFasta                              -
executor >  local (4)
[88/1f3652] process > UncompressFASTA (SampleGenomeSmall.fa.gz)                    [100%] 1 of 1, failed: 1 ✘
[-        ] process > fix_chr_names                                                -
[-        ] process > compress_n_indexFASTA                                        -
[ca/8bc788] process > prot_down_workflow:getProtFasta (7240)                       [100%] 1 of 1, failed: 1 ✘
[-        ] process > prot_down_workflow:downloadProtFasta                         -
[-        ] process > build_protein_DB:UncompressFASTA                             -
[-        ] process > build_protein_DB:runDIAMOND_makedb                           -
[-        ] process > alignGenome_Proteins:runDIAMOND_getHSPs_GFF                  -
[-        ] process > matchAssessment:Index_fai                                    -
[-        ] process > matchAssessment:cds_workflow:mergeMatches                    -
[-        ] process > matchAssessment:cds_workflow:filter_by_score                 -
[-        ] process > matchAssessment:cds_workflow:getFASTA                        -
[-        ] process > matchAssessment:cds_workflow:ORF_finder                      -
[-        ] process > matchAssessment:cds_workflow:updateGFFcoords                 -
[-        ] process > matchAssessment:cds_workflow:getFASTA2                       -
[-        ] process > matchAssessment:getCDS_matrices                              -
[-        ] process > matchAssessment:intron_workflow:summarizeMatches             -
[-        ] process > matchAssessment:intron_workflow:pyComputeIntrons             -
[-        ] process > matchAssessment:intron_workflow:removeProtOverlappingIntrons -
[-        ] process > matchAssessment:intron_workflow:getFASTA                     -
[-        ] process > matchAssessment:getIntron_matrices                           -
[-        ] process > matchAssessment:CombineIni                                   -
[-        ] process > matchAssessment:CombineTrans                                 -
[e2/489dc6] process > param_selection_workflow:getParamName (7240)                 [100%] 1 of 1, failed: 1 ✘
[-        ] process > param_selection_workflow:paramSplit                          -
[a8/1450ac] process > param_value_selection_workflow:getParamName (7240)           [100%] 1 of 1, failed: 1 ✘
[-        ] process > param_value_selection_workflow:paramSplitValues              -
[-        ] process > creatingParamFile_frommap                                    -
[-        ] process > geneid_WORKFLOW:Index_i                                      -
[-        ] process > geneid_WORKFLOW:runGeneid_fetching                           -
[-        ] process > prep_concat                                                  -
[-        ] process > concatenate_Outputs_once                                     -
[-        ] process > gff3addInfo:manageGff3sectionSplit                           -
[-        ] process > gff3addInfo:gff3intersectHints                               -
[-        ] process > gff3addInfo:processLabels                                    -
[-        ] process > gff3addInfo:manageGff3sectionMerge                           -
[-        ] process > gff34portal                                                  -
Execution cancelled -- Finishing pending tasks before exit
Oops ...

Error executing process > 'UncompressFASTA (SampleGenomeSmall.fa.gz)'

Caused by:
  Process `UncompressFASTA (SampleGenomeSmall.fa.gz)` terminated with an error exit status (127)

Command executed:

  if [ ! -s  SampleGenomeSmall.fa ]; then
      echo "unzipping genome SampleGenomeSmall.fa.gz"
      gunzip -c SampleGenomeSmall.fa.gz > SampleGenomeSmall.fa;
  fi

Command exit status:
  127

Command output:
  (empty)

Command error:
  docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/etc/shadow" to rootfs at "/etc/shadow": mount /etc/shadow:/etc/shadow (via /proc/self/fd/6), flags: 0x5001: no such file or directory: unknown.
  time="2023-03-27T09:46:53+02:00" level=error msg="error waiting for container: context canceled"

Work dir:
  /home/katharina/git/geneidx/work/88/1f36528e563427e80b54fd1a2e5099

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

This is my docker version: Docker version 20.10.12, build 20.10.12-0ubuntu4

I feel your pain in making this work for other users. We have the same kind of issues with our containers... I guess documentation is key, but one has to get there first.

It is my hope that Geneidx is strong in single-exon gene prediction. That's why I am curious to try it.

FerriolCalvet commented 1 year ago

Hi Katharina,

Thank you for the quick reply and for pointing out that the link was broken; it should be fixed now. I am glad to see that the sample case now works with Singularity. Regarding the error you are getting, I think I have an explanation. The current implementation of this indexing step (admittedly not ideal) requires that the genome FASTA file provided as input ends with the `.fa.gz` extension. Since your input file is called `genome.fasta.masked.gz`, geneidx is not able to derive the file names properly; you could try `genome.masked.fa.gz`, for example. If you rename the file and re-run, I would expect this error to disappear; let me know otherwise and I will propose other solutions, and if another error appears I can look into it. Thank you!
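A minimal sketch of the rename workaround described above, using a symlink so the original compressed genome stays untouched (the `demo/` directory and file names here are illustrative placeholders, not part of the pipeline):

```shell
# Illustrative setup: a stand-in for the real masked genome file.
mkdir -p demo
touch demo/genome.fasta.masked.gz

# Give the genome the .fa.gz suffix that geneidx's indexing step expects.
# A relative symlink avoids duplicating a large compressed file.
ln -sf genome.fasta.masked.gz demo/genome.masked.fa.gz

ls -l demo/genome.masked.fa.gz
```

After that, pointing `--genome` at the new `.fa.gz` name (instead of the original `.masked.gz` path) should let the pipeline parse the file names correctly.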

Regarding Docker, I am not an expert in containers, so I cannot tell what might be happening. If using Singularity is not a big problem for you, I would just use Singularity; in any case, I will ask some colleagues and see if they have any idea of what might be happening and how it could be solved.

Thank you!

Ferriol

KatharinaHoff commented 1 year ago

Thank you. I was now able to run the experiment that I had intended to run.

FerriolCalvet commented 1 year ago

I am glad that you were able to run it! Any other feedback is welcome.

Thanks,

Ferriol