nf-core / smrnaseq

A small-RNA sequencing analysis pipeline
https://nf-co.re/smrnaseq
MIT License

--mirna_gtf for organism with no miRBase GFF file #329

Open OliverH96 opened 3 months ago

OliverH96 commented 3 months ago

Description of the bug

I'm using sheep miRNA data. miRBase contains a few entries for sheep miRNAs but does not provide a GFF file on its download page. I instead used a GFF of sheep miRNAs from the RumimiR database (https://rumimir.sigenae.org/), but I hit an error at the mirtop_quant step.

I've uploaded the GFF file I used, but changed the file extension to .txt so that it could be uploaded.

My params file:

input: '/gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/rawData/Fastq/F1_SeminalPlasma_Samplesheet.csv'
outdir: '/gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/smrnaseq_output'
with_umi: false
mirtrace_species: 'oar'
fasta: '/gpfs01/home/sbzoh//refGenome/Ovis_aries_rambouillet.ARS-UI_Ramb_v2.0.dna.toplevel.fasta'
mirna_gtf: '/gpfs01/home/sbzoh//refGenome/rumimir_sheep.gff'
mature: '/gpfs01/home/sbzoh//refGenome/mature.fa'
hairpin: '/gpfs01/home/sbzoh//refGenome/hairpin.fa'
filter_contamination: false
skip_mirdeep: true

Command used and terminal output

## Command used
nextflow run nf-core/smrnaseq -profile singularity -params-file params.yaml

## Tail of output containing error
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/smrnaseq] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT'

Caused by:
  Process `NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT` terminated with an error exit status (1)

Command executed:

  #Cleanup the GTF if mirbase html form is broken
  GTF="rumimir_sheep.gff"
  sed 's/&gt;/>/g' $GTF | sed 's#<br>#\n#g' | sed 's#</p>##g' | sed 's#<p>##g' | sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' > ${GTF}_html_cleaned.gtf
  mirtop gff --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar ./bams/*
  mirtop counts --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar --add-extra --gff mirtop/mirtop.gff
  mirtop export --format isomir --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf --sps oar -o mirtop mirtop/mirtop.gff
  mirtop stats mirtop/mirtop.gff --out mirtop/stats
  mv mirtop/stats/mirtop_stats.log mirtop/stats/full_mirtop_stats.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT":
      mirtop: $(echo $(mirtop --version 2>&1) | sed 's/^.*mirtop //')
  END_VERSIONS

Command exit status:
  1

Command output:
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']

Command error:
  /usr/local/lib/python3.9/site-packages/mirtop/mirna/mintplates.py:512: SyntaxWarning: "is" with a literal. Did you mean "=="?
    if prefix is '':
  03/20/2024 06:34:35 INFO Run annotation
  03/20/2024 06:34:35 ERROR Database not found in --mirna rumimir_sheep.gff_html_cleaned.gtf. Use --database argument to add a custom source.
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']
  Traceback (most recent call last):
    File "/usr/local/bin/mirtop", line 10, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/mirtop/command_line.py", line 31, in main
      reader(kwargs["args"])
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/__init__.py", line 24, in reader
      database = mapper.guess_database(args)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 23, in guess_database
      return _guess_database_file(args.gtf, args.database)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 40, in _guess_database_file
      raise ValueError("Database not found in %s header" % gff)
  ValueError: Database not found in rumimir_sheep.gff_html_cleaned.gtf header

Work dir:
  /gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/work/d8/42e9ee613e17eb83f5262cfae51a33

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Relevant files

nextflow.log rumimir_sheep.txt

System information

Nextflow version: 23.10.1
Hardware: HPC
Executor: slurm
Container engine: Singularity
OS: CentOS Linux
Version of nf-core/smrnaseq: 2.3.0

OliverH96 commented 3 months ago

I tried again on the latest version (2.3.1) and get the same error.

christopher-mohr commented 2 months ago

Hi @OliverH96, for now you could try to pass the additional argument --database to mirtop using a custom config. This would require adding something like:

process {
    withName: 'MIRTOP_QUANT' {
        ext.args = "--database RumimiR"
    }
}

You have to check if RumimiR is the term used in your provided gff. As far as I understand, mirtop searches for known tags in the gff file and therefore fails in your case.
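
The header check described above can be approximated before launching the pipeline. The following is a minimal sketch and not mirtop's actual code: it assumes the database is named in the `#` comment lines at the top of the GFF, and both the function name `find_database_tag` and the list of known names are hypothetical.

```python
from typing import Iterable, Optional

# Assumed set of database names to look for; mirtop's real list may differ.
KNOWN_DATABASES = ("mirbase", "mirgenedb")


def find_database_tag(lines: Iterable[str],
                      known: Iterable[str] = KNOWN_DATABASES) -> Optional[str]:
    """Return the first known database name found in the '#' header lines."""
    known = tuple(known)
    for line in lines:
        if not line.strip():
            # Ignore blank lines between header comments.
            continue
        if not line.startswith("#"):
            # Header comments come first in a GFF; stop at the first record.
            break
        lowered = line.lower()
        for name in known:
            if name in lowered:
                return name
    return None


if __name__ == "__main__":
    # Hypothetical path; replace with the GFF passed to --mirna_gtf.
    with open("rumimir_sheep.gff") as handle:
        tag = find_database_tag(handle)
    print(tag if tag else "no known database name in the header")
```

If this finds nothing for a custom GFF, that is consistent with the "Database not found in ... header" error and suggests the `--database` argument is needed.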

OliverH96 commented 2 months ago

Apologies for getting back to you so late. Adding the --database argument did seem to advance the pipeline slightly, but I am now getting a different error:

ERROR ~ Error executing process > 'NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT'

Caused by:
  Process `NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT` terminated with an error exit status (1)

Command executed:

  #Cleanup the GTF if mirbase html form is broken
  GTF="rumimir_sheep.gff"
  sed 's/&gt;/>/g' $GTF | sed 's#<br>#\n#g' | sed 's#</p>##g' | sed 's#<p>##g' | sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' > ${GTF}_html_cleaned.gtf
  mirtop gff --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar ./bams/*
  mirtop counts --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps oar --add-extra --gff mirtop/mirtop.gff
  mirtop export --format isomir --hairpin hairpin.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf --sps oar -o mirtop mirtop/mirtop.gff
  mirtop stats mirtop/mirtop.gff --out mirtop/stats
  mv mirtop/stats/mirtop_stats.log mirtop/stats/full_mirtop_stats.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT":
      mirtop: $(echo $(mirtop --version 2>&1) | sed 's/^.*mirtop //')
  END_VERSIONS

Command exit status:
  1

Command output:
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']

Command error:
  /usr/local/lib/python3.9/site-packages/mirtop/mirna/mintplates.py:512: SyntaxWarning: "is" with a literal. Did you mean "=="?
    if prefix is '':
  05/02/2024 05:12:45 INFO Run annotation
  05/02/2024 05:12:45 INFO Database different than miRBase or MirGeneDB
  05/02/2024 05:12:45 INFO If you get an error when loading,
  05/02/2024 05:12:45 INFO report it to https://github.com/miRTop/mirtop/issues
  ['gff', '--hairpin', 'hairpin.fa_igenome.fa_idx.fa', '--gtf', 'rumimir_sheep.gff_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'oar', './bams/Sire_A_8324_Control_seqcluster.bam', './bams/Sire_A_8401_Control_seqcluster.bam', './bams/Sire_A_8631_Biosolids_seqcluster.bam', './bams/Sire_A_8698_Biosolids_seqcluster.bam', './bams/Sire_B_8335_Control_seqcluster.bam', './bams/Sire_B_8433_Control_seqcluster.bam', './bams/Sire_B_8607_Biosolids_seqcluster.bam', './bams/Sire_B_8796_Biosolids_seqcluster.bam', './bams/Sire_C_8235_Control_seqcluster.bam', './bams/Sire_C_8431_Control_seqcluster.bam', './bams/Sire_C_8747_Biosolids_seqcluster.bam', './bams/Sire_C_8767_Biosolids_seqcluster.bam', './bams/Sire_D_8231_Control_seqcluster.bam', './bams/Sire_D_8416_Control_seqcluster.bam', './bams/Sire_D_8744_Biosolids_seqcluster.bam', './bams/Sire_D_8758_Biosolids_seqcluster.bam']
  Traceback (most recent call last):
    File "/usr/local/bin/mirtop", line 10, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/mirtop/command_line.py", line 31, in main
      reader(kwargs["args"])
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/__init__.py", line 28, in reader
      matures = mapper.read_gtf_to_precursor(args.gtf)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 172, in read_gtf_to_precursor
      mapped = read_gtf_to_precursor_mirbase(gtf)
    File "/usr/local/lib/python3.9/site-packages/mirtop/mirna/mapper.py", line 333, in read_gtf_to_precursor_mirbase
      id_dict[idname[0]] = name[0]
  IndexError: list index out of range

Work dir:
  /gpfs01/home/sbzoh/F1_Seminal_Plasma_RNA/work/bb/f388eaca99ec7268114f74a3fb2490

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
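
The traceback above ends in `read_gtf_to_precursor_mirbase`, at `id_dict[idname[0]] = name[0]`, which suggests that at least one record in the GFF lacks an attribute the miRBase-style reader expects. A minimal sketch for locating such records follows. Which attribute keys mirtop actually requires is an assumption here (`ID` and `Name`), and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: these are the attribute keys a miRBase-style reader expects.
REQUIRED_KEYS = ("ID", "Name")


def parse_attributes(field: str) -> Dict[str, str]:
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_incomplete_records(
    lines: Iterable[str],
    required: Iterable[str] = REQUIRED_KEYS,
) -> List[Tuple[int, List[str]]]:
    """Return (line number, missing keys) for records lacking required keys."""
    required = tuple(required)
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 9:
            problems.append((number, ["<fewer than 9 columns>"]))
            continue
        attrs = parse_attributes(cols[8])
        missing = [key for key in required if key not in attrs]
        if missing:
            problems.append((number, missing))
    return problems


if __name__ == "__main__":
    # Hypothetical path; replace with the GFF passed to --mirna_gtf.
    with open("rumimir_sheep.gff") as handle:
        for number, missing in find_incomplete_records(handle):
            print(f"line {number}: missing {', '.join(missing)}")
```

Any line this reports is a candidate for the record that triggers the `IndexError`, though the real cause may differ if mirtop parses the attribute column in another way.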