nf-core / smrnaseq

A small-RNA sequencing analysis pipeline
https://nf-co.re/smrnaseq
MIT License

INDEX_GENOME: Bowtie build error #340

Open AhmedMohamed1993 opened 4 months ago

AhmedMohamed1993 commented 4 months ago

Description of the bug

The genome index step fails with both the 2.3.0 and dev versions when I run the command below. All miRNA reference files are from miRBase and the genome FASTA is from Ensembl. The test run worked properly. Any suggestions on what is causing the issue?

Command used and terminal output

$ nextflow run nf-core/smrnaseq -r dev --input 'SampleSheet.csv' --outdir '/results' \
--mirtrace_species hsa --fasta 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' \
--hairpin 'mature.fa' \
--mature 'hairpin.fa' \
--mirna_gtf 'hsa.gff3' \
--skip_mirdeep --protocol 'qiaseq' -profile singularity

Output:
ERROR ~ Error executing process > 'NFCORE_SMRNASEQ:INDEX_GENOME (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz)'

Caused by:
  Process `NFCORE_SMRNASEQ:INDEX_GENOME (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz)` terminated with an error exit status (1)

Command executed:

  # Remove any special base characters from reference genome FASTA file
  sed '/^[^>]/s/[^ATGCatgc]/N/g' Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz > genome.edited.fa
  sed -i 's/ .*//' genome.edited.fa

  # Build bowtie index
  bowtie-build genome.edited.fa genome --threads 6

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SMRNASEQ:INDEX_GENOME":
      bowtie: $(echo $(bowtie --version 2>&1) | sed 's/^.*bowtie-align-s version //; s/ .*$//')
  END_VERSIONS

Command exit status:
  1

Command output:
  Settings:
    Output files: "genome.*.ebwt"
    Line rate: 6 (line is 64 bytes)
    Lines per side: 1 (side is 64 bytes)
    Offset rate: 5 (one in 32)
    FTable chars: 10
    Strings: unpacked
    Max bucket size: default
    Max bucket size, sqrt multiplier: default
    Max bucket size, len divisor: 24
    Difference-cover sample period: 1024
    Endianness: little
    Actual local endianness: little
    Sanity checking: disabled
    Assertions: disabled
    Random seed: 0
    Sizeofs: void*:8, int:4, long:8, size_t:8
  Input files DNA, FASTA:
    genome.edited.fa
  Reading reference sizes
    Time reading reference sizes: 00:00:10
  Calculating joined length
  Writing header
  Reserving space for joined string
  Joining reference sequences
    Time to join reference sequences: 00:00:00
  Total time for call to driver() for forward index: 00:00:10

Command error:
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
  Warning: Encountered empty reference sequence
  Warning: Encountered reference sequence with only gaps
    Time reading reference sizes: 00:00:10
  Calculating joined length
  Writing header
  Reserving space for joined string
  Joining reference sequences
  Reference file does not seem to be a FASTA file
    Time to join reference sequences: 00:00:00
  Total time for call to driver() for forward index: 00:00:10
  Command: bowtie-build --wrapper basic-0 --threads 6 genome.edited.fa genome

Relevant files

No response

System information

No response

christopher-mohr commented 4 months ago

Hi @AhmedMohamed1993, did you try with the extracted (not .gz) fasta file?
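For context, the INDEX_GENOME script shown above runs sed directly on the file passed to --fasta, so a gzip-compressed input reaches sed as compressed bytes rather than FASTA text. A minimal sketch of extracting first and then applying the same clean-up, using a tiny stand-in file (demo.fa and the temporary directory are hypothetical, not from the report):

```shell
# Stand-in data: a tiny FASTA and its gzip-compressed copy (hypothetical names)
workdir=$(mktemp -d)
printf '>chr1 demo sequence\nACGTRYACGT\n' > "$workdir/demo.fa"
gzip -c "$workdir/demo.fa" > "$workdir/demo.fa.gz"

# Decompress to a new file; the .gz copy is left untouched
gzip -dc "$workdir/demo.fa.gz" > "$workdir/demo.extracted.fa"

# Plain FASTA starts with '>'; check the first byte of the extracted file
first_plain=$(head -c 1 "$workdir/demo.extracted.fa")

# Same clean-up as the INDEX_GENOME script, applied to the extracted file:
# non-ACGT bases become N, and headers are cut at the first space
sed '/^[^>]/s/[^ATGCatgc]/N/g' "$workdir/demo.extracted.fa" | sed 's/ .*//' > "$workdir/genome.edited.fa"
cleaned=$(cat "$workdir/genome.edited.fa")
echo "$cleaned"
```

With the real data, the extracted Homo_sapiens.GRCh38.dna.primary_assembly.fa would be the file passed to --fasta.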

AhmedMohamed1993 commented 4 months ago

Extracting the file helped, but the pipeline now stops at a different point.

ERROR ~ Error executing process > 'NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT'

Caused by: Process NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT terminated with an error exit status (1)

Command executed:

  # Cleanup the GTF if mirbase html form is broken
  GTF="hsa.gff3"
  sed 's/&gt;/>/g' $GTF | sed 's#<br>#\n#g' | sed 's#</p>##g' | sed 's#<p>##g' | sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' > ${GTF}_html_cleaned.gtf
  mirtop gff --hairpin mature.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps hsa ./bams/*
  mirtop counts --hairpin mature.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf -o mirtop --sps hsa --add-extra --gff mirtop/mirtop.gff
  mirtop export --format isomir --hairpin mature.fa_igenome.fa_idx.fa --gtf ${GTF}_html_cleaned.gtf --sps hsa -o mirtop mirtop/mirtop.gff
  mirtop stats mirtop/mirtop.gff --out mirtop/stats
  mv mirtop/stats/mirtop_stats.log mirtop/stats/full_mirtop_stats.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SMRNASEQ:MIRNA_QUANT:MIRTOP_QUANT":
      mirtop: $(echo $(mirtop --version 2>&1) | sed 's/^.*mirtop //')
  END_VERSIONS

Command exit status: 1

Command output:
  ['gff', '--hairpin', 'mature.fa_igenome.fa_idx.fa', '--gtf', 'hsa.gff3_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'hsa', './bams/27_post_seqcluster.bam', './bams/28_post_seqcluster.bam', './bams/29_post_seqcluster.bam']
  ['counts', '--hairpin', 'mature.fa_igenome.fa_idx.fa', '--gtf', 'hsa.gff3_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'hsa', '--add-extra', '--gff', 'mirtop/mirtop.gff']
  ['export', '--format', 'isomir', '--hairpin', 'mature.fa_igenome.fa_idx.fa', '--gtf', 'hsa.gff3_html_cleaned.gtf', '--sps', 'hsa', '-o', 'mirtop', 'mirtop/mirtop.gff']
  ['stats', 'mirtop/mirtop.gff', '--out', 'mirtop/stats']

Command error:
  04/13/2024 04:04:15 INFO Filtered by being duplicated: 0
  04/13/2024 04:04:15 INFO Filtered by being outside miRNA positions: 18784
  04/13/2024 04:04:15 INFO Filtered by being low score: 0
  04/13/2024 04:04:17 INFO It took 0.426 minutes
  ['gff', '--hairpin', 'mature.fa_igenome.fa_idx.fa', '--gtf', 'hsa.gff3_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'hsa', './bams/27_post_seqcluster.bam', './bams/28_post_seqcluster.bam', './bams/29_post_seqcluster.bam']
  /usr/local/lib/python3.9/site-packages/mirtop/mirna/mintplates.py:512: SyntaxWarning: "is" with a literal. Did you mean "=="?
    if prefix is '':
  04/13/2024 04:04:20 INFO Run convert of GFF to TSV containing expression
  04/13/2024 04:04:20 INFO INFO Reading GFF file mirtop/mirtop.gff
  04/13/2024 04:04:20 INFO INFO Writing TSV file to directory mirtop
  04/13/2024 04:04:20 INFO Missing Parents in hairpin file: 0
  04/13/2024 04:04:20 INFO Missing MiRNAs in GFF file: 0
  04/13/2024 04:04:20 INFO Non valid UID: 0
  04/13/2024 04:04:20 INFO Output file is at mirtop/mirtop.tsv
  04/13/2024 04:04:20 INFO It took 0.001 minutes
  ['counts', '--hairpin', 'mature.fa_igenome.fa_idx.fa', '--gtf', 'hsa.gff3_html_cleaned.gtf', '-o', 'mirtop', '--sps', 'hsa', '--add-extra', '--gff', 'mirtop/mirtop.gff']
  /usr/local/lib/python3.9/site-packages/mirtop/mirna/mintplates.py:512: SyntaxWarning: "is" with a literal. Did you mean "=="?
    if prefix is '':
  04/13/2024 04:04:22 INFO Run export of GFF into other format.
  04/13/2024 04:04:22 INFO INFO Writing TSV file to directory mirtop
  04/13/2024 04:04:22 INFO INFO Reading GFF file mirtop/mirtop.gff
  04/13/2024 04:04:22 INFO Missing Parents in hairpin file: 0
  04/13/2024 04:04:22 INFO Missing MiRNAs in GFF file: 0
  04/13/2024 04:04:22 INFO Non valid UID: 0
  04/13/2024 04:04:22 INFO Output file is at mirtop/mirtop_rawData.tsv
  04/13/2024 04:04:22 INFO It took 0.001 minutes
  ['export', '--format', 'isomir', '--hairpin', 'mature.fa_igenome.fa_idx.fa', '--gtf', 'hsa.gff3_html_cleaned.gtf', '--sps', 'hsa', '-o', 'mirtop', 'mirtop/mirtop.gff']
  /usr/local/lib/python3.9/site-packages/mirtop/mirna/mintplates.py:512: SyntaxWarning: "is" with a literal. Did you mean "=="?
    if prefix is '':
  04/13/2024 04:04:24 INFO Run stats.
  04/13/2024 04:04:24 INFO Reading: mirtop/mirtop.gff
  ['stats', 'mirtop/mirtop.gff', '--out', 'mirtop/stats']
  Traceback (most recent call last):
    File "/usr/local/bin/mirtop", line 10, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/mirtop/command_line.py", line 34, in main
      stats(kwargs["args"])
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/stats.py", line 38, in stats
      out.append(_calc_stats(fn))
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/stats.py", line 82, in _calc_stats
      df = _summary(lines)
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/stats.py", line 130, in _summary
      df_sum = _add_missing(df_sum)
    File "/usr/local/lib/python3.9/site-packages/mirtop/gff/stats.py", line 110, in _add_missing
      df2 = {'category': category, 'sample': df['sample'].iat[0], 'counts': 0}
    File "/usr/local/lib/python3.9/site-packages/pandas/core/indexing.py", line 2221, in __getitem__
      return self.obj._get_value(*key, takeable=self._takeable)
    File "/usr/local/lib/python3.9/site-packages/pandas/core/series.py", line 1066, in _get_value
      return self._values[label]
  IndexError: index 0 is out of bounds for axis 0 with size 0
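The final IndexError is raised by df['sample'].iat[0] on an empty selection, which would be consistent with mirtop/mirtop.gff containing no feature lines (note the 18784 reads filtered as being outside miRNA positions). That is an interpretation, not something confirmed in this thread. A quick check, sketched here on a stand-in file with hypothetical content rather than the real output:

```shell
# Stand-in for mirtop/mirtop.gff: header lines only, no feature lines
# (hypothetical content; point gff at the real file in the work directory)
workdir=$(mktemp -d)
gff="$workdir/mirtop.gff"
printf '## stand-in header\n## another header line\n' > "$gff"

# Count lines that are neither comments nor blank; 0 means no features were written
n_features=$(grep -v '^#' "$gff" | grep -c . || true)
echo "feature lines: $n_features"
```

If the real file also reports 0 feature lines, the stats step has nothing to summarise, and the inputs to mirtop gff would be the next thing to inspect.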

christopher-mohr commented 4 months ago

Does it work if you do not specify --mirna_gtf hsa.gff3?
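As a sketch of that test run, the command below is the one from the report with --mirna_gtf dropped and the extracted FASTA substituted; all other arguments are copied unchanged, and the uncompressed file name is an assumption based on the earlier reply. The snippet only assembles and prints the command so it can be reviewed before running:

```shell
# Command from the report minus --mirna_gtf, using the extracted FASTA
# (assembled as a string and printed only; nothing is executed here)
cmd="nextflow run nf-core/smrnaseq -r dev --input SampleSheet.csv --outdir /results \
--mirtrace_species hsa --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--hairpin mature.fa --mature hairpin.fa \
--skip_mirdeep --protocol qiaseq -profile singularity"
echo "$cmd"
```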

lpantano commented 1 month ago

I am happy to help with this. Sorry I am late; I am starting to work on this pipeline more now.

If you still have access to the working directory where this error happens, I am happy to look at the files and see what is going on. Thanks!