nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
885 stars 702 forks source link

incorrect gene_bed format #1063

Closed zhouzhendiao closed 11 months ago

zhouzhendiao commented 1 year ago

Description of the bug

My RNA-seq datasets were captured by TrueSeq ranther than by PolyA, so I set parameter --gene_bed manually. The TrueSeq bed only has 4 columns:

browser position chr1:14695-14814
track name="Covered" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V8 - Genomic regions covered by probes" color=0,0,128 db=hg19
chr1    14694   14814   ref|WASH7P,ref|NR_024540,ens|ENST00000438504,ens|ENST00000538476,ens|ENST00000488147,ens|ENST00000541675,ens|ENST00000423562
chr1    14928   15048   ref|WASH7P,ref|NR_024540,ens|ENST00000438504,ens|ENST00000538476,ens|ENST00000488147,ens|ENST00000541675,ens|ENST00000423562
chr1    15752   15948   ref|WASH7P,ref|NR_024540,ens|ENST00000538476,ens|ENST00000438504,ens|ENST00000488147,ens|ENST00000541675,ens|ENST00000423562
chr1    16603   17068   ref|WASH7P,ref|NR_024540,ens|ENST00000538476,ens|ENST00000438504,ens|ENST00000488147,ens|ENST00000541675,ens|ENST00000423562
chr1    17235   17421   ref|WASH7P,ref|NR_024540,ens|ENST00000488147,ens|ENST00000438504,ens|ENST00000538476,ens|ENST00000541675,ens|ENST00000423562,miRNA|hsa-miR-6859-3p
chr1    17604   17724   ref|WASH7P,ref|NR_024540,ens|ENST00000488147,ens|ENST00000438504,ens|ENST00000538476,ens|ENST00000541675,ens|ENST00000423562
chr1    17893   18097   ref|WASH7P,ref|NR_024540,ens|ENST00000488147,ens|ENST00000438504,ens|ENST00000538476,ens|ENST00000541675,ens|ENST00000423562
chr1    18250   18390   ref|WASH7P,ref|NR_024540,ens|ENST00000488147,ens|ENST00000538476,ens|ENST00000438504,ens|ENST00000541675,ens|ENST00000423562
chr1    24768   24888   ref|WASH7P,ref|NR_024540,ens|ENST00000488147,ens|ENST00000538476,ens|ENST00000438504,ens|ENST00000541675,ens|ENST00000423562
chr1    65467   65587   -
chr1    69028   69630   ref|OR4F5,ref|NM_001005484,ens|ENST00000335137,ccds|CCDS30547
chr1    69674   70013   ref|OR4F5,ref|NM_001005484,ens|ENST00000335137,ccds|CCDS30547
chr1    129155  129273  ens|ENST00000453576
chr1    173732  173852  ens|ENST00000466557
chr1    334078  334304  ens|ENST00000455207,ens|ENST00000431812,ens|ENST00000455464,ens|ENST00000601814,ens|ENST00000445840
chr1    367681  368620  ref|OR4F3,ref|OR4F29,ref|OR4F16,ref|NM_001005221,ref|NM_001005224,ref|NM_001005277,ens|ENST00000455207,ens|ENST00000426406

Error occured when I run until RSEQC_READDISTRIBUTION step:

-[nf-core/rnaseq] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:BAM_RSEQC:RSEQC_READDISTRIBUTION (GM1004CT)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:BAM_RSEQC:RSEQC_READDISTRIBUTION (GM1004CT)` terminated with an error exit status (1)

Command executed:

  read_distribution.py \
      -i GM1004CT.markdup.sorted.bam \
      -r S33266436_Covered.bed \
      > GM1004CT.read_distribution.txt

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:BAM_RSEQC:RSEQC_READDISTRIBUTION":
      rseqc: $(read_distribution.py --version | sed -e "s/read_distribution.py //g")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Skipping mount /var/lib/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  processing S33266436_Covered.bed ... Traceback (most recent call last):
    File "/usr/local/bin/read_distribution.py", line 302, in <module>
      main()
    File "/usr/local/bin/read_distribution.py", line 191, in main
      intergenic_down1kb_base,intergenic_down5kb_base,intergenic_down10kb_base) = process_gene_model(options.ref_gene_model)
    File "/usr/local/bin/read_distribution.py", line 75, in process_gene_model
      utr_3 = obj.getUTR(utr=3)
    File "/usr/local/lib/python3.7/site-packages/qcmodule/BED.py", line 430, in getUTR
      strand=fields[5]
  IndexError: list index out of range

By default, ranseq pipeline will generate gene_bed from gtf file. The required bed format seems like BED6/BED12.Corresponding value of fields[5] is strand info(+/-).

So how can I change the bed format?

Command used and terminal output

No response

Relevant files

No response

System information

No response

pinin4fjords commented 11 months ago

@zhouzhendiao Thank you for reaching out and providing detailed information regarding the issue you've encountered with the nf-core RNA-seq pipeline. It seems that there is a format incompatibility with the BED file you are using.

The error message you are seeing is indicative of a mismatch between the expected input format for the read_distribution.py script and the format of your BED file. The script is expecting a BED file with at least six columns, where the sixth column contains strand information (either '+' or '-'). However, your BED file seems to have only four columns and does not include strand information, which is why the script fails with an IndexError.

The nf-core RNA-seq pipeline typically requires a BED6 or BED12 format for accurate processing, where:

To resolve this issue, you will need to convert your 4-column BED file into a BED6 or BED12 format. This involves adding two additional columns for the score and strand. The score can often be set to a default value (such as 0 or .) if not used, and the strand can be set to +, -, or . if the strand information is not available.

Here is an example of how to format a BED6 file from your existing data:

chr1    14694   14814   ref|WASH7P    0   +
chr1    14928   15048   ref|WASH7P    0   +
...

Please ensure that the strand information (+/-) is accurate for your data. If you do not have this information, you may use . as a placeholder, but be aware that this might affect downstream analysis that is strand-specific.

If you require assistance with converting your BED file to the appropriate format, there are tools available that can help automate this process, such as awk for command-line text processing or more specialized bioinformatics tools.

Once you have a correctly formatted BED file, you should be able to rerun the pipeline without encountering the previous error.

Since this is unrelated to the functionality of this pipeline I'm closing the issue. Good luck with your project.

zhouzhendiao commented 11 months ago

Hi @pinin4fjords ,

Thanks for your detailed tutorial!