nunofonseca / irap

integrated RNA-seq Analysis Pipeline
GNU General Public License v3.0
Fix bug in fasta from gtf with spikes #69

Closed pinin4fjords closed 5 years ago

pinin4fjords commented 6 years ago

This PR fixes a bug in generating transcript files (for Kallisto etc.) from .gtf files and genomic .fasta files when spike-ins are included.

Currently, the Make logic does the following:

  1. Makes a GTF file for the ERCC controls in .fasta, using spikein_fasta2gtf.pl
  2. Appends the ERCC GTF lines to the main GTF
  3. Appends the ERCC fasta lines to the main fasta
  4. Calls irap_gtf_to_fasta to produce a cDNA file
  5. Concatenates the ERCC lines to the result

Step 5 should not be necessary if the ERCC lines were processed correctly at step 4. However, irap_gtf_to_fasta ignores lines without 'transcript_type' or 'transcript_biotype', which includes all exons created at step 1. It therefore outputs only the ERCC transcript lines at an intermediate stage, eliminating the exons. The tophat2_gtf_to_fasta called internally then ignores all of these transcripts, since it works exclusively with exons.
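The failure mode described above can be sketched in a few lines of Python. This is only an illustration, not the actual irap_gtf_to_fasta or tophat2_gtf_to_fasta code: the two GTF records, the "spikein" biotype value, the coordinates, and the helper names `keep_typed_lines` and `exon_lines` are all assumptions made for the example.

```python
import re

# Hypothetical spike-in GTF records: the transcript line carries a biotype
# attribute, the exon line does not (values are illustrative only).
GTF_LINES = [
    'ERCC-00002\tERCC\ttranscript\t1\t1061\t.\t+\t.\t'
    'gene_id "ERCC-00002"; transcript_id "ERCC-00002"; transcript_biotype "spikein";',
    'ERCC-00002\tERCC\texon\t1\t1061\t.\t+\t.\t'
    'gene_id "ERCC-00002"; transcript_id "ERCC-00002";',
]

# Matches either attribute name followed by a quoted value.
BIOTYPE_ATTR = re.compile(r'\b(transcript_type|transcript_biotype)\s+"')


def keep_typed_lines(lines):
    """Stand-in for the first filter: drop records lacking a biotype attribute."""
    return [line for line in lines if BIOTYPE_ATTR.search(line.split("\t")[8])]


def exon_lines(lines):
    """Stand-in for an exon-only consumer: keep records whose feature is 'exon'."""
    return [line for line in lines if line.split("\t")[2] == "exon"]


typed = keep_typed_lines(GTF_LINES)   # the exon record is dropped here
usable = exon_lines(typed)            # nothing is left for the exon-only step
print(len(typed), len(usable))
```

Under these assumptions the first filter keeps only the transcript record, and the exon-only step then has nothing to work with, which is why the spike-in sequences never reach the cDNA file without the extra concatenation.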

This PR simplifies the logic in the following ways:

I've tested the fix and it works.