kvittingseerup / IsoformSwitchAnalyzeR

An R package to Identify, Annoatate and Visialize Isoform Switches with Functional Consequences (from RNA-seq data)
101 stars 18 forks source link

IsoformSwitchAnalyzeR crashes when importing Stringtie gtf #132

Closed oneillkza closed 1 year ago

oneillkza commented 2 years ago

I'm not sure what format old Stringtie used, but Stringtie2 outputs a gtf file containing both the transcript locations and the quantities.

When I try the method suggested in the vignette, using importIsoformExpression, it doesn't recognise the StringtTie output:

> test_import <- importIsoformExpression(parentDir = '.')
Step 1 of 3: Identifying which algorithm was used...
Error in importIsoformExpression(parentDir = ".") : 
  There were no directories containing the file names/suffixes typically generated by Kallisto/Salmon/RSEM/StringTie. Have you renamed the quantification files? (if so you should probably use the "sampleVector" argument instead).

When I try to import an individual GTF using importGTF, it throws the following error:

> aSwitchList <- importGTF(pathToGTF = '210713_TKRA_0001_RNA002_run1.bam_stringtie.gtf')
Importing GTF (this may take a while)...
Error in getSeqlevelsReplacementMode(value, seqlevels(x)) : 
  the supplied 'seqlevels' must be a character vector with no NAs and no duplicates

Note: the same GTF imports fine using rtracklayer::import.gtf(), so I'll probably hack around this for now. But support for current tools would be good.

oneillkza commented 2 years ago

OK, digging into the source code of importIsoformExpression, it's looking for a file called t_data.ctab. This is apparently BallGown output, which is optional with the current version of StringTie.

So I generated that output, and placed it in separate directories for each sample.

Now I have this issue:

> test_import <- importIsoformExpression(parentDir = '.')
Step 1 of 3: Identifying which algorithm was used...
    The quantification algorithm used was: StringTie
Error in importIsoformExpression(parentDir = ".") : 
  When importing StringTie results the 'readLength' argument must be specified.
 This argument must be set to the number of base pairs sequenced (e.g. if the 
 quantified data is 75 bp paired ends 'readLength' should be set to 75.

My data is Nanopore sequence data -- the read length is highly variable. I've passed it the median read length, and it seems to work, but what impact is this likely to have on the analysis?

oneillkza commented 2 years ago

I still can't import the merged gtf from stringtie2. But it looks like that may be due to some entried with '*' for the strand. (makeTxDbFromGFF doesn't like this). I took a look at those entries, and they all look erroneous, and have no expression in the actual samples, so this may be a stringtie2 bug.

kvittingseerup commented 2 years ago

Thanks for eaching out :-)

Just for clarity StringTie (and StringTie2 - they are now merged so only one exists) can create the quantification files by setting the -eB flag.

With regards to the read length, I guess the median is as good as it gets. An alternative would be to quantify the StringTie generated transcriptome with a tool with a better quantification engine such as Salmon or Kallisto.

With regards to the GTF file that sounds strange - which version of IsoformSwitchAnalyzeR are you using? Also I've never seen un-stranded transcripts from StringTie - is that common?

oneillkza commented 2 years ago

Thanks @kvittingseerup . I am running version 1.16.0, which I believe is the latest one from Bioconductor.

I'm fairly sure the issue with strandless transcripts is a bug in Stringtie. Here's an example entry:

chr1    StringTie   transcript  24556443    24557027    1000    .   .   gene_id "MSTRG.440"; transcript_id "MSTRG.440.1"; 
chr1    StringTie   exon    24556443    24557027    1000    .   .   gene_id "MSTRG.440"; transcript_id "MSTRG.440.1"; exon_number "1"; 

There are only a handful of these, all are for "novel" genes, and none of them register as being expressed in either of the samples I'm comparing when I re-run stringtie using this annotation. That said, I did try filtering those entries out, and importGTF still throws this error:

Error in getSeqlevelsReplacementMode(value, seqlevels(x)) : 
  the supplied 'seqlevels' must be a character vector with no NAs and no duplicates

I was able to get it to work in a very round about way by doing the following:

1) Import using rtracklayer::import.gff. 2) Filter the resulting granges to only the exon entries. 3) Remove the 'type' column, since importRdata thinks its presence means there are still non-exon entries. 4) Copy the gene_id column into the ref_gene_id column. 5) Set fixStringTieAnnotationProblem = FALSE when running imporRdata, or it crashes somewhere while trying to handle the ref_gene_id column.

But it might be good if importGTF would just work. Please let me know if you'd like me to send you the annotation gtf for debugging.

kvittingseerup commented 2 years ago

I agree. Could you email me the GTF so I can update importGTF ?

I would naturally keep it confidential and only use it for debugging.

oneillkza commented 2 years ago

Thanks @kvittingseerup . I've emailed you at your DTU address with a link to the file.

kvittingseerup commented 2 years ago

Could you try and use IsoformSwitchAnalyzeR v 1.16.03 or newer?

kvittingseerup commented 1 year ago

Closing due to no activity. If you respond I will open it again :)