kvittingseerup / IsoformSwitchAnalyzeR

An R package to Identify, Annoatate and Visialize Isoform Switches with Functional Consequences (from RNA-seq data)
101 stars 18 forks source link

addORFfromGTF() not recognizing CDS #166

Closed ceOSU closed 1 year ago

ceOSU commented 1 year ago

I have been working with data that I first mapped with HISAT2 then did my alignment, merge, and quantification with StringTie. When I try to use addORFfromGTF I get an error saying there isn't any CDS information in the GTF. I'm using isoformSwitchAnalyzeR version 2.1.3.

Here's everything I've run in R up to this point:

library(IsoformSwitchAnalyzeR)
EFTUD2quant = importIsoformExpression(parentDir = "C:/Users/Caleb/OneDrive - The Ohio State University/BioinfoData/IsoformSwitch/ENCODE_EFTUD2",
                                      addIsofomIdAsColumn = TRUE, showProgress = TRUE, readLength = 100) #reads the t_data.ctab file in each sub directory
names(EFTUD2quant)

#Generate the list of isoform switches
myDesign = data.frame(sampleID = c("SRR4421357","SRR4421358","SRR4422087","SRR4422088"),
                      condition = c("KD","KD","WT","WT"))
SwitchList = importRdata(isoformCountMatrix = EFTUD2quant$counts,
                         isoformRepExpression = EFTUD2quant$abundance,
                         designMatrix = myDesign,
                         isoformExonAnnoation = "C:/Users/Caleb/OneDrive - The Ohio State University/BioinfoData/IsoformSwitch/ENCODE_EFTUD2/EFTUD2_stringtie_merged.gtf",
                         showProgress = TRUE)
SwitchList_filt = preFilter(SwitchList,
                            geneExpressionCutoff = 1, #Cut off genes with less than 1 TPM
                            isoformExpressionCutoff = 0, #removes unused isoforms
                            removeSingleIsoformGenes = TRUE, #removes genes with a single isoform
                            ) #The filtering removed 71938 ( 44.11% of ) transcripts. There is now 91136 isoforms left

#Perform switch test (will take a while)
AnalyzedSwitch = isoformSwitchTestDEXSeq(switchAnalyzeRlist = SwitchList_filt, #Uses DEXseq to analyze the isoform switches
                                         alpha = 0.05, #The FDR p-value has to be less than this
                                         dIFcutoff = 0.1, #THere must be at least a 10% change for the isoforms to be considered
                                         reduceToSwitchingGenes = TRUE, #This only keeps genes with one isoform significantly differently used, makes it run faster
                                         showProgress = TRUE)

#Identify ORFs
SwitchAndORF = addORFfromGTF(AnalyzedSwitch,
                             pathToGTF = "C:/Users/Caleb/OneDrive - The Ohio State University/BioinfoData/IsoformSwitch/Homo_sapiens.GRCh38.108.gtf")

The error message I get is: Step 1 of 2: importing GTF (this may take a while)... Step 2 of 2: Adding ORF... Error in addORFfromGTF(AnalyzedSwitch, pathToGTF = "C:/Users/Caleb/OneDrive - The Ohio State University/BioinfoData/IsoformSwitch/Homo_sapiens.GRCh38.108.gtf") : No ORFs could be added to the switchAnalyzeRlist. Please ensure GTF file have CDS info (and that isoform ids match). Also note that ORF/CDS information cannot be found in the output from asemblers such as StringTie and Cufflinks.Instead use the official GTF that you downloaded from an official source such as Ensembl, Gencode, etc.

The GTF I'm trying to use is what I used for StringTie initially, and I got it straight from Ensembl. I've also tried the new version on Ensembl (Homo_sapiens.GRCH38.109.gtf), as well as the alternate versions on Ensembl and I get the same error.

kvittingseerup commented 1 year ago

That sounds strange. Could you post the first 10 lines from the GTF and the first few lines when using head() on your AnalyzedSwitch object?

ceOSU commented 1 year ago

I ended up getting it to work by changing ignoreAfterPeriod to TRUE. But for posterity here is the head of the GTF and of my analyzed switch obejct.

head(GTFhead)
                iso_ref          gene_ref        isoform_id            gene_id  condition_1  condition_2 gene_name
106168 isoComp_00000001 geneComp_00000001 ENST00000373020.9 ENSG00000000003.15 plaseholder1 plaseholder2    TSPAN6
106172 isoComp_00000002 geneComp_00000001 ENST00000494424.1 ENSG00000000003.15 plaseholder1 plaseholder2    TSPAN6
106171 isoComp_00000003 geneComp_00000001 ENST00000496771.5 ENSG00000000003.15 plaseholder1 plaseholder2    TSPAN6
106169 isoComp_00000004 geneComp_00000001 ENST00000612152.4 ENSG00000000003.15 plaseholder1 plaseholder2    TSPAN6
106170 isoComp_00000005 geneComp_00000001 ENST00000614008.4 ENSG00000000003.15 plaseholder1 plaseholder2    TSPAN6
106166 isoComp_00000006 geneComp_00000002 ENST00000373031.5  ENSG00000000005.6 plaseholder1 plaseholder2      TNMD
       ref_gene_id   gene_biotype                    iso_biotype class_code gene_overall_mean gene_value_1 gene_value_2
106168      TSPAN6 protein_coding                 protein_coding          =                 0            0            0
106172      TSPAN6 protein_coding protein_coding_CDS_not_defined          =                 0            0            0
106171      TSPAN6 protein_coding protein_coding_CDS_not_defined          =                 0            0            0
106169      TSPAN6 protein_coding                 protein_coding          =                 0            0            0
106170      TSPAN6 protein_coding                 protein_coding          =                 0            0            0
106166        TNMD protein_coding                 protein_coding          =                 0            0            0
       gene_stderr_1 gene_stderr_2 gene_log2_fold_change gene_p_value gene_q_value iso_overall_mean iso_value_1
106168            NA            NA                     0            1            1                0           0
106172            NA            NA                     0            1            1                0           0
106171            NA            NA                     0            1            1                0           0
106169            NA            NA                     0            1            1                0           0
106170            NA            NA                     0            1            1                0           0
106166            NA            NA                     0            1            1                0           0
       iso_value_2 iso_stderr_1 iso_stderr_2 iso_log2_fold_change iso_p_value iso_q_value IF_overall IF1 IF2 dIF
106168           0           NA           NA                    0           1           1         NA  NA  NA  NA
106172           0           NA           NA                    0           1           1         NA  NA  NA  NA
106171           0           NA           NA                    0           1           1         NA  NA  NA  NA
106169           0           NA           NA                    0           1           1         NA  NA  NA  NA
106170           0           NA           NA                    0           1           1         NA  NA  NA  NA
106166           0           NA           NA                    0           1           1         NA  NA  NA  NA
       isoform_switch_q_value gene_switch_q_value   PTC
106168                     NA                  NA FALSE
106172                     NA                  NA    NA
106171                     NA                  NA    NA
106169                     NA                  NA FALSE
106170                     NA                  NA FALSE
106166                     NA                  NA FALSE

 head(AnalyzedSwitch)
               iso_ref          gene_ref      isoform_id         gene_id condition_1 condition_2 gene_name gene_biotype
85624 isoComp_00000020 geneComp_00000004 ENST00000367770 ENSG00000000457          KD          WT     SCYL3           NA
85625 isoComp_00000021 geneComp_00000004 ENST00000367771 ENSG00000000457          KD          WT     SCYL3           NA
85626 isoComp_00000022 geneComp_00000004 ENST00000367772 ENSG00000000457          KD          WT     SCYL3           NA
85627 isoComp_00000023 geneComp_00000004 ENST00000423670 ENSG00000000457          KD          WT     SCYL3           NA
85629 isoComp_00000025 geneComp_00000004    MSTRG.2315.2 ENSG00000000457          KD          WT     SCYL3           NA
85630 isoComp_00000026 geneComp_00000004    MSTRG.2315.3 ENSG00000000457          KD          WT     SCYL3           NA
      iso_biotype gene_overall_mean gene_value_1 gene_value_2 gene_stderr_1 gene_stderr_2 gene_log2_fold_change
85624          NA          3.249689     2.953907     3.545471     0.0184389     0.1800441             0.2625402
85625          NA          3.249689     2.953907     3.545471     0.0184389     0.1800441             0.2625402
85626          NA          3.249689     2.953907     3.545471     0.0184389     0.1800441             0.2625402
85627          NA          3.249689     2.953907     3.545471     0.0184389     0.1800441             0.2625402
85629          NA          3.249689     2.953907     3.545471     0.0184389     0.1800441             0.2625402
85630          NA          3.249689     2.953907     3.545471     0.0184389     0.1800441             0.2625402
      gene_q_value iso_overall_mean iso_value_1 iso_value_2 iso_stderr_1 iso_stderr_2 iso_log2_fold_change iso_q_value
85624           NA        0.1869044  0.37380874 0.000000000   0.37380874  0.000000000         -5.262315644          NA
85625           NA        0.8153609  0.81336773 0.817353996   0.45779004  0.068351602          0.006967842          NA
85626           NA        0.2935263  0.48205034 0.105002350   0.05046671  0.013941773         -2.097142589          NA
85627           NA        0.3537878  0.01020881 0.697366802   0.01020881  0.031190193          5.129402086          NA
85629           NA        0.5025713  0.99984745 0.005295231   0.98019742  0.005295231         -6.044911621          NA
85630           NA        1.0912258  0.27405838 1.908393210   0.20672966  0.302522729          2.755639083          NA
      IF_overall     IF1     IF2      dIF isoform_switch_q_value gene_switch_q_value
85624   0.062875 0.12575 0.00000 -0.12575           8.009326e-01         2.62214e-08
85625   0.253250 0.27440 0.23210 -0.04230           9.937467e-01         2.62214e-08
85626   0.096575 0.16330 0.02985 -0.13345           9.210517e-04         2.62214e-08
85627   0.100550 0.00345 0.19765  0.19420           2.622140e-08         2.62214e-08
85629   0.171050 0.34055 0.00155 -0.33900           5.943468e-01         2.62214e-08
85630   0.313850 0.09235 0.53535  0.44300           6.028529e-02         2.62214e-08
kvittingseerup commented 1 year ago

Thanks for letting me know. I'll add that info to the error message :)