jmw86069 / splicejam

Sashimi plots for RNA-seq data using detected transcripts
https://splicejam.vtc.vt.edu
27 stars 6 forks source link

Error in iMatrixTx[detectedTxUse, , drop = FALSE] : subscript out of bounds #1

Closed eric423627 closed 3 years ago

eric423627 commented 4 years ago

Hi, I am using your great software want to call differential splicing. However, when i reached here, there was a problem: diffSpliceL <- runDiffSplice( iMatrixTx=iMatrixTxTPM, detectedTx= detectedTx, tx2geneDF=tx2geneDF, txColname=txColname, geneColname=geneColname, iDesign=iDesign, iContrasts=iContrasts, cutoffFDR=0.05, cutoffFold=1.5, collapseByGene=TRUE, spliceTest="t", verbose=FALSE, useVoom=FALSE);

Saying: Error in iMatrixTx[detectedTxUse, , drop = FALSE] : subscript out of bounds How can i solve them?

I use mouse as species, i followed your pipeline, all is well, only one part when i do geneMatch <- match(rownames(txiTx[[assayNames[1]]]), tx2geneDF[,txColname]); It seems the order is chaotic, so i use this following instead. assays1=lapply(txiTx[assayNames], function(x){log2(1+x)}) rowData1=DataFrame(tx2geneDF[geneMatch,,drop=FALSE]) colData1=DataFrame(sampleDF) rownames(rowData1) <- NULL TxSE <- SummarizedExperiment( assays=assays1, rowData=rowData1, colData=colData1, );

This "rownames(rowData1) <- NULL" is what i add.

jmw86069 commented 4 years ago

Hi Eric, thank you for the comments and the bug report, I shall take a look!

By eye it appears the issue is that your detectedTx values are not all present inrownames(iMatrixTxTPM). From what you described, I think you might have no rownames(TxSE), which I think will cause problems with downstream functions.

In theory, rownames(TxSE) are taken from rownames(rowData), and are expected to match all rownames in assays. For example, whenever assays(TxSE)[[1]] are returned, the rownames are inherited from rownames(TxSE). The values of rownames(assays[[1]]) are expected to have values from tx2geneDF[[txColname]] to indicate each transcript.

The geneMatch logic is intended to align the txiTx data rows with the tx2geneDF rows. The txiTx and tx2geneDF are not expected to be in the same order, so you're correct that it is probably very chaotic! I think that chaos is to be expected, so I hope that if you use geneMatch it will work? :) I am not sure but I hope so!

If it does not work, can you post the output of these commands?

print(head(rownames(TxSE)))
print(head(rownames(iMatrixTxTPM)))
print(head(detectedTx))
print(head(tx2geneDF))

Again thank you for taking the time to post this bug, I appreciate the feedback!

eric423627 commented 4 years ago

Hi thanks for your reply, If i do not add the NULL line, In this command:

assayNames <- intersect(c("abundance","counts"), names(txiTx)); geneMatch <- match(rownames(txiTx[[assayNames[1]]]), tx2geneDF[,txColname]); sampleDF <- data.frame( Sample=colnames(txiTx[[assayNames[1]]]), Group=rep(c("control","high"), each=3), length.out=ncol(txiTx[[assayNames[1]]])); assays1=lapply(txiTx[assayNames], function(x){log2(1+x)}) rowData1=DataFrame(tx2geneDF[geneMatch,,drop=FALSE]) colData1=DataFrame(sampleDF)

rownames(rowData1) <- NULL

TxSE <- SummarizedExperiment( assays=assays1, rowData=rowData1, colData=colData1, );

it continues saying: Error in .validate_names(rownames, ans_rownames, "assay rownames()", "rowData rownames() / rowRanges names()") : assay rownames() must be NULL or identical to rowData rownames() / rowRanges names()

I examined the order and it is chaotic. And that is the reason why i add a make rowname NULL line into the script.

And as you know, if this can not be performed then the TxSE can not be generated.

Thanks for your time for helping solve the problem.

Best,

And

jmw86069 commented 4 years ago

Ah I think I understand the problem now. The rownames(assays) are not identical to rownames(rowData).

See if this helps?

rownames(tx2geneDF) <- tx2geneDF[[txColname]];

Then as a spot-check you could test to see if rownames for txiTx are present in rownames(tx2geneDF):

# frequency of rownames in txiTx that are present in tx2geneDF
# all items should be TRUE
table( rownames(txiTx[[1]]) %in% rownames(tx2geneDF) )

Everything should be TRUE, hopefully.

Then for curiosity, test how many rows from tx2geneDF are present in txiTx:

# frequency of rownames in tx2geneDF that are present in txiTx
# most but maybe not all items will be TRUE
table( rownames(tx2geneDF) %in% rownames(txiTx[[1]]) )

I have noticed that sometimes txiTx does not contain every entry in tx2geneDF. But I expect every rowname in txiTx to be present in tx2geneDF.

Let me know if this helps!

eric423627 commented 4 years ago

Thank you so much for your reply. I tried your command, most of them are true, but some are false.

rownames(tx2geneDF) <- tx2geneDF[[txColname]] table( rownames(txiTx[[1]]) %in% rownames(tx2geneDF) ):

FALSE TRUE 1959 113311

Should i just remove these FALSE or you have any alternative way?

Many thanks

jmw86069 commented 4 years ago

The answer you might expect, "it depends". :)

In my case, the rows in txiTx that were not in tx2geneDF were due to different versions of the GTF file. The Salmon GTF file was a slightly older Gencode GTF than the Gencode GTF I was using for my analysis.

So in that case, I did not want to remove FALSE entries, but instead use the matched GTF file.

You may have already done this, but I suggest inspecting a few entries in txiTx which are not found in tx2geneDF:

head(setdiff(rownames(txiTx[[1]]), rownames(tx2geneDF)), 20)

If the entries have a version suffix, something like "ENSMUST00000012.1" you might try looking for the entry without the version in tx2geneDF, like this:

subset(tx2geneDF, grepl("ENSMUST00000012", transcript_id))

I'm not sure what your GTF file looks like, but my guess is that Salmon/Kallisto might be slightly modifying the transcript_id, causing a mismatch.

eric423627 commented 4 years ago

Hi James,

Thanks so much for your help. I kind of understand the problem, I will try to look for the matched GTFs. I will keep updated if it works. Thank you.

eric423627 commented 4 years ago

Hi James,

Now this problem is fixed. All the 140000 are TRUE using your command:

table( rownames(tx2geneDF) %in% rownames(txiTx[[1]]) )

Everything looks great also generates nice MA-plot. However, the out of bound problem remains unsolved.

(17:01:13) 26Feb2020: iDesign:

              control high

FA-control1_quant 1 0 FA-control2_quant 1 0 FA-control3_quant 1 0 FA-high1_quant 0 1 FA-high2_quant 0 1 FA-high3_quant 0 1

(17:01:13) 26Feb2020: iContrasts:

     Contrasts

Levels high-control control -1 high 1

runDiffSplice()

if (exists("TxSE")) {

  • all(colnames(TxSE) == iSamples)
  • all(iSamples == rownames(iDesign))
  • all(colnames(iDesign) == rownames(iContrasts))
  • iDesign %*% iContrasts;
  • } Contrasts high-control FA-control1_quant -1 FA-control2_quant -1 FA-control3_quant -1 FA-high1_quant 1 FA-high2_quant 1 FA-high3_quant 1 iMatrixTxTPM <- assays(TxSE[,iSamples])[["abundance"]]; iMatrixTxTPM[iMatrixTxTPM == 0] <- NA; diffSpliceL <- runDiffSplice(
  • iMatrixTx=iMatrixTxTPM,
  • detectedTx = rownames(iMatrixTxTPM),
  • tx2geneDF=tx2geneDF,
  • txColname=txColname,
  • geneColname=geneColname,
  • iDesign=iDesign,
  • iContrasts=iContrasts,
  • cutoffFDR=0.05,
  • cutoffFold=1.5,
  • collapseByGene=TRUE,
  • spliceTest="t",
  • verbose=FALSE,
  • useVoom=FALSE); Error in iMatrixTx[detectedTxUse, , drop = FALSE] : subscript out of bounds

Do you know what is the problem? Thank you so much for your patience.

jmw86069 commented 4 years ago

I apologize for the issues, I'm glad you're patient also!

Can you try verbose=TRUE and see if it gives informative output? I see that internally it uses your detectedTx then queries tx2geneDF to find genes present multiple times. The verbose output may give some insight into whether it is failing during this step.

Also is uses the column in tx2geneDF defined by the argument txColname, make sure that is the correct column name to use. The values in that column should correspond to rownames(iMatrixTxTPM).

eric423627 commented 4 years ago

Hi Thank you for your reply. It did give some instructions:

diffSpliceL <- runDiffSplice(

  • iMatrixTx=iMatrixTxTPM,
  • detectedTx=detectedTx,
  • tx2geneDF=tx2geneDF,
  • txColname=txColname,
  • geneColname=geneColname,
  • iDesign=iDesign,
  • iContrasts=iContrasts,
  • cutoffFDR=0.05,
  • cutoffFold=1.5,
  • collapseByGene=TRUE,
  • spliceTest="t",
  • verbose=TRUE,
  • useVoom=FALSE);

    (10:10:12) 02Mar2020: runDiffSplice(): Started with 36,328 entries representing 16,007 genes.

    (10:10:12) 02Mar2020: runDiffSplice(): Ended with 28,735 entries representing 8,414 multi-entry genes.

    Error in iMatrixTx[detectedTxUse, , drop = FALSE] : subscript out of bounds

jmw86069 commented 4 years ago

I have at last found the source of the problem. Forgive the delay in following up.

The issue originates with Gencode, sorry to point a finger at them. They provide a GTF file, and associated transcripts.fasta file -- but the two files for the same version of Gencode, do not use the same transcript identifiers. sigh Due respect, it's not an easy job. But we fully expect Gencode v32lift37 to be consistent with itself.

For example:

The same discrepancy is seen with gene_id and havana_gene.

This issue must be documented elsewhere -- it throws warnings when using Salmon on transcript alignments, because the ID values do not match.

I have a workaround, but will need to do some testing first. Removing the _# version is not sufficient, because we want tximport() to succeed using tx2gene, which means the values must match. I think the answer is to import transcript-quant data, then convert to gene with tximport::summarizeToGene() and a modified tx2gene.

I can't imagine a scenario where we would ignore expression data for these entries -- presumably transcripts with frequent updates are most susceptible, and any such transcripts may or may not be of high priority to a given project.

jmw86069 commented 4 years ago

FYI I have encountered this error a couple more times, and figured out the best workaround as I understand it.

I think I need a function to make these steps easier, except that it is probably only relevant to Gencode data... Suggestions welcome. :)

Description of the problem:

  1. Gencode GTF describes transcripts for a particular Gencode release.
  2. Gencode transcripts FASTA provides sequence data for the same Gencode release.
  3. These two files contain slightly different identifiers (out of sync), in Gencode.v32 there are 6 entries that differ. For example:

    "ENST00000452496.1_1" is provided by the FASTA file, but is not in the GTF file "ENST00000452496.1_2" is defined in the GTF file.

  4. Therefore, the tx2geneDF data does not match the tximport data.
## Count rownames in txiTx that match tx2geneDF$transcript_id
table(rownames(txiTx[[1]]) %in% tx2geneDF$transcript_id);
#  FALSE   TRUE
#      6 228114

This problem is probably specific to Gencode. (I notice tximport() includes arguments ignoreTxVersion and ignoreAfterBar to address mismatches -- they only work when returning gene data. This problem has been seen before.)

The fix for tx2geneDF:

## Fix tx2geneDF by removing underscore version _#
if (!"transcript_id_v" %in% colnames(tx2geneDF)) {
   tx2geneDF$transcript_id_v <- tx2geneDF$transcript_id;
}
tx2geneDF$transcript_id <- gsub("_[0-9]+", "", tx2geneDF$transcript_id_v);
rownames(tx2geneDF) <- tx2geneDF$transcript_id;

The fix for transcript data via tximport(), following syntax in the vignette:

## Fix txiTx
txiTx <- tximport::tximport(files,
   type="salmon",
   txOut=TRUE);
which_matrix <- which(sapply(txiTx, class) %in% "matrix");
for (i in which_matrix) {
   rownames(txiTx[[i]]) <- gsub("_[0-9]+",
      "",
      rownames(txiTx[[i]]));
}

Now the rownames in txiTx will match the rownames in tx2geneDF, and values in tx2geneDF$transcript_id:

## Confirm it worked
table(rownames(txiTx[[1]]) %in% tx2geneDF$transcript_id);
#   TRUE
# 228120
table(rownames(txiTx[[1]]) %in% rownames(tx2geneDF));
#   TRUE
# 228120

Finally make iMatrixTx and iMatrixTxTPM:

iMatrixTx <- log2(1 + txiTx$counts);
iMatrixTxTPM <- log2(1 + txiTx$abundance);

## Confirm the rownames match
table(rownames(iMatrixTx) %in% tx2geneDF$transcript_id);
#   TRUE
# 228120
table(rownames(iMatrixTx) %in% rownames(tx2geneDF));
#   TRUE
# 228120

Now the function defineDetectedTx() should work successfully, which was the original question!

jmw86069 commented 3 years ago

Closing.