Closed eric423627 closed 3 years ago
Hi Eric, thank you for the comments and the bug report, I shall take a look!
By eye it appears the issue is that your detectedTx
values are not all present inrownames(iMatrixTxTPM)
. From what you described, I think you might have no rownames(TxSE)
, which I think will cause problems with downstream functions.
In theory, rownames(TxSE)
are taken from rownames(rowData)
, and are expected to match all rownames in assays
. For example, whenever assays(TxSE)[[1]]
are returned, the rownames are inherited from rownames(TxSE)
. The values of rownames(assays[[1]])
are expected to have values from tx2geneDF[[txColname]]
to indicate each transcript.
The geneMatch
logic is intended to align the txiTx
data rows with the tx2geneDF
rows. The txiTx
and tx2geneDF
are not expected to be in the same order, so you're correct that it is probably very chaotic! I think that chaos is to be expected, so I hope that if you use geneMatch
it will work? :) I am not sure but I hope so!
If it does not work, can you post the output of these commands?
print(head(rownames(TxSE)))
print(head(rownames(iMatrixTxTPM)))
print(head(detectedTx))
print(head(tx2geneDF))
Again thank you for taking the time to post this bug, I appreciate the feedback!
Hi thanks for your reply, If i do not add the NULL line, In this command:
assayNames <- intersect(c("abundance","counts"), names(txiTx)); geneMatch <- match(rownames(txiTx[[assayNames[1]]]), tx2geneDF[,txColname]); sampleDF <- data.frame( Sample=colnames(txiTx[[assayNames[1]]]), Group=rep(c("control","high"), each=3), length.out=ncol(txiTx[[assayNames[1]]])); assays1=lapply(txiTx[assayNames], function(x){log2(1+x)}) rowData1=DataFrame(tx2geneDF[geneMatch,,drop=FALSE]) colData1=DataFrame(sampleDF)
TxSE <- SummarizedExperiment( assays=assays1, rowData=rowData1, colData=colData1, );
it continues saying: Error in .validate_names(rownames, ans_rownames, "assay rownames()", "rowData rownames() / rowRanges names()") : assay rownames() must be NULL or identical to rowData rownames() / rowRanges names()
I examined the order and it is chaotic. And that is the reason why i add a make rowname NULL line into the script.
And as you know, if this can not be performed then the TxSE can not be generated.
Thanks for your time for helping solve the problem.
Best,
And
Ah I think I understand the problem now. The rownames(assays)
are not identical to rownames(rowData)
.
See if this helps?
rownames(tx2geneDF) <- tx2geneDF[[txColname]];
Then as a spot-check you could test to see if rownames for txiTx are present in rownames(tx2geneDF):
# frequency of rownames in txiTx that are present in tx2geneDF
# all items should be TRUE
table( rownames(txiTx[[1]]) %in% rownames(tx2geneDF) )
Everything should be TRUE, hopefully.
Then for curiosity, test how many rows from tx2geneDF are present in txiTx:
# frequency of rownames in tx2geneDF that are present in txiTx
# most but maybe not all items will be TRUE
table( rownames(tx2geneDF) %in% rownames(txiTx[[1]]) )
I have noticed that sometimes txiTx does not contain every entry in tx2geneDF. But I expect every rowname in txiTx to be present in tx2geneDF.
Let me know if this helps!
Thank you so much for your reply. I tried your command, most of them are true, but some are false.
rownames(tx2geneDF) <- tx2geneDF[[txColname]] table( rownames(txiTx[[1]]) %in% rownames(tx2geneDF) ):
FALSE TRUE 1959 113311
Should i just remove these FALSE or you have any alternative way?
Many thanks
The answer you might expect, "it depends". :)
In my case, the rows in txiTx that were not in tx2geneDF were due to different versions of the GTF file. The Salmon GTF file was a slightly older Gencode GTF than the Gencode GTF I was using for my analysis.
So in that case, I did not want to remove FALSE entries, but instead use the matched GTF file.
You may have already done this, but I suggest inspecting a few entries in txiTx which are not found in tx2geneDF:
head(setdiff(rownames(txiTx[[1]]), rownames(tx2geneDF)), 20)
If the entries have a version suffix, something like "ENSMUST00000012.1" you might try looking for the entry without the version in tx2geneDF, like this:
subset(tx2geneDF, grepl("ENSMUST00000012", transcript_id))
I'm not sure what your GTF file looks like, but my guess is that Salmon/Kallisto might be slightly modifying the transcript_id, causing a mismatch.
Hi James,
Thanks so much for your help. I kind of understand the problem, I will try to look for the matched GTFs. I will keep updated if it works. Thank you.
Hi James,
Now this problem is fixed. All the 140000 are TRUE using your command:
table( rownames(tx2geneDF) %in% rownames(txiTx[[1]]) )
Everything looks great also generates nice MA-plot. However, the out of bound problem remains unsolved.
control high
FA-control1_quant 1 0 FA-control2_quant 1 0 FA-control3_quant 1 0 FA-high1_quant 0 1 FA-high2_quant 0 1 FA-high3_quant 0 1
Contrasts
Levels high-control control -1 high 1
runDiffSplice()
if (exists("TxSE")) {
- all(colnames(TxSE) == iSamples)
- all(iSamples == rownames(iDesign))
- all(colnames(iDesign) == rownames(iContrasts))
- iDesign %*% iContrasts;
- } Contrasts high-control FA-control1_quant -1 FA-control2_quant -1 FA-control3_quant -1 FA-high1_quant 1 FA-high2_quant 1 FA-high3_quant 1 iMatrixTxTPM <- assays(TxSE[,iSamples])[["abundance"]]; iMatrixTxTPM[iMatrixTxTPM == 0] <- NA; diffSpliceL <- runDiffSplice(
- iMatrixTx=iMatrixTxTPM,
- detectedTx = rownames(iMatrixTxTPM),
- tx2geneDF=tx2geneDF,
- txColname=txColname,
- geneColname=geneColname,
- iDesign=iDesign,
- iContrasts=iContrasts,
- cutoffFDR=0.05,
- cutoffFold=1.5,
- collapseByGene=TRUE,
- spliceTest="t",
- verbose=FALSE,
- useVoom=FALSE); Error in iMatrixTx[detectedTxUse, , drop = FALSE] : subscript out of bounds
Do you know what is the problem? Thank you so much for your patience.
I apologize for the issues, I'm glad you're patient also!
Can you try verbose=TRUE
and see if it gives informative output? I see that internally it uses your detectedTx
then queries tx2geneDF
to find genes present multiple times. The verbose output may give some insight into whether it is failing during this step.
Also is uses the column in tx2geneDF
defined by the argument txColname
, make sure that is the correct column name to use. The values in that column should correspond to rownames(iMatrixTxTPM)
.
Hi Thank you for your reply. It did give some instructions:
diffSpliceL <- runDiffSplice(
- iMatrixTx=iMatrixTxTPM,
- detectedTx=detectedTx,
- tx2geneDF=tx2geneDF,
- txColname=txColname,
- geneColname=geneColname,
- iDesign=iDesign,
- iContrasts=iContrasts,
- cutoffFDR=0.05,
- cutoffFold=1.5,
- collapseByGene=TRUE,
- spliceTest="t",
- verbose=TRUE,
- useVoom=FALSE);
(10:10:12) 02Mar2020: runDiffSplice(): Started with 36,328 entries representing 16,007 genes.
(10:10:12) 02Mar2020: runDiffSplice(): Ended with 28,735 entries representing 8,414 multi-entry genes.
Error in iMatrixTx[detectedTxUse, , drop = FALSE] : subscript out of bounds
I have at last found the source of the problem. Forgive the delay in following up.
The issue originates with Gencode, sorry to point a finger at them. They provide a GTF file, and associated transcripts.fasta file -- but the two files for the same version of Gencode, do not use the same transcript identifiers. sigh Due respect, it's not an easy job. But we fully expect Gencode v32lift37 to be consistent with itself.
For example:
gencode.v32lift37.annotation.gtf.gz
contains transcript_id "ENST00000620688.2_3"
gencode.v32lift37.transcripts.fa.gz
contains transcript_id "ENST00000620688.2_2"
The same discrepancy is seen with gene_id
and havana_gene
.
This issue must be documented elsewhere -- it throws warnings when using Salmon on transcript alignments, because the ID values do not match.
I have a workaround, but will need to do some testing first. Removing the _#
version is not sufficient, because we want tximport()
to succeed using tx2gene, which means the values must match. I think the answer is to import transcript-quant data, then convert to gene with tximport::summarizeToGene()
and a modified tx2gene
.
I can't imagine a scenario where we would ignore expression data for these entries -- presumably transcripts with frequent updates are most susceptible, and any such transcripts may or may not be of high priority to a given project.
FYI I have encountered this error a couple more times, and figured out the best workaround as I understand it.
I think I need a function to make these steps easier, except that it is probably only relevant to Gencode data... Suggestions welcome. :)
Description of the problem:
These two files contain slightly different identifiers (out of sync), in Gencode.v32 there are 6 entries that differ. For example:
"ENST00000452496.1_1"
is provided by the FASTA file, but is not in the GTF file"ENST00000452496.1_2"
is defined in the GTF file.
tx2geneDF
data does not match the tximport
data.## Count rownames in txiTx that match tx2geneDF$transcript_id
table(rownames(txiTx[[1]]) %in% tx2geneDF$transcript_id);
# FALSE TRUE
# 6 228114
This problem is probably specific to Gencode. (I notice tximport()
includes arguments ignoreTxVersion
and ignoreAfterBar
to address mismatches -- they only work when returning gene data. This problem has been seen before.)
The fix for tx2geneDF:
## Fix tx2geneDF by removing underscore version _#
if (!"transcript_id_v" %in% colnames(tx2geneDF)) {
tx2geneDF$transcript_id_v <- tx2geneDF$transcript_id;
}
tx2geneDF$transcript_id <- gsub("_[0-9]+", "", tx2geneDF$transcript_id_v);
rownames(tx2geneDF) <- tx2geneDF$transcript_id;
The fix for transcript data via tximport()
, following syntax in the vignette:
## Fix txiTx
txiTx <- tximport::tximport(files,
type="salmon",
txOut=TRUE);
which_matrix <- which(sapply(txiTx, class) %in% "matrix");
for (i in which_matrix) {
rownames(txiTx[[i]]) <- gsub("_[0-9]+",
"",
rownames(txiTx[[i]]));
}
Now the rownames in txiTx
will match the rownames in tx2geneDF
, and values in tx2geneDF$transcript_id
:
## Confirm it worked
table(rownames(txiTx[[1]]) %in% tx2geneDF$transcript_id);
# TRUE
# 228120
table(rownames(txiTx[[1]]) %in% rownames(tx2geneDF));
# TRUE
# 228120
Finally make iMatrixTx
and iMatrixTxTPM
:
iMatrixTx <- log2(1 + txiTx$counts);
iMatrixTxTPM <- log2(1 + txiTx$abundance);
## Confirm the rownames match
table(rownames(iMatrixTx) %in% tx2geneDF$transcript_id);
# TRUE
# 228120
table(rownames(iMatrixTx) %in% rownames(tx2geneDF));
# TRUE
# 228120
Now the function defineDetectedTx()
should work successfully, which was the original question!
Closing.
Hi, I am using your great software want to call differential splicing. However, when i reached here, there was a problem: diffSpliceL <- runDiffSplice( iMatrixTx=iMatrixTxTPM, detectedTx= detectedTx, tx2geneDF=tx2geneDF, txColname=txColname, geneColname=geneColname, iDesign=iDesign, iContrasts=iContrasts, cutoffFDR=0.05, cutoffFold=1.5, collapseByGene=TRUE, spliceTest="t", verbose=FALSE, useVoom=FALSE);
Saying: Error in iMatrixTx[detectedTxUse, , drop = FALSE] : subscript out of bounds How can i solve them?
I use mouse as species, i followed your pipeline, all is well, only one part when i do geneMatch <- match(rownames(txiTx[[assayNames[1]]]), tx2geneDF[,txColname]); It seems the order is chaotic, so i use this following instead. assays1=lapply(txiTx[assayNames], function(x){log2(1+x)}) rowData1=DataFrame(tx2geneDF[geneMatch,,drop=FALSE]) colData1=DataFrame(sampleDF) rownames(rowData1) <- NULL TxSE <- SummarizedExperiment( assays=assays1, rowData=rowData1, colData=colData1, );
This "rownames(rowData1) <- NULL" is what i add.