Closed aleighbrown closed 5 years ago
👍
Same for me as well. I am doing with Salmon.
Step 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
importing GTF (this may take a while)
Error in importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance, :
The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925).
Either isforoms found in the annotation are not quantifed or vise versa.
Specifically:
132721 isoforms were quantified.
132170 isoforms are annotated.
Only 131794 overlap.
927 isoforms quantifed isoforms had no corresponding annoation
This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.
If there is no overlap (as in zero or close) there are two options:
1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
Examples from expression matrix are : ENSMUST00000145263.1, ENSMUST00000078593.2, ENSMUST00000127869.1
Examples of annoation are : ENSMUST00000162332.1, ENSMUST00000172775.3, ENSMUST00000197364.4
Examples of isoforms which were only found im the quantification are : tRNA-Ser-AGY, ORR1D2-int, Lx7
If there is a large overlap but still far from complete there are 3 possibilites:
1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
For more info see the FAQ in the vignette.
For me, the workaround is in this gist:
There was this line
if( j1 < jcCutoff | length(onlyInExp) )
I made it: if( j1 < jcCutoff )
Because, after checking I found that the jaccard index was over 0.99.
I cannot recomend the solution provided by dktanwar since if the annotation does not fit perfectly with the quantified transcripts all downstream analysis will be untrustworthy (since you don't know which genes are affected).
The problem is that the Kallisto people have just directly use the Ensemble files which unfortunately do not fit together (as described here).
I have tried several times to persuade the Ensemble people to fix this to no avail - amongst other see this tweet - which also shows some Venn diagrams of the problem.
I have now also contacted the Kallisto authors to make them aware of this problem with their pre-build indexes.
The solution I went with was to build my own kallisto index, it isn't that difficult; just thought to make you aware.
Hello,
I've used Kallisto for quantification and rather than building my own Kallisto index I downloaded the one they provide for mm10 from here:
https://github.com/pachterlab/kallisto-transcriptome-indices/releases/tag/ensembl-96
Then for isoformSwitchR when I'm building the SwitchList I used the provided gtf and cdna from this download
But I get an error message:
Should these files not work together? This might be more on the kallisto side than IsoformSwitcher's. Should I build my own index instead?
Thank you