Kallisto gtf + quantification giving error with Jaccard similarity

aleighbrown commented 5 years ago

Hello,

I've used Kallisto for quantification and rather than building my own Kallisto index I downloaded the one they provide for mm10 from here:

https://github.com/pachterlab/kallisto-transcriptome-indices/releases/tag/ensembl-96

Then for isoformSwitchR when I'm building the SwitchList I used the provided gtf and cdna from this download

exons = "/Users/annaleigh/cluster/vyplab_reference_genomes/kallisto/mus_musculus/Mus_musculus.GRCm38.96.gtf"
ntfasta = "/Users/annaleigh/cluster/vyplab_reference_genomes/kallisto/mus_musculus/Mus_musculus.GRCm38.cdna.all.fa"
### Create switchAnalyzeRlist
aSwitchList <- importRdata(
  isoformCountMatrix   = kaQuant$counts,
  isoformRepExpression = kaQuant$abundance,
  designMatrix         = myDesign,
  isoformExonAnnoation = exons,
  isoformNtFasta       = ntfasta
)

But I get an error message:

tep 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
    importing GTF (this may take a while)
Error in importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance,  : 
  The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925). 
Either isforoms found in the annotation are not quantifed or vise versa. 
Specifically:
 118401 isoforms were quantified.
 116728 isoforms are annotated.
 Only 116728 overlap.
 1673 isoforms quantifed isoforms had no corresponding annoation

This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.

If there is no overlap (as in zero or close) there are two options:
 1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
 2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
     Examples from expression matrix are : ENSMUST00000118795.7, ENSMUST00000051091.4, ENSMUST00000159347.1 
     Examples of annoation are : ENSMUST00000197124.4, ENSMUST00000206445.1, ENSMUST00000170887.7 
     Examples of isoforms which were only found im the quantification are  : ENSMUST00000201131.3, ENSMUST00000230289.1, ENSMUST00000201677.3 

If there is a large overlap but still far from complete there are 3 possibilites:
 1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
 2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
 3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)

Should these files not work together? This might be more on the kallisto side than IsoformSwitcher's. Should I build my own index instead?

Thank you

dktanwar commented 5 years ago

👍

Same for me as well. I am doing with Salmon.

dktanwar commented 5 years ago

Step 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
    importing GTF (this may take a while)
Error in importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance,  : 
  The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925). 
Either isforoms found in the annotation are not quantifed or vise versa. 
Specifically:
 132721 isoforms were quantified.
 132170 isoforms are annotated.
 Only 131794 overlap.
 927 isoforms quantifed isoforms had no corresponding annoation

This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.

If there is no overlap (as in zero or close) there are two options:
 1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
 2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
     Examples from expression matrix are : ENSMUST00000145263.1, ENSMUST00000078593.2, ENSMUST00000127869.1 
     Examples of annoation are : ENSMUST00000162332.1, ENSMUST00000172775.3, ENSMUST00000197364.4 
     Examples of isoforms which were only found im the quantification are  : tRNA-Ser-AGY, ORR1D2-int, Lx7 

If there is a large overlap but still far from complete there are 3 possibilites:
 1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
 2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
 3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)

For more info see the FAQ in the vignette.

dktanwar commented 5 years ago

For me, the workaround is in this gist:

https://gist.githubusercontent.com/dktanwar/a2ae94850b2b27d73bc94fb11360e3ed/raw/37f03e17e016cefb97cd41d24122ede2d299c1de/importRdata1.R

There was this line

if( j1 < jcCutoff | length(onlyInExp) )

I made it: if( j1 < jcCutoff ) Because, after checking I found that the jaccard index was over 0.99.

kvittingseerup commented 5 years ago

I cannot recomend the solution provided by dktanwar since if the annotation does not fit perfectly with the quantified transcripts all downstream analysis will be untrustworthy (since you don't know which genes are affected).

The problem is that the Kallisto people have just directly use the Ensemble files which unfortunately do not fit together (as described here).

I have tried several times to persuade the Ensemble people to fix this to no avail - amongst other see this tweet - which also shows some Venn diagrams of the problem.

I have now also contacted the Kallisto authors to make them aware of this problem with their pre-build indexes.

aleighbrown commented 5 years ago

The solution I went with was to build my own kallisto index, it isn't that difficult; just thought to make you aware.

kvittingseerup / IsoformSwitchAnalyzeR

Kallisto gtf + quantification giving error with Jaccard similarity #46