pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

error in transcript2gene #268

Open nbahlis opened 4 years ago

nbahlis commented 4 years ago

tr2g2 <- transcript2gene(species = c("Homo sapiens"), type = "vertebrate",
                         kallisto_out_path = "./output/", ensembl_version = 99,
                         write_tr2g = FALSE)

Querying biomart for transcript and gene IDs of Homo sapiens
Cache found
Error in sort_tr2g(tr2g, kallisto_out_path = kallisto_out_path) :
  Some transcripts in the kallisto index are absent from tr2g.

What is the source of this error? Any advice would be great.

lambdamoses commented 4 years ago

The error message is self-explanatory. This is an error because if you want a gene count matrix, you need to translate the transcripts in the equivalence classes into genes, which is done with this function. But if there's a transcript in the kallisto index that is absent from the tr2g data frame, then with this information, there's no way to translate that transcript into its corresponding gene. This occurs because different Ensembl versions were used for the kallisto index and the tr2g data frame.
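
A minimal base R sketch of the check described above, for finding out which transcripts trigger the error. It uses toy IDs rather than real data. It assumes the tr2g data frame has a `transcript` column and that the index transcript IDs can be read from a `transcripts.txt` file in the kallisto output directory; verify both against your own files. `strip_version` is a hypothetical helper, not part of BUSpaRse.

```r
# Toy stand-in for the transcript IDs in the kallisto index. In practice
# (assumption) these might be read with something like:
#   index_tx <- readLines(file.path("./output", "transcripts.txt"))
index_tx <- c("ENST00000000001.1", "ENST00000000002.3",
              "ENST00000000003.2", "ENST00000000004.1")

# Toy stand-in for the tr2g data frame (column names are assumed).
tr2g <- data.frame(
  transcript = c("ENST00000000001.1", "ENST00000000002.3", "ENST00000000003.1"),
  gene = c("ENSG00000000001.1", "ENSG00000000002.1", "ENSG00000000003.1"),
  stringsAsFactors = FALSE
)

# Transcripts present in the index but absent from tr2g.
missing_tx <- setdiff(index_tx, tr2g$transcript)

# Hypothetical helper: drop the trailing ".<version>" suffix from an ID.
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

# Transcripts still missing once version suffixes are ignored. IDs that
# disappear from this list differ only in version number, which hints at
# mismatched Ensembl releases.
missing_ignoring_version <- setdiff(strip_version(index_tx),
                                    strip_version(tr2g$transcript))

print(missing_tx)
print(missing_ignoring_version)
```

If most of the missing IDs vanish once versions are ignored, mismatched Ensembl releases are the likely cause. IDs that remain missing are absent from the annotation altogether.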

nbahlis commented 4 years ago

Thank you. I thought I had matched the Ensembl version. What is the Ensembl version for the prebuilt transcriptome index and t2g files downloaded with kb ref -d human?

lambdamoses commented 4 years ago

To be honest, I don't know since I did not build that version and usually I build the index myself with whatever Ensembl version I choose. @sbooeshaghi Can you update the documentation of kb to note the Ensembl version, as a matter of transparency?

lambdamoses commented 4 years ago

Another important thing: The Bioconductor 3.11 version of BUSpaRse by default removes scaffolds and haplotypes from the tr2g data frame. The tr2g_* functions not only have an option to remove the scaffolds and haplotypes (the chrs_only argument), but also have options to filter transcripts and genes by biotype. This might have caused the error here, since the prebuilt indices probably did not have such filtering. But the tr2g_* functions also by default extract the transcriptome after removing scaffolds and haplotypes and filtering by biotypes, so you can use this extracted transcriptome to build a kallisto index.
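
To illustrate how this kind of filtering can produce the error, here is a toy base R sketch. It does not reproduce BUSpaRse's actual implementation. The `chromosome_name` column, the `standard_chrs` vector, and all IDs are assumptions made up for the example.

```r
# Toy annotation: two transcripts on standard chromosomes, one on a
# scaffold and one on a haplotype (all rows are invented).
annot <- data.frame(
  transcript = c("ENST00000000001.1", "ENST00000000002.1",
                 "ENST00000000003.1", "ENST00000000004.1"),
  chromosome_name = c("1", "X", "KI270728.1", "CHR_HSCHR6_MHC_COX_CTG1"),
  stringsAsFactors = FALSE
)

# Assumed definition of "standard" human chromosomes for this sketch.
standard_chrs <- c(as.character(1:22), "X", "Y", "MT")

# Mimic scaffold/haplotype removal: keep only standard chromosomes.
tr2g_filtered <- annot[annot$chromosome_name %in% standard_chrs, ]

# An index built from the unfiltered transcriptome contains every transcript.
index_tx <- annot$transcript

# These index transcripts have no entry in the filtered tr2g, which is the
# situation the error message reports.
dropped <- setdiff(index_tx, tr2g_filtered$transcript)
print(dropped)
```

In this toy case the two non-chromosomal transcripts are in the index but not in the filtered tr2g. Building the index from the same filtered transcriptome, as suggested above, avoids the mismatch.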