pachterlab / sleuth

Differential analysis of RNA-Seq
http://pachterlab.github.io/sleuth
GNU General Public License v3.0
305 stars 95 forks source link

problem with target_mapping #146

Closed DrHeisenberg89 closed 7 years ago

DrHeisenberg89 commented 7 years ago

Hi everyone, I am a bit confused about a painful step of my pipeline using Sleuth. I am trying to move from the transcript level to the gene level and that's why I am using biomaRt (?).

devtools::install_github('pachterlab/sleuth', ref = 'devel', force=TRUE)
mart <- biomaRt::useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", host = 'ensembl.org')
mart
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                                 "external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
                 ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(s2c, target_mapping = t2g)
at this point the output is the following one:

so <- sleuth_prep(s2c, target_mapping = t2g)
reading in kallisto results
dropping unused factor levels
......
normalizing est_counts
47250 targets passed the filter
normalizing tpm
merging in metadata
summarizing bootstraps
......
Warning messages:
1: In check_num_cores(num_cores) :
  It appears that you are running Sleuth from within Rstudio.
Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1.
If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
2: In check_target_mapping(tmp_names, target_mapping) :
  intersection between target_id from kallisto runs and the target_mapping is empty. 
attempted to fix problem by removing .N from target_id, then merging back into target_mapping. please check obj$target_mapping to ensure this new mapping is correct.

I can't understand the problem because I am pretty sure I have already created a table with the specific gene names and they match!

                target_id           ens_gene         ext_gene
333    ENSMUST00000148678 ENSMUSG00000020225           Tmbim4

Help?(!)

PS: please pardon my ignorance in term of bioinformatic jargon

warrenmcg commented 7 years ago

Hi @DrHeisenberg89,

Your sleuth_prep step completed successfully, as only warnings were listed. The important part is that sleuth detected that there was no direct overlap between the target_id of your kallisto results, and the target_id of your target mapping. This frequently happens because the annotations from Ensembl and Gencode include the "version number" with the transcript ID.

For example, the transcript ID you gave, ENSMUST00000148678, is listed as ENSMUST00000148678.2 on the Ensembl website. ENSMUST00000148678 =/= ENSMUST00000148678.2, so merging the target mapping annotations would result in no overlap, and that's what causes the warning. Sleuth automatically tries to correct this common problem, and reports it to you as a warning in case this leads to unexpected behavior downstream.

To avoid the message on subsequent runs, I would recommend you include the version number like this:

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl", host = 'ensembl.org')
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                                 "external_gene_name", "transcript_version"), mart = mart)
t2g <- dplyr::mutate(t2g, target_id = paste(ensembl_transcript_id, transcript_version, sep = "."))
t2g <- dplyr::select(t2g, target_id, ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(s2c, target_mapping = t2g)

if you run head(t2g) after running the above, you should see the following:

             target_id           ens_gene ext_gene
1 ENSMUST00000082423.1 ENSMUSG00000064372    mt-Tp
2 ENSMUST00000082422.1 ENSMUSG00000064371    mt-Tt
3 ENSMUST00000082421.1 ENSMUSG00000064370  mt-Cytb
4 ENSMUST00000082420.1 ENSMUSG00000064369    mt-Te
5 ENSMUST00000082419.1 ENSMUSG00000064368   mt-Nd6
6 ENSMUST00000082418.1 ENSMUSG00000064367   mt-Nd5

To run the gene-level aggregation and following the tutorial, you can run the following line:

so_gene <- sleuth_prep(s2c, target_mapping = t2g, aggregation_column = "ens_gene")

As this is expected behavior for sleuth_prep, this is not an issue or bug. Thus, I'm going to close this. If you have further problems with this, feel free to reopen this issue or submit a question to the kallisto-sleuth Google Group (link)