pachterlab / sleuth

Differential analysis of RNA-Seq
http://pachterlab.github.io/sleuth
GNU General Public License v3.0

adding gene names not working? #10

Closed: macmanes closed this issue 9 years ago

macmanes commented 9 years ago

With the following transcriptome:

ftp://ftp.ensembl.org/pub/release-75/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.75.cdna.abinitio.fa.gz

This code does not work: gene names are not updated in the Shiny app, which still shows the non-human-readable 'codes'. I assume that I should be seeing something else. What am I doing wrong here?

mart <- biomaRt::useMart(biomart = "ensembl", dataset = "dmelanogaster_gene_ensembl")

t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                                     "external_gene_name"), mart = mart)

t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
                     ens_gene = ensembl_gene_id, ext_gene = external_gene_name)

so <- sleuth_prep(kal_dirs, s2c, ~ condition, target_mapping = t2g)

pimentel commented 9 years ago

Hi @macmanes,

Thanks for the report. I think you figured it out (since the issue is closed now), but the transcript names in the annotation you quantify against must match the 'target_id' column of the table passed as 'target_mapping'. Most of the transcripts in that annotation appear to begin with 'SNAP':

~/Downloads ❯❯❯ grep '^>' Drosophila_melanogaster.BDGP5.75.cdna.abinitio.fa | head
>SNAP00000000001 cdna:snap chromosome:BDGP5:U:6470062:6470310:-1 transcript_biotype:protein_coding
>SNAP00000000002 cdna:snap chromosome:BDGP5:U:6497572:6500403:-1 transcript_biotype:protein_coding
>SNAP00000000003 cdna:snap chromosome:BDGP5:U:6470510:6471187:-1 transcript_biotype:protein_coding
>SNAP00000000004 cdna:snap chromosome:BDGP5:U:6527654:6535220:1 transcript_biotype:protein_coding
>SNAP00000000005 cdna:snap chromosome:BDGP5:U:6544504:6544794:-1 transcript_biotype:protein_coding
>SNAP00000000006 cdna:snap chromosome:BDGP5:U:6451750:6455515:1 transcript_biotype:protein_coding
>SNAP00000000007 cdna:snap chromosome:BDGP5:U:6532760:6533602:-1 transcript_biotype:protein_coding
>SNAP00000000008 cdna:snap chromosome:BDGP5:U:6536512:6537394:-1 transcript_biotype:protein_coding
>SNAP00000000009 cdna:snap chromosome:BDGP5:U:6526150:6527614:1 transcript_biotype:protein_coding
>SNAP00000000010 cdna:snap chromosome:BDGP5:U:6534198:6534782:-1 transcript_biotype:protein_coding

These IDs, however, do not appear in the Ensembl biomaRt results:

R> grep('SNAP', t2g$target_id, value = TRUE)
character(0)

I'm not familiar with Drosophila, and I'm a bit weirded out that Ensembl gives you that annotation while the biomaRt results use FlyBase IDs.
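
A quick way to catch this kind of mismatch before calling sleuth_prep is to compare the quantified transcript IDs with the target_id column of the mapping. The sketch below uses only base R. The helper name check_target_mapping and the toy IDs (including the FBtr/FBgn placeholders) are made up for illustration and are not part of sleuth.

```r
# Hypothetical helper (not part of sleuth): report how many quantified
# transcript IDs can be found in the target_id column of a mapping table.
check_target_mapping <- function(quant_ids, t2g) {
  stopifnot("target_id" %in% colnames(t2g))
  matched <- quant_ids %in% t2g$target_id
  list(
    n_quantified = length(quant_ids),
    n_matched    = sum(matched),
    unmatched    = head(quant_ids[!matched])
  )
}

# Toy data standing in for the real inputs: SNAP IDs as in the ab initio
# FASTA above, and placeholder FlyBase-style rows shaped like the biomaRt table.
quant_ids <- c("SNAP00000000001", "SNAP00000000002")
t2g <- data.frame(
  target_id = c("FBtr0000001", "FBtr0000002"),
  ens_gene  = c("FBgn0000001", "FBgn0000002"),
  ext_gene  = c("geneA", "geneB"),
  stringsAsFactors = FALSE
)

res <- check_target_mapping(quant_ids, t2g)
res$n_matched  # 0 here, so no gene names can be attached
```

With real data, quant_ids would presumably come from the target_id column of one of the kallisto abundance.tsv files, and a match count of zero (or close to it) points to an annotation mismatch rather than a sleuth bug.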

ttriche commented 9 years ago

This is something I stumbled upon with TxDbLite -- indexing/annotating from the FASTA sometimes leads you to a different conclusion than doing it from biomaRt. Will write this up; I had considered deprecating Dmel due to its lack of support within BioC OrganismDbi packages, but you just changed my mind.

--t
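
As a rough illustration of annotating from the FASTA itself, the sketch below pulls the ID and the biotype out of header lines like the ones quoted earlier in the thread. The function name parse_fasta_headers is hypothetical and this is not how TxDbLite does it. The ab initio headers carry no gene name, so this only yields IDs and biotypes, not a full transcript-to-gene table.

```r
# Hypothetical helper: turn Ensembl-style FASTA header lines into a table
# keyed by target_id, the first whitespace-delimited token after '>'.
parse_fasta_headers <- function(headers) {
  headers <- sub("^>", "", headers)
  fields  <- strsplit(headers, " +")
  # Pick out the 'transcript_biotype:' field when present, else NA.
  biotype <- vapply(fields, function(f) {
    hit <- grep("^transcript_biotype:", f, value = TRUE)
    if (length(hit) == 0) NA_character_ else sub("^transcript_biotype:", "", hit[1])
  }, character(1))
  data.frame(
    target_id = vapply(fields, `[`, character(1), 1),
    biotype   = biotype,
    stringsAsFactors = FALSE
  )
}

# Two of the header lines from the FASTA discussed in this issue.
headers <- c(
  ">SNAP00000000001 cdna:snap chromosome:BDGP5:U:6470062:6470310:-1 transcript_biotype:protein_coding",
  ">SNAP00000000002 cdna:snap chromosome:BDGP5:U:6497572:6500403:-1 transcript_biotype:protein_coding"
)
fa_ids <- parse_fasta_headers(headers)
```

For a real file, the header lines could be collected with grep("^>", readLines(path), value = TRUE), where path is wherever the decompressed FASTA lives.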

(Sent by email on Wed, Aug 19, 2015 at 10:52 AM in reply to Harold Pimentel's comment above.)