pachterlab / sleuth

Differential analysis of RNA-Seq
http://pachterlab.github.io/sleuth
GNU General Public License v3.0

Sleuth-Error in check_target_mapping #257

Open nmorf opened 3 years ago

nmorf commented 3 years ago

Hello,

I'm trying to use biomaRt to retrieve the gene names for Apis mellifera from Ensembl, so that I can analyze the data generated by kallisto using sleuth.

I run into the same error that was reported in 2017 (link below) and haven't been able to fix it myself. Could someone point me to a possible solution that does not involve editing the FASTA files?

https://github.com/pachterlab/sleuth/issues/111

Here is the code I ran and the error message that I get.

    mart <- useMart('metazoa_mart', host = 'metazoa.ensembl.org')
    mart <- useDataset('amellifera_eg_gene', mart)
    t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                                         "external_gene_name"), mart = mart)
    t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
                         ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
    so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g)

    reading in kallisto results
    dropping unused factor levels
    ........................
    Error in check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column)) :
      couldn't solve nonzero intersection
    In addition: There were 25 warnings (use warnings() to see them)

Thank you, nm
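The message suggests that none of the `target_id` values in `target_mapping` match the target names in the kallisto output, so a quick overlap check before calling `sleuth_prep` can help narrow things down. Below is a minimal sketch in base R, not part of sleuth itself. The vectors `kallisto_ids` and `mapping_ids` are hypothetical stand-ins for the `target_id` column of an `abundance.tsv` and of `t2g`; the example IDs are taken from the human abundance table posted later in this thread.

```r
# Hypothetical example vectors; in practice, read kallisto_ids from the
# target_id column of one abundance.tsv and take mapping_ids from t2g$target_id.
kallisto_ids <- c("ENST00000456328.2", "ENST00000450305.2", "ENST00000488147.1")
mapping_ids  <- c("ENST00000456328", "ENST00000450305", "ENST00000488147")

# Direct overlap: zero here, because the kallisto IDs carry a version suffix.
n_direct <- length(intersect(kallisto_ids, mapping_ids))

# Overlap after dropping a trailing ".N" version suffix from the kallisto IDs.
stripped_ids <- sub("\\.[0-9]+$", "", kallisto_ids)
n_stripped <- length(intersect(stripped_ids, mapping_ids))

cat("direct matches:", n_direct, "\n")
cat("matches after stripping versions:", n_stripped, "\n")
```

If both counts are zero on real data, the two sets of names probably differ by more than a version suffix, and it is worth printing a few values from each side and comparing them by eye.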

gcamprecios commented 2 years ago

Good morning,

This is my first time using sleuth after pseudoalignment with kallisto, and I'm quite new to this. Everything runs well except when I try to collapse transcripts to genes with target_mapping. I get exactly the same error as nmorf above, and I was wondering whether it had been solved somewhere else. I can't seem to find an answer, and I've tried generating all kinds of files to use with this function. Here is the code I am using, which is basically what I see in the walkthroughs and from everybody!

To generate the t2g file:

    mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl",
                             host = 'ensembl.org')
    t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_transcript_id_version",
                                         "ensembl_gene_id", "ensembl_gene_id_version",
                                         "external_gene_name", "description", "chromosome_name",
                                         "start_position", "end_position", "strand", "entrezgene_id"),
                          mart = mart)
    t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
                         ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
    t2g <- dplyr::select(t2g, c('target_id', 'ens_gene', 'ext_gene'))

To run the sleuth_prep function:

    so122 <- sleuth_prep(metadata122, target_mapping = t2g, aggregation_column = 'ens_gene',
                         read_bootstrap_tpm = TRUE, extra_bootstrap_summary = TRUE,
                         transformation_function = function(x) log2(x + 0.5), num_cores = 2)

The error I get all the time (no matter how I construct the t2g data.frame):

    Warning: It appears that you are running Sleuth from within Rstudio. Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1. If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
    reading in kallisto results
    dropping unused factor levels
    Error in check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column)) :
      couldn't solve nonzero intersection

And here are the first rows of our abundance .tsv file from kallisto (I use the .h5 files for sleuth_prep):


target_id |   |   |   |   |   |   |   | length | eff_length | est_counts | tpm
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
ENST00000456328.2 | ENSG00000223972.5 | OTTHUMG00000000961.2 | OTTHUMT00000362751.1 | DDX11L1-202 | DDX11L1 | 1657 | processed_transcript | 1657 | 1453.07 | 0 | 0
ENST00000450305.2 | ENSG00000223972.5 | OTTHUMG00000000961.2 | OTTHUMT00000002844.2 | DDX11L1-201 | DDX11L1 | 632 | transcribed_unprocessed_pseudogene | 632 | 428.3 | 0 | 0
ENST00000488147.1 | ENSG00000227232.5 | OTTHUMG00000000958.1 | OTTHUMT00000002839.1 | WASH7P-201 | WASH7P | 1351 | unprocessed_pseudogene | 1351 | 1147.07 | 0 | 0
ENST00000619216.1 | ENSG00000278267.1 | - | - | MIR6859-1-201 | MIR6859-1 | 68 | miRNA | 68 | 34.625 | 0 | 0
ENST00000473358.1 | ENSG00000243485.5 | OTTHUMG00000000959.2 | OTTHUMT00000002840.1 | MIR1302-2HG-202 | MIR1302-2HG | 712 | lncRNA | 712 | 508.07 | 0 | 0

sigusn commented 2 years ago

Hi, I think there could be some issue with the abundance file. I usually have only one "target_id" column, but yours has more columns without headings. Example of my abundance.tsv:

    target_id          length  eff_length  est_counts  tpm
    ENST00000631435.1  12      6.64286     0           0
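One possible reading of the extra unnamed columns is that they are a single pipe-delimited `target_id` that was split into cells when the table was pasted. The base R sketch below shows how such a name could be inspected. The string `id` is an assumption: it is reassembled from the first row of the table above, including a trailing pipe, and may not match the file exactly.

```r
# Assumed full target name, reassembled from the first row of the table above.
id <- "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"

# TRUE if the name contains a literal pipe character.
has_pipe <- grepl("|", id, fixed = TRUE)

# Split on the literal pipe. strsplit() does not return an empty trailing
# element for a delimiter at the very end of the string.
fields <- strsplit(id, "|", fixed = TRUE)[[1]]

print(has_pipe)
print(fields)
```

Running the same `grepl()` check on the `target_id` column of a real `abundance.tsv` would show whether the names carry these extra fields.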

gcamprecios commented 2 years ago

Hi @sigusn, thanks very much for the response. Indeed, I found another page where this whole issue was discussed and solved back in 2019. My problem was that I generated my kallisto results with GENCODE v40, so all my abundance files had very long "target_id" names, which made them impossible to match against the t2g table. I used their code to rename the targets in all the abundance files at once, leaving only the ENS name.

I leave here the page with the discussion and solution. Thanks!

https://groups.google.com/g/kallisto-and-applications/c/KQ8782UD35E/m/hbqqMOgGBwAJ
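For readers who land here with the same problem, the idea of "leaving only the ENS name" can be sketched in base R as below. This is not the code from the linked discussion, only an illustration. `long_ids` is a hypothetical vector whose values are reassembled from the abundance table above, and the field positions (transcript ID first, gene ID second, gene name sixth) are read off that table rather than taken from any specification.

```r
# Hypothetical long target names, reassembled from the table in this thread.
long_ids <- c(
  "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|",
  "ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|"
)

# Option 1: keep only the text before the first pipe (the versioned transcript ID).
short_ids <- sub("\\|.*$", "", long_ids)

# Option 2: leave the kallisto output untouched and build the mapping from the
# long names themselves, so that target_id matches them exactly.
parts <- strsplit(long_ids, "|", fixed = TRUE)
t2g_from_ids <- data.frame(
  target_id = long_ids,
  ens_gene  = vapply(parts, `[`, character(1), 2),  # second field: gene ID
  ext_gene  = vapply(parts, `[`, character(1), 6),  # sixth field: gene name
  stringsAsFactors = FALSE
)

print(short_ids)
print(t2g_from_ids[, c("ens_gene", "ext_gene")])
```

Note that the shortened IDs still carry a version suffix (for example `.2`), so they line up with `ensembl_transcript_id_version` from biomaRt, not with `ensembl_transcript_id`. A data frame shaped like `t2g_from_ids` could in principle be passed as `target_mapping`, but that has not been tested here against any particular sleuth version.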