pachterlab / sleuth

Differential analysis of RNA-Seq
http://pachterlab.github.io/sleuth
GNU General Public License v3.0
300 stars 94 forks

Error if aggregating by gene #122

Open ghost opened 7 years ago

ghost commented 7 years ago

Hello, I am trying to aggregate the data by gene. On transcript level everything works fine.

My data set consists of 32 RNA-seq samples; sleuth version: 0.29.0.

This is the command I use: sleuth_prep(s2c, ~1, target_mapping = t2g, aggregation_column = "ens_gene")

reading in kallisto results
dropping unused factor levels
................................
normalizing est_counts
52419 targets passed the filter
normalizing tpm
merging in metadata
aggregating by column: ens_gene
20224 genes passed the filter

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 4128640 rows; more than 3985915 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

7  vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, incomparables = c(0L, NA_integer_))) NULL else as.double(nrow(x) + nrow(i)))
6  [.data.table(y, x, nomatch = if (all.x) NA else 0, on = by, allow.cartesian = allow.cartesian)
5  y[x, nomatch = if (all.x) NA else 0, on = by, allow.cartesian = allow.cartesian]
4  merge.data.table(reads_table, mapping, by = "target_id", all.x = T)
3  merge(reads_table, mapping, by = "target_id", all.x = T)
2  reads_per_base_transform(ret$obs_norm, scale_factor, aggregation_column, mappings, norm_by_length)
1  sleuth_prep(s2c, ~1, target_mapping = t2g, aggregation_column = "ens_gene")

Thanks for the feedback!

warrenmcg commented 7 years ago

Hi @simiii123,

So, your transcript-level sleuth_prep command is something like sleuth_prep(s2c, ~1, target_mapping = t2g), correct? And that doesn't produce an error?

If it does not produce an error, could you do us a favor and run the following code? I'm going to assume that you have your sample_to_covariates table s2c and your target_mapping table t2g.

so <- sleuth::sleuth_prep(s2c, ~1, target_mapping = t2g)
raw_data <- data.table::as.data.table(so$obs_raw[, c("target_id", "eff_len")])
mappings <- dplyr::select(t2g, target_id, ens_gene)
mappings <- data.table::as.data.table(mappings)
result <- merge(raw_data, mappings, by = "target_id", all.x = TRUE)

This code is adapted from the section of sleuth_prep that I think is causing the error. Does it produce the same error? If yes, there is something unexpected either in the target_ids of your raw_data or in your t2g table (possibly NA values?). It would be helpful if you could post your t2g table and the raw_data object you created above here (with the counts and identification data removed, both to maintain the privacy of your dataset and to reduce the size of the resulting table) so that we can get to the bottom of this error.
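Before posting, a couple of quick sanity checks on the objects created above may also be informative (just a sketch):

## quick sanity checks (sketch): look for NAs and duplicated keys
any(is.na(mappings$target_id))        # should be FALSE
any(is.na(result$ens_gene))           # TRUE would mean some target_ids have no gene
any(duplicated(mappings$target_id))   # should be FALSE for a clean t2g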

If you didn't know already, you can create text files of both objects using the following lines:

write.table(raw_data, "test_data.txt", sep = "\t", row.names = FALSE, quote = FALSE)
write.table(t2g, "target_mappings.txt", sep = "\t", row.names = FALSE, quote = FALSE)

Once you have the text files, you can compress them (.zip, .tar.gz, etc.) and the two files should then be small enough to post here (the max size is 25 MB; see here for how to attach files to issues).
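If it's easier, you can also build the archive from within R; one option (the archive name below is just an example) is:

## create a gzipped tar archive of the two text files (name is arbitrary)
utils::tar("sleuth_debug_files.tar.gz",
           files = c("test_data.txt", "target_mappings.txt"),
           compression = "gzip")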

I came across a similar error when developing this section, but I thought the code in place handled it properly. Evidently, your data presents an edge case that is not handled correctly.

ghost commented 7 years ago

Hi Warren,

thanks for the fast response! I ran your lines and there were no errors. I didn't find any NA values introduced by the mapping, so that should be fine. Really appreciate the help!

The files you asked for are attached.

sleuth_test.tar.gz

I have another quick question; not sure this is the place to ask it, though. In sleuth, filtering is only possible at the transcript level. Is there a reason why applying the following steps to a sleuth object (aggregated by gene) would be wrong:

  1. extract raw_counts from sleuth
  2. aggregate by gene
  3. apply expression cutoff (e.g. 10 counts)
  4. filter so_gene$bs_summary$... by the list of genes passing the filter
  5. run models ...

It works, and the models then use the reduced set (a rough code sketch of what I mean is below). But are there any statistics computed prior to running the model that would now be invalid?
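Roughly, the steps above in code; the slot and column names here (e.g. obs_raw, bs_summary$obs_counts) are my guesses rather than guaranteed sleuth internals:

## 1-2: extract raw counts and aggregate to the gene level
raw <- merge(so_gene$obs_raw, t2g[, c("target_id", "ens_gene")], by = "target_id")
gene_counts <- aggregate(est_counts ~ ens_gene + sample, data = raw, FUN = sum)

## 3: expression cutoff, e.g. at least 10 counts in at least one sample
pass_filter <- unique(gene_counts$ens_gene[gene_counts$est_counts >= 10])

## 4: restrict the bootstrap summary to the passing genes (slot name assumed)
keep <- rownames(so_gene$bs_summary$obs_counts) %in% pass_filter
so_gene$bs_summary$obs_counts <- so_gene$bs_summary$obs_counts[keep, ]

## 5: fit the models as usual, e.g. sleuth_fit(so_gene, ~condition, "full")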

Many thanks!

warrenmcg commented 7 years ago

@simiii123, if it didn't produce an error, then we haven't identified the problem (though thanks for sending the files over!). I realized that I misread the error message, and so I've modified the code. See below:

so <- sleuth::sleuth_prep(s2c, ~1, target_mapping = t2g)
## modified line to include "sample" column
raw_data <- data.table::as.data.table(so$obs_raw[, c("target_id", "eff_len", "sample")])

mappings <- dplyr::select(t2g, target_id, ens_gene)
mappings <- data.table::as.data.table(mappings)
result <- merge(raw_data, mappings, by = "target_id", all.x = TRUE)

## new code that I think will reproduce the error
norm_data <- data.table::as.data.table(so$obs_norm[, c("target_id", "sample")])
scale_factor <- result[, scale_factor := median(eff_len), by = list(sample, ens_gene)]
scale_factor <- scale_factor[, eff_len := NULL]
scale_factor <- data.table::as.data.table(dplyr::select(scale_factor,
            target_id, sample, scale_factor))
new_result <- merge(norm_data, scale_factor, by = c("target_id", "sample"), all.x = TRUE)
new_result <- merge(new_result, mappings, by = "target_id", all.x = TRUE)

Let me know if this new code causes the same error you saw earlier.

If it does, could you resend me the updated raw_data object, and also the norm_data object in the same way you did last time?

NOTE: This will include sample names, so let me know if that is a problem given the privacy of your dataset. If it is, you can temporarily rename your samples with generic names in your s2c (sample1, sample2, etc.) using s2c$sample <- paste0("sample", seq(1, 32)). Keep s2c$path the same.

With regard to your comment about filtering at the gene level: it's a good question, but I think it warrants a separate issue/suggestion, so I'll start one.

ghost commented 7 years ago

Hi @warrenmcg,

thanks again for the fast feedback! I also appreciate that you forwarded the request about filtering at the gene level.

The good news is that your new code reproduced the error at the last line, new_result <- merge(new_result, mappings, by = "target_id", all.x = TRUE):

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 4128640 rows; more than 3985915 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

raw_data.txt.gz

Unfortunately, the file sizes for raw_data and norm_data exceed the limit here, so for now I only attached raw_data; norm_data is identical to it except that it lacks the eff_len column. Is this OK?

It sounds like you already know what the error is? I hope I didn't overlook anything and produce the error myself; I would hate to have wasted your time. That said, I have used sleuth on other data sets (with fewer samples) in the past and it always worked nicely.

warrenmcg commented 7 years ago

@simiii123, no worries! This shouldn't happen, so even if it is something on your end, we should add a check that fails sooner with a meaningful error message so users know what went wrong and can fix it.

Are you able to send the norm_data separately? I won't be able to reproduce the error on my end without it. I suspect that some combination of the norm_data and the mappings is causing the issue. For example, one thing you can check is whether the new_result table (before the final line) or mappings contain NA values. If NAs are introduced into both of them, every NA key will be matched against every other NA key, leading to lots of duplicate entries. This is the scenario that happened to me during development.
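Something like this should be enough to check (a sketch):

## sketch: look for NA keys that could produce the runaway join
any(is.na(new_result$target_id))
any(is.na(mappings$target_id))
any(is.na(mappings$ens_gene))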

Klim314 commented 7 years ago

Hi @warrenmcg, I've run into the same issue (transcript-level analysis runs without error while gene-level aggregation fails).

> so = sleuth_prep(s2c, md, target_mapping = t2g)
reading in kallisto results
dropping unused factor levels
......
normalizing est_counts
44304 targets passed the filter
normalizing tpm
merging in metadata
summarizing bootstraps
......

> so = sleuth_prep(s2c, md, target_mapping = t2g, aggregation_column = "ens_gene")
reading in kallisto results  

dropping unused factor levels  
......  
normalizing est_counts  
44304 targets passed the filter  
normalizing tpm  
merging in metadata
aggregating by column: ens_gene
6185 genes passed the filter
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 939798 rows; more than 668358 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

However, the above code you provided does not raise any error. What data would you require for troubleshooting?

Klim314 commented 7 years ago

I've done a quick follow-up for my case. The issue seems to lie in the reads_per_base_transform function in measurement_error.R; in particular, it seems to stem from the following bit:

413  reads_table <- merge(reads_table, mapping, by = 'target_id', all.x=T)

This bug seems to have been introduced in commit f16ff40, with the conversion of reads_table to a data.table and the subsequent use of merge. By default, data.table does not allow Cartesian joins whose result has more rows than nrow(x) + nrow(i), which typically happens when large tables have duplicate keys (as we have here).
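A toy example of the data.table behaviour, as I understand it (not sleuth-specific):

library(data.table)
## both tables repeat the key, so the join is many-to-many:
## 3 x 3 = 9 result rows > nrow(x) + nrow(i) = 6, which trips the check
x <- data.table(target_id = rep("t1", 3), est_counts = 1:3)
i <- data.table(target_id = rep("t1", 3), ens_gene = c("g1", "g2", "g3"))
merge(x, i, by = "target_id", all.x = TRUE)                          # errors as above
merge(x, i, by = "target_id", all.x = TRUE, allow.cartesian = TRUE)  # returns 9 rows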

I'm wondering if just setting allow.cartesian = TRUE would fix this, or if you have other concerns.

https://stackoverflow.com/questions/23087358/why-is-allow-cartesian-required-at-times-when-when-joining-data-tables-with-dupl

warrenmcg commented 7 years ago

Hi @Klim314, thanks for your work in trying to identify the error. You are correct that the error is caused by not allowing Cartesian joins. However, a Cartesian join that yields more rows than the original data tables combined should not occur unless you have duplicate target IDs, which are themselves problematic. Depending on how this error arises, the fix may simply be to throw a more informative error, since this is a situation that should not happen.

In my previous testing, this problem was caused by NA values being introduced into the target ID column, which then produced a large number of duplicate matches.

It's strange that the above code does not reproduce the error. At a bare minimum, I would need raw_data, norm_data, and mappings from the code I posted above. If that requires multiple postings, no problem.

@simiii123, if you could send your norm_data table, that would be helpful!

Thanks!

benjsmith commented 7 years ago

Hi @warrenmcg , I have been experiencing the same error since installing the latest version, although it worked perfectly on the same data in an older version.

After following the steps in this thread, I found the error was originally due to some target_ids not having associated gene IDs (ext_gene in my case): they were filled with NAs, and your "new" code did reproduce the error.

However, I went back and generated a new t2g with biomaRt so that, this time, I was sure all transcript IDs had associated ext_gene and ens_gene values. This time I got the same error from sleuth_prep, but the new code did not reproduce it. I did notice that the final new_result data table in the new code contained duplicate ens_gene columns, i.e. ens_gene.x and ens_gene.y. This seemed to happen because, in the first of the final two lines (reproduced below), scale_factor already contained ens_gene, so the second line duplicated that column. Do you see the same thing?

new_result <- merge(norm_data, scale_factor, by = c("target_id", "sample"), all.x = TRUE)
new_result <- merge(new_result, mappings, by = "target_id", all.x = TRUE)

I don't know if it's the same underlying issue, so it may require a separate thread, but I'm experiencing additional errors when I try to run the sleuth web app without gene aggregation on this data set. sleuth_prep ran fine, but the volcano plot showed "Error: All arguments must be named" where the table would normally display when selecting points in the plot. Also, in gene view, I'm getting "Error: subscript out of bounds". All other functions seem to be working though and the test table is populated normally with ext_gene and ens_gene values in it.

Let me know if you would also like me to post my data or any screenshots.

Thanks!

Klim314 commented 7 years ago

@warrenmcg

I've taken a closer look at the mappings and noticed duplicate (target_id, ens_gene) pairs, stemming from multiple gene names being assigned to the same pair (this was found in the org.Mm.eg.db annotation data, but not in the biomaRt data). This resulted in a t2g map that looked like this:

> head(dplyr::arrange(duplicated_pairs, target_id))
           target_id           ens_gene     ext_gene
1 ENSMUST00000000266 ENSMUSG00000026535      Ifi202b
2 ENSMUST00000000266 ENSMUSG00000026535 LOC100044068
3 ENSMUST00000000574 ENSMUSG00000000562       Adora3
4 ENSMUST00000000574 ENSMUSG00000000562       Tmigd3
5 ENSMUST00000002452 ENSMUSG00000002379      Ndufa11
6 ENSMUST00000002452 ENSMUSG00000002379       Gm4943

Switching it out for a different t2g map with no duplicates seems to have fixed this.
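For reference, a sketch of how the duplicated_pairs table above can be obtained, plus one simple way to collapse t2g to a single row per transcript (what I actually did was switch to a different t2g map, but this should achieve the same thing):

## transcripts that appear in more than one row of t2g
dup_ids <- unique(t2g$target_id[duplicated(t2g$target_id)])
duplicated_pairs <- t2g[t2g$target_id %in% dup_ids, ]

## keep a single row per target_id before passing t2g to sleuth_prep
t2g_dedup <- t2g[!duplicated(t2g$target_id), ]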

warrenmcg commented 7 years ago

Thanks @Klim314 and @benjsmith for your sleuthing work here! (pardon the pun).

@simiii123, do you have duplicated pairs of mappings? You can check by running any(duplicated(mappings$target_id)). This should be FALSE; if it's TRUE, then you have the same problem.

@pimentel, how do you think we should handle this? I would think we should throw an informative error and ask users to make their t2g data frame have just one entry per target_id, because I get the sense that duplicate entries in the t2g also lead to unexpected behavior when producing the results table.
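Something along these lines early in sleuth_prep is what I have in mind (just a sketch, not the actual implementation):

## sketch of a possible early check in sleuth_prep
if (!is.null(target_mapping) && any(duplicated(target_mapping$target_id))) {
  stop("'target_mapping' has more than one row per target_id; ",
       "please collapse it to one entry per target before running sleuth_prep.")
}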

warrenmcg commented 7 years ago

@benjsmith, your issues about the plotting should be in a separate issue. If you could open a separate issue for that, that would be helpful.

With regard to why you see duplicate columns: I made a mistake in the above code. The point of that code was to help y'all reproduce the error by exposing the internals, and I missed a step that selects only the needed columns from scale_factor to avoid the duplicates. I've updated it to fix that logic error on my part.

Here's the new line in context:

scale_factor <- scale_factor[, eff_len := NULL]
scale_factor <- data.table::as.data.table(dplyr::select(scale_factor,  ## NEW LINE PART 1
            target_id, sample, scale_factor)) ## NEW LINE PART 2
new_result <- merge(norm_data, scale_factor, by = c("target_id", "sample"), all.x = TRUE)
new_result <- merge(new_result, mappings, by = "target_id", all.x = TRUE)

Could you rerun the updated code and see if you reproduce the error now? If you do, first check whether the ens_gene column has any NA values, which will cause problems for now. If you don't, could you send your mappings, raw_data, and norm_data as compressed files, using the above instructions? Thanks!

pimentel commented 6 years ago

We think this is fixed in the devel branch. Let us know if you are willing to give it a shot.

Here's how to install:

devtools::install_github('pachterlab/sleuth', ref = 'devel')