nsheff / LOLA

Locus Overlap Analysis: Enrichment of Genomic Ranges
http://code.databio.org/LOLA

Negative c entry in table. Bug with userSetsLength; this should not happen. #11

Open enricoferrero opened 8 years ago

enricoferrero commented 8 years ago

I created a custom LOLA database using the instructions in the README.md file. The database loads fine and is cached without problems.

However, when I run runLOLA() I get this warning message:

2: In runLOLA(userSets, userUniverse, lolaCustom, cores = detectCores()) :
  Negative c entry in table. Bug with userSetsLength; this should not happen.

and indeed quite a few of my c values are negative. The data.table returned only has the following columns: userSet, dbSet, support, b, c, and d.

How can I fix this?

The same userSets and userUniverse work perfectly fine with LOLACore and LOLAExt.

This is how I create my userUniverse:

sheffield_dnase <- unlist(lolaCore$regionGRL[which(lolaCore$regionAnno$collection == "sheffield_dnase")])
userUniverse <- sort(disjoin(c(sheffield_dnase, unlist(userSets))))
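A quick way to check whether user regions span multiple universe regions (the many-to-many situation discussed below) is to count overlaps per user region. This is a sketch with made-up toy coordinates, not the thread's actual data:

```r
library(GenomicRanges)

# Toy illustration (made-up coordinates): one wide user region vs a
# universe of three disjoint pieces.
userRegions <- GRanges("chr1", IRanges(start = 100, end = 400))
universe <- disjoin(GRanges("chr1",
  IRanges(start = c(100, 200, 300), end = c(199, 299, 400))))

# Each user region overlapping more than one universe region signals the
# many-to-many relationship that can produce negative c entries.
countOverlaps(userRegions, universe)  # 3: one region, three universe hits
```

On real data, running the same check over unlist(userSets) against userUniverse and looking for counts greater than 1 identifies the problematic regions.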

This is my userSetsLength:


   first   second    third   fourth    fifth    sixth  seventh
     518        1      483        3      280        4      245
  eighth    ninth    tenth eleventh
      51     1450       60      812

Thanks for your help! Enrico

nsheff commented 8 years ago

If your userSets have regions that overlap multiple universe regions, this can happen. So, you haven't defined your userSets in terms of your universe. One giant region in the userSet could cover 15 universe regions, scoring 15 overlaps, but only registered as a single entity in the userSet, which can mess up the contingency table.
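The arithmetic behind the warning can be sketched in base R. This is illustrative only, not LOLA's actual implementation, but it shows how a many-to-many overlap pushes c below zero:

```r
# Illustrative arithmetic only -- not LOLA's actual code.
# Suppose a user set is recorded as 10 regions (userSetsLength = 10),
# but one giant region covers 15 disjoint universe regions, so overlaps
# ("support") get counted per universe region:
userSetsLength <- 10
support <- 15

# c is meant to be "user-set regions NOT overlapping the database set",
# so it is derived from userSetsLength minus the overlap count:
c_entry <- userSetsLength - support
c_entry  # -5: negative, hence the warning
```

Once the user sets are defined in terms of the universe, support can never exceed userSetsLength and c stays non-negative.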

Try setting redefineUserSets=TRUE in runLOLA. Does it solve it?

enricoferrero commented 8 years ago

No, unfortunately it doesn't. I get the same warning and output with:

res <- runLOLA(userSets, userUniverse, lolaCustom, cores=detectCores(), redefineUserSets=TRUE)

The regions in my userSets are all FIMO scans of motifs and therefore quite short in length (between 8 and 31 bp in fact).

I don't really understand what could possibly be wrong with this data. Any idea?

Thanks,

nsheff commented 8 years ago

Your userSets should already be disjoint, or else you would get an error. Removing negative c entries isn't the best because the fact that you have negative c entries is warning you that something is wrong with the way the test is constructed; eliminating these doesn't make the other tests accurate.

Ok, another idea. Try running reduce instead of disjoin on your database. I think maybe you are getting lots of little overlapping regions in your database because of the way you're combining potentially overlapping stuff and then disjoining.
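The difference between the two operations can be seen on a small made-up example: disjoin() splits overlapping regions into more, smaller non-overlapping pieces, while reduce() merges them into fewer, larger ones:

```r
library(GenomicRanges)

# Made-up overlapping regions, e.g. motif hits from several collections:
gr <- GRanges("chr1",
  IRanges(start = c(100, 110, 150), end = c(140, 160, 200)))

# disjoin() splits at every start/end boundary -> MORE regions:
length(disjoin(gr))  # 5 non-overlapping pieces
# reduce() merges overlapping regions -> FEWER regions:
length(reduce(gr))   # 1 merged region
```

So a database built by disjoining many overlapping motif scans ends up with lots of little fragments, whereas reduce() collapses them.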

enricoferrero commented 8 years ago

OK, I ran some tests and stumbled upon a solution (that I don't understand).

The custom LOLA database I built initially contains three collections: JASPAR, Taipale and UniProbe (see here: http://www.uwencode.org/proj/CATO/).

Running runLOLA() with the whole database gives me the warning above. However, rather unexpectedly, if I create three databases with one collection each, runLOLA() works flawlessly!

Any idea why that would happen? Maybe it has to do with the size of the object? Each of the collections contains several hundred files.

enricoferrero commented 8 years ago

Side question:

is:

sheffield_dnase <- unlist(lolaCore$regionGRL[which(lolaCore$regionAnno$collection == "sheffield_dnase")])
userUniverse <- disjoin(c(sheffield_dnase, unlist(userSets)))
res <- runLOLA(userSets, userUniverse, regionDB, cores=detectCores(), redefineUserSets=FALSE)

equivalent to:

sheffield_dnase <- unlist(lolaCore$regionGRL[which(lolaCore$regionAnno$collection == "sheffield_dnase")])
userUniverse <- disjoin(sheffield_dnase)
res <- runLOLA(userSets, userUniverse, regionDB, cores=detectCores(), redefineUserSets=TRUE)

Why/Why not?

The first approach (i.e.: include your own regions in the universe) makes more sense to me but maybe I misunderstood what redefineUserSets=TRUE does?

nsheff commented 8 years ago

Do you mind putting the database somewhere where I can download it? Or email it to me (contact info).

nsheff commented 8 years ago

And for your side question: No, these are very different.

Option 1 just puts your user sets into the universe (and then disjoins them so there are no overlaps, separating regions out). Option 2: your user sets will not be in the universe; then, when you redefine the user sets, any regions that were not in the universe are dropped from your user sets.

Try combining your options. Put your user sets in the universe, and then redefine the userSets.

You can check out ?redefineUserSets:

This function will take the user sets, overlap them with the universe, and redefine the user sets as the set of regions in the user universe that overlap at least one region in the user sets. This makes for a more appropriate statistical enrichment comparison, as the user sets are then exactly the same regions found in the universe. Otherwise, you can get some weird artifacts from the many-to-many relationship between user set regions and universe regions.
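Combining the two options from the side question might look like this. It is a sketch reusing the objects defined earlier in the thread (sheffield_dnase, userSets, regionDB), with redefineUserSets() called explicitly as a function rather than via the runLOLA argument:

```r
library(LOLA)
library(GenomicRanges)
library(parallel)  # for detectCores()

# Build a universe that contains the user sets...
userUniverse <- disjoin(c(sheffield_dnase, unlist(userSets)))

# ...then redefine the user sets in terms of that universe, so every
# user region is exactly a universe region:
userSetsRedefined <- redefineUserSets(userSets, userUniverse)

# No need to redefine again inside runLOLA:
res <- runLOLA(userSetsRedefined, userUniverse, regionDB,
               cores = detectCores(), redefineUserSets = FALSE)
```

This way the universe covers the user sets and the user sets are exact subsets of the universe, so the many-to-many artifact cannot arise.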

enricoferrero commented 8 years ago

Sorry, I can't, as it also contains some unpublished data that I don't own. So, would you recommend always using redefineUserSets=TRUE then? Also, can you please confirm that including userSets in userUniverse is a sensible thing to do?

nsheff commented 8 years ago

Sorry, I can't, as it also contains some unpublished data that I don't own.

Just a reproducible minimal example would be fine, if you can concoct one.

So, would you recommend always using redefineUserSets=TRUE then?

Ideally you don't need to redefine user sets because you have already defined them in the same way that you defined your universe. If you defined them in almost the same way, you probably don't have to worry about it. If they're very different, then I would at least see how things change when you redefine them. So, it's like a helper function in case you are constructing a universe on the fly. So the short answer is: it probably doesn't hurt, but it may not be necessary and it adds compute time. Look also at ?buildRestrictedUniverse.
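For the restricted-universe route mentioned above, a minimal sketch, assuming userSets is the GRangesList from earlier in the thread (buildRestrictedUniverse() is exported by LOLA):

```r
library(LOLA)

# A "restricted universe" built only from the user sets themselves --
# appropriate when comparing the user sets against each other, since by
# construction every user region is contained in this universe:
restrictedUniverse <- buildRestrictedUniverse(userSets)

res <- runLOLA(userSets, restrictedUniverse, regionDB, cores = 1)
```

Whether a restricted universe is the right background for a given enrichment question still depends on what "could have been included" means for your data.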

Also, can you please confirm that including userSets in userUniverse is a sensible thing to do?

Well, any userSets that are not in the universe must be ignored... because sort of by definition, a userSet must be in the universe (that's what the universe is: the regions that could have been included in userSets). Again, ideally, your universe is defined in a way that already includes all your userSets. If not, you can sort of hack it by just adding the userSets into the universe (as you do above). I think this is a reasonable thing to do because sometimes it's difficult to come up with the exact appropriate universe for your test. But if there are many userSets not in the universe you want to use, you're probably using the wrong universe, and just adding the userSets may not be what you want. So, in some cases it's sensible, but you should think a bit about the universe you are using and what it means for your contingency table.