Background regions always report (incorrectly) that they don't encompass the input set

semenko commented 8 years ago

Ran into an interesting issue submitting a .bed with a background region.

GREAT encountered a user error:
The foreground set is not a subset of the background set. GREAT
requires each foreground element to be reproduced exactly within the
background set.

However, the foreground is definitely a subset of the background set, and when submitted manually online, GREAT completes successfully.

I can reproduce this with just the test code:

gr = circlize:::generateRandomBed(nr = 1000, nc = 0)
job = submitGreatJob(head(gr), bg=gr)

jokergoo commented 8 years ago

This happens because the files I use from GREAT do not provide the correspondence between regions and genes. To trace this information, I internally add a fourth column to the bed data frame, which is a combination of chr and positions. A row such as

chr1 100 200

becomes

chr1 100 200 chr1:100-200

and then I can retrieve this correspondence back from GREAT.

But the problem now is that if you provide self-defined background regions, GREAT gives you an error, because the background regions do not have the fourth column.

I will try to contact developers of GREAT to see whether it is possible to get rid of this or if they can directly add position information in the gene-region association file.

semenko commented 8 years ago

Ah, cool, thanks! I'll temporarily pass the chr1:N-NN bit in bed_bg as well.
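A minimal sketch of that workaround, assuming the mismatch comes only from the missing fourth column. The helper name add_region_id is hypothetical and not part of rGREAT; it builds the same chr:start-end label for both the foreground and the background so that their rows stay identical.

```r
# Hypothetical helper (not part of rGREAT): append a "chr:start-end" name
# column to a three-column BED-like data frame.
add_region_id <- function(bed) {
  bed <- bed[, 1:3]
  names(bed) <- c("chr", "start", "end")
  # sprintf with %d avoids labels such as "1e+05" for large coordinates.
  bed$id <- sprintf("%s:%d-%d", as.character(bed$chr),
                    as.integer(bed$start), as.integer(bed$end))
  bed
}

fg <- data.frame(chr = "chr1", start = 100, end = 200)
bg <- data.frame(chr = c("chr1", "chr1"),
                 start = c(100, 500),
                 end = c(200, 600))

fg4 <- add_region_id(fg)
bg4 <- add_region_id(bg)

# Every foreground label should now also appear in the background.
all(fg4$id %in% bg4$id)
```

Applying the same function to both sets is the point here: any difference in how the label is built would reintroduce the mismatch.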

jokergoo commented 8 years ago

Thanks for the patch! It works!

mshicuhk commented 8 years ago

Hi, guys. I've come across the same problem again. My background input is just a three-column data frame; let's call it "bg1". When I tried

job <-submitGreatJob(bg1[c(1,3,4,5,6),],bg=bg1,species="hg19",bgChoice="data",rule="basalPlusExt",max_tries=90,version="3.0.0")

The same error returns. However, weirdly, when I used the bed file generated by

bed = circlize::generateRandomBed(nr = 1000, nc = 0),

and then

job <- submitGreatJob(bed[c(1,3,4,5,6),],bg=bed,species="hg19",bgChoice="data",rule="basalPlusExt",max_tries=90,version="3.0.0"),

the job could be submitted successfully.

May I ask whether there are any more requirements for the background data, other than something like the following?

head(bg1)

   chr  start    end
1 chr1 713200 713400
2 chr1 906400 906600
3 chr1 907000 907200
4 chr1 914400 914600

mshicuhk commented 8 years ago

OK, I finally figured out the problem by myself. The "overlapping regions will be merged" step made my test file and my background file differ. I do not know whether this counts as a bug in the program, but I'm still wondering why this step is required.

jokergoo commented 8 years ago

The merging of the input regions is just to decrease their redundancy. To my understanding, keeping overlapping regions (e.g. 1000 identical regions left unmerged) would bias the functional enrichment.
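To make the merging step concrete, here is a base-R sketch of collapsing overlapping regions per chromosome. The helper merge_regions is hypothetical, and rGREAT's own implementation may differ; in particular, treating regions that merely touch as one region is an assumption of this sketch.

```r
# Hypothetical helper: merge overlapping (or touching) regions per chromosome.
merge_regions <- function(bed) {
  bed <- bed[, 1:3]
  names(bed) <- c("chr", "start", "end")
  bed$chr <- as.character(bed$chr)
  pieces <- lapply(split(bed, bed$chr), function(d) {
    d <- d[order(d$start, d$end), ]
    # A new group starts when a region begins after the furthest end seen so far.
    furthest_end <- cummax(d$end)
    new_group <- c(TRUE, d$start[-1] > furthest_end[-nrow(d)])
    group <- cumsum(new_group)
    data.frame(chr = d$chr[1],
               start = as.vector(tapply(d$start, group, min)),
               end = as.vector(tapply(d$end, group, max)))
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Three identical regions plus one overlapping region collapse into a single
# row; the distant region stays separate.
dup <- data.frame(chr = "chr1",
                  start = c(100, 100, 100, 150, 400),
                  end = c(200, 200, 200, 300, 500))
merged <- merge_regions(dup)
print(merged)
```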

Regarding the background setting, I think the requirement is that the input regions should be a subset of the background, which means that for every region in gr there must be a region in the background that completely covers it. I think I can check this in submitGreatJob() before submitting to the GREAT website.

mshicuhk commented 8 years ago

Thank you very much for the reply! Now I understand your concern about the overlapping regions, but I am not sure whether GREAT checks that "for every region in gr, there must be a region in background which completely covers it", because GREAT requires every region in gr to be reproduced exactly in the background. For example, I have chr1 100 200 in gr, but chr1 100 200; chr1 201 300 in the background, which becomes chr1 100 300 after merging. Then my chr1 100 200 cannot be found in the background and the error returns.
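The example above can be sketched as an exact-match check in base R. The helper region_key is hypothetical, and the merged background is written out by hand as described in the comment rather than computed.

```r
# Hypothetical helper: one "chr:start-end" string per row, used for exact matching.
region_key <- function(bed) {
  sprintf("%s:%d-%d", as.character(bed[[1]]),
          as.integer(bed[[2]]), as.integer(bed[[3]]))
}

fg <- data.frame(chr = "chr1", start = 100, end = 200)

# Background as supplied by the user.
bg_raw <- data.frame(chr = c("chr1", "chr1"),
                     start = c(100, 201),
                     end = c(200, 300))

# Background after merging, as described in the comment above.
bg_merged <- data.frame(chr = "chr1", start = 100, end = 300)

# Foreground regions with no identical row in each background.
missing_raw <- setdiff(region_key(fg), region_key(bg_raw))
missing_merged <- setdiff(region_key(fg), region_key(bg_merged))

length(missing_raw)     # no region is missing before merging
missing_merged          # the foreground region is missing after merging
```

The merged background still covers the foreground region, but coverage is not what the exact-match requirement tests.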

jokergoo commented 8 years ago

Yes, you are right! I misunderstood this point because I never use a background in my own analysis. Now I think I need to figure out a better way to deal with this kind of scenario. Because users' input can be of all kinds (valid or invalid), I want this package to pre-process the inputs before submitting them to the GREAT website, in order to make users' lives easier.

I think the worst scenario is, let's say, chr1 200 300; chr1 250 400 as gr and chr1 100 250; chr1 300 500; chr1 400 600 as bg, which means gr is not completely covered by bg and there are overlaps inside both gr and bg. I think maybe we should convert gr and bg to:

for gr:

chr1 200 250
chr1 300 400

for bg:

chr1 100 200
chr1 200 250
chr1 300 400
chr1 400 600

What do you think?
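One way to arrive at the conversion proposed above, sketched in base R for a single chromosome. All helper names are hypothetical and none of this is rGREAT code: both sets are merged, cut at every shared breakpoint, and only the pieces covered by the background are kept.

```r
# Merge overlapping or touching intervals given as start/end vectors.
merge_iv <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  group <- cumsum(c(TRUE, start[-1] > cummax(end)[-length(end)]))
  data.frame(start = as.vector(tapply(start, group, min)),
             end = as.vector(tapply(end, group, max)))
}

# TRUE for each piece [s, e] that lies inside one interval of the merged set iv.
inside <- function(s, e, iv) {
  vapply(seq_along(s),
         function(i) any(iv$start <= s[i] & e[i] <= iv$end),
         logical(1))
}

# Hypothetical helper: split gr and bg at shared breakpoints (one chromosome only).
split_at_breakpoints <- function(gr, bg) {
  gr_m <- merge_iv(gr$start, gr$end)
  bg_m <- merge_iv(bg$start, bg$end)
  cuts <- sort(unique(c(gr_m$start, gr_m$end, bg_m$start, bg_m$end)))
  s <- cuts[-length(cuts)]
  e <- cuts[-1]
  in_bg <- inside(s, e, bg_m)          # pieces covered by the background
  in_gr <- in_bg & inside(s, e, gr_m)  # pieces covered by both sets
  chr <- as.character(gr$chr[1])
  list(gr = data.frame(chr = rep(chr, sum(in_gr)),
                       start = s[in_gr], end = e[in_gr]),
       bg = data.frame(chr = rep(chr, sum(in_bg)),
                       start = s[in_bg], end = e[in_bg]))
}

gr <- data.frame(chr = "chr1", start = c(200, 250), end = c(300, 400))
bg <- data.frame(chr = "chr1", start = c(100, 300, 400), end = c(250, 500, 600))

res <- split_at_breakpoints(gr, bg)
print(res$gr)
print(res$bg)
```

With the inputs from the comment, this yields the two gr rows and four bg rows listed above. Note that the piece chr1 250 300 is dropped from gr without any message, which a real implementation would probably want to report.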

mshicuhk commented 8 years ago

Sorry for the late reply! The idea is good. However, the region chr1 250 300 in your example exists in the user's gr but not in bg. Thus, a warning should be given to remind the user of something like "the foreground set is not a subset of the background set", whereas your conversion makes the "wrong" input "correct". Moreover, although I haven't tested it, I think the number of bg regions affects the hypergeometric test used in GREAT. Therefore, when you perform the merging step, you are actually reducing the user's input regions. I haven't thought in detail about how these changes would affect the final results, but in my opinion the potential impacts should be made clear to the users.
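To illustrate why the number of background regions matters, here is a small sketch using base R's phyper. It assumes the test has the usual urn form of a hypergeometric test over regions, and all counts are made up for illustration; they are not taken from GREAT or from any real data set.

```r
# Upper-tail hypergeometric p-value: probability of seeing at least k_fg
# annotated regions in the foreground by chance.
#   n_bg: total background regions       k_bg: background regions with the term
#   n_fg: total foreground regions       k_fg: foreground regions with the term
hyper_p <- function(n_bg, k_bg, n_fg, k_fg) {
  phyper(k_fg - 1, k_bg, n_bg - k_bg, n_fg, lower.tail = FALSE)
}

# Illustrative, made-up counts: the same foreground tested against a
# background of 1000 regions, and against one reduced to 800 regions
# (for example by merging overlapping rows).
p_raw <- hyper_p(n_bg = 1000, k_bg = 100, n_fg = 50, k_fg = 12)
p_merged <- hyper_p(n_bg = 800, k_bg = 100, n_fg = 50, k_fg = 12)

c(raw = p_raw, merged = p_merged)
```

With everything else held fixed, shrinking the background raises the expected fraction of annotated regions, so the same foreground looks less enriched. Merging could also change the other counts, so the net effect on real data is not obvious from this sketch.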

jokergoo commented 8 years ago

Good! Thanks for your comments!

jokergoo / rGREAT
