hitz / RegulomeDB

Website for the Regulome Database
http://regulome.stanford.edu

Duplicated nucleotides or SNPs #1

Open euriehong opened 13 years ago

euriehong commented 13 years ago

Eurie: Can the results be deduplicated when the input contains duplicates? For example, entering the same input twice results in duplicate rows. Try

14 100705101 100705102 14 100705101 100705102

OR

rs78077282 rs78077282

Ben's response: I suppose so ... but I think we have to do it at the "output SNP" level rather than the input, because if 2 overlapping ranges are entered the input will look the same but some of the output will be duplicated. The ranges are first converted to lists of valid (common) SNPs and then looked up for their scores.

I didn't do this originally because I was concerned with speed/memory for very large input sets... but since those aren't functioning now anyway, I think it would be reasonable to track if a given SNP is entered and not look it up/report it another time.
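A minimal sketch of that "track what has been seen" idea, in Python purely for illustration (not the app's actual code). The `(chromosome, position)` key, `score_unique_snps`, and the `lookup_score` callback are all hypothetical names, assuming each output SNP can be reduced to a hashable key:

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical key for one output SNP: (chromosome, position).
Snp = Tuple[str, int]


def score_unique_snps(
    snps: Iterable[Snp],
    lookup_score: Callable[[Snp], str],
) -> List[Tuple[Snp, str]]:
    """Look up each distinct SNP once, keeping first-seen order."""
    seen = set()
    rows = []
    for snp in snps:
        if snp in seen:
            continue  # already looked up and reported; skip the repeat
        seen.add(snp)
        rows.append((snp, lookup_score(snp)))
    return rows


# Two identical pasted ranges expand to the same output SNP twice.
expanded = [("14", 100705101), ("14", 100705101)]
rows = score_unique_snps(expanded, lambda snp: "placeholder-score")
```

Because the check happens after ranges are expanded, overlapping ranges are handled the same way as literally repeated input.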

euriehong commented 13 years ago

Yes, was referring to the results page.

hitz commented 12 years ago

What if we do this for "paste" input but not file input? I don't think this will work with the file input (designed for large datasets). The app doesn't have any knowledge of prior input when a file is uploaded; it just writes to disk now. We might be able to do a "dedupe" when files are exported, since we have to do a sort and file conversion anyway.
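A rough sketch of the export-time option, again in Python for illustration only. It assumes the export step already yields rows in sorted order, so duplicates are adjacent and only the previous row has to be remembered; `dedupe_sorted` is a hypothetical helper, not part of the app:

```python
from typing import Iterable, Iterator


def dedupe_sorted(lines: Iterable[str]) -> Iterator[str]:
    """Drop repeated lines from already-sorted input.

    Sorting puts identical rows next to each other, so comparing with
    the previous row is enough and memory use stays constant.
    """
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


# Placeholder rows standing in for exported results.
exported = sorted(["rs78077282", "rs78077282"])
unique_rows = list(dedupe_sorted(exported))
```

This only removes duplicates that the sort makes adjacent, so it relies on the whole row (not just the SNP) being identical.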

hitz commented 12 years ago

I have been stalling on this but I think it's actually sort of critical. Uploading any kind of bed or gff with regions is dumb. I think a minimal hash will only add a few Mb to the RAM footprint.