nsheff / LOLA

Locus Overlap Analysis: Enrichment of Genomic Ranges
http://code.databio.org/LOLA

Add in missing cellType/antibody entries for encodeTFBSmm10 #25

Closed oneillkza closed 6 years ago

oneillkza commented 6 years ago

Hi there

I noticed that a handful of the entries for encodeTFBSmm10 have NA in the cellType and antibody annotations. Fortunately, this data seems to be available here, and I've enclosed a code snippet that adds the missing entries to lola.db$regionAnno to make things easier for you. (Right now I'm using the below code as a workaround for myself.)

Thanks for making and maintaining a very useful tool!

library(LOLA)

lola.db <- loadRegionDB('LOLA/LOLACore/mm10') # change to wherever LOLACore is downloaded

# Fix missing cell/antibody entries in LOLA

# ENCODE metadata table from the Juicebox repository
encode.meta <- read.table('https://raw.githubusercontent.com/theaidenlab/juicebox/master/src/juicebox/encode/encode.mm9.txt',
                          sep='\t',
                          header=TRUE,
                          stringsAsFactors=FALSE)

# Reduce each path to its bare file name, without the .gz suffix
encode.meta$path <- sub('.*\\/', '', encode.meta$path)
encode.meta$path <- sub('\\.gz$', '', encode.meta$path)

# Index the metadata by file name so rows can be looked up directly
rownames(encode.meta) <- encode.meta$path

# Rows of the annotation that belong to encodeTFBSmm10 and lack a cell type
lola.missing <- which(lola.db$regionAnno$collection == 'encodeTFBSmm10' & is.na(lola.db$regionAnno$cellType))

missing.files <- lola.db$regionAnno$filename[lola.missing]

# Copy cell type and antibody across from the ENCODE metadata
lola.db$regionAnno[lola.missing, c('cellType', 'antibody')] <-
    encode.meta[missing.files, c('cell', 'antibody')]

oneillkza commented 6 years ago

Oh interesting -- digging a little further, it seems like all of the entries with NAs for cellType and antibody are "RepPeaks" type, meaning they are individual replicates. However, their merged/consensus data ("Peaks") are also present in LOLACore.

For most analyses, one probably wouldn't want to be analysing both the individual replicates and their merged data as though they were independent. It might actually be better to drop these from LOLA?
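As a rough illustration of dropping the replicate entries on the user side, here is a minimal sketch on toy data. It assumes, as observed above, that an NA cellType within encodeTFBSmm10 marks a replicate set. The data frame and list are invented stand-ins for lola.db$regionAnno and lola.db$regionGRL, and all file names are hypothetical.

```r
# Toy stand-in for lola.db$regionAnno; rows and file names are invented for illustration.
region.anno <- data.frame(
  filename   = c("tf1_peaks.bed", "tf1_rep1.bed", "tf1_rep2.bed", "other.bed"),
  collection = c("encodeTFBSmm10", "encodeTFBSmm10", "encodeTFBSmm10", "otherCollection"),
  cellType   = c("CH12", NA, NA, NA),
  antibody   = c("CTCF", NA, NA, "GATA1"),
  stringsAsFactors = FALSE
)

# Toy stand-in for lola.db$regionGRL: one element per annotation row.
region.sets <- as.list(seq_len(nrow(region.anno)))

# Assumption from this thread: within encodeTFBSmm10, an NA cellType marks a replicate set.
is.replicate <- region.anno$collection == "encodeTFBSmm10" & is.na(region.anno$cellType)

# Subset the annotation and the region sets with the same index so they stay aligned.
keep <- !is.replicate
region.anno.filtered <- region.anno[keep, ]
region.sets.filtered <- region.sets[keep]

print(region.anno.filtered$filename)
```

The row from the other collection is kept even though its cellType is NA, because the filter is restricted to encodeTFBSmm10.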

nsheff commented 6 years ago

Probably right. But would you rather keep the merged "Peaks" or the individual replicates (the "RepPeaks")?

oneillkza commented 6 years ago

I'd rather keep the consensus than the replicates (i.e. the Peaks rather than the RepPeaks). I've been noticing quite a lot of deviation between replicates, which is presumably why they ran replicates in the first place.

nsheff commented 6 years ago

Thanks for reporting this @oneillkza -- I found 45 extra files in there that shouldn't have been there. They had already been excluded from the annotation, but because they were left in the folder, they were still being read (a feature of LOLA, really...). Anyway, I've taken them out now and will update the public core databases soon. Thanks!
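For anyone who wants to check a local copy for the same problem, the sketch below compares the files in a collection folder against the file names listed in its annotation. The folder layout (a regions/ subfolder next to a tab-separated index.txt with a filename column) is an assumption here, and the toy collection is built in a temporary directory so the example runs on its own.

```r
# Build a throwaway collection folder; the layout (regions/ plus index.txt) is an assumption.
collection.dir <- file.path(tempdir(), "toyCollection")
regions.dir <- file.path(collection.dir, "regions")
dir.create(regions.dir, recursive = TRUE, showWarnings = FALSE)

# Three region files on disk, but only two of them are listed in the index.
invisible(file.create(file.path(regions.dir, c("a.bed", "b.bed", "c.bed"))))
index <- data.frame(filename = c("a.bed", "b.bed"),
                    cellType = c("CH12", "MEL"),
                    stringsAsFactors = FALSE)
write.table(index, file.path(collection.dir, "index.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the annotation back and list what is actually in the folder.
annotated <- read.table(file.path(collection.dir, "index.txt"),
                        sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE)$filename
on.disk <- list.files(regions.dir)

# Files present in the folder but absent from the annotation.
unannotated <- setdiff(on.disk, annotated)
print(unannotated)
```

Pointing collection.dir at a real collection instead of the toy one would list any region files that the annotation does not mention.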

nsheff commented 6 years ago

New version is now deployed here: http://cloud.databio.org/regiondb/

oneillkza commented 6 years ago

Awesome! Thanks!