hyginn / Ontoscope

BCB420 class project to identify "identity defining" transcription factors for cell types.

PREPMAR #10

Open thejmazz opened 8 years ago

thejmazz commented 8 years ago

Prepare MARA, use ISMARA server

ghost commented 8 years ago

I'm going to need to import the bed graphs from each cell line in the FANTOM data repository. Will need some help with writing code. Perhaps the authors of fantom_import could mod their code to also pull out the bed files? Thoughts?

thejmazz commented 8 years ago

@biodim @ontoscoper

biodim commented 8 years ago

I tried to find a .bed file for a sample but couldn't. Let me know if you can find the .bed files and we'll quickly write a script to pull them out for you.

What exactly are you looking for in a .bed file? Is it the "Genomic Coordinates"? Take a look at a single "fantomResults" dataset (download the module, transfer all required files and source the module):

source(file = "fantom_main.R")

fantomDirect("55")

View(fantomResults[1])

Take a look at the "X00Annotation" column; is anything there useful for you? As far as I know, this is the start of a .bed file (chromosome, start, end and strand sense). Work is currently under way to separate that one column into 4. Do you also need the peaks (the p2, p1, p...)?

If you want, we can create a function, create???(), that will:

1) create a new data frame
2) split the annotation into the first 4 columns (and the peaks if you need them)
3) add the counts for each sample as the following columns (do you need the counts?)

Kind of what fantomSummarize() currently does. Let me know what you want this named as well, as I don't think createBED() would work.
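The splitting step could look roughly like the sketch below. It assumes the annotation strings have the form "chr10:100013403..100013414,-", which is not confirmed in this thread, and splitAnnotation is a hypothetical name, not an existing fantom_import function.

```r
# Hypothetical helper (not part of fantom_import): split annotation strings
# into BED-like columns.
# ASSUMPTION: annotations look like "chr10:100013403..100013414,-".
splitAnnotation <- function(annotation) {
  pattern <- "^(chr[^:]+):([0-9]+)\\.\\.([0-9]+),([+-])$"
  parts <- regmatches(annotation, regexec(pattern, annotation))
  # Non-matching strings yield character(0); replace them with a row of NAs.
  parts <- lapply(parts, function(p) {
    if (length(p) == 5) p[-1] else rep(NA_character_, 4)
  })
  m <- do.call(rbind, parts)
  data.frame(
    chr    = m[, 1],
    start  = as.integer(m[, 2]),
    end    = as.integer(m[, 3]),
    strand = m[, 4],
    stringsAsFactors = FALSE
  )
}

# Made-up example annotations, for illustration only.
bed <- splitAnnotation(c("chr10:100013403..100013414,-",
                         "chr1:564571..564600,+"))
```

Count columns for each sample could then be attached with cbind(), if they are needed.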

biodim commented 8 years ago

Also forgot to mention, we can probably pull any data you want from Fantom's "SStar Resource Browser":

Take a look at a sample page: http://fantom.gsc.riken.jp/5/sstar/FF:10017-101C8

Let me know if there is anything you want from there. IMO the JASPAR motifs could be useful. It will take a bit of time to code because it is a slightly different approach (web scraping vs. direct queries).

thejmazz commented 8 years ago

I wrote a little web scraper for SStar. "Hard-coded" based on looking for <td class="Jaspar-motif and something-else possibly"> (CSS class names are separated by spaces). Looks like most other "data IDs" on the page live in <td>s too. Could be easily expanded for anything else on the page, and made into a simple CLI tool. Or even a little API that we could then do our own direct queries on.

biodim commented 8 years ago

That's pretty neat! Apparently R has an "Embedded JavaScript Engine" (through the V8 package). Maybe I can add a small interface to use your script in the fantom_import module. Do you think you can modify your script to also return the p-values (they are under the p-value class), so the output would be like url1, pvalue1; url2, pvalue2? Then maybe we can sort through these in R.

thejmazz commented 8 years ago

Wasn't able to get the script running through the V8 CRAN package (weird undefined is not an object error), but it works with system + jsonlite. Gives a data frame of motifs and their p-values. See getJaspar.

biodim commented 8 years ago

@cL9821

I've implemented a basic getBED function that retrieves the .bed files based on FANTOM ontologies. Let me know if you want "keyword based" retrieval.

ghost commented 8 years ago

@biodim

getBED keeps aborting with this error message:

Error in download.file(as.character(BED_DB[dl_index, 1]), paste0(fixed_ID, : 'url' must be a length-one character vector

3 stop("'url' must be a length-one character vector")
2 download.file(as.character(BED_DB[dl_index, 1]), paste0(fixed_ID, ".bed.gz")) at getBED.R#32
1 getBED(IDs)

For now I just modify my IDs to exclude the files I've already downloaded and re-run the function. But that means I will have to stay by the computer while I extract ~1000 files. Any ideas what's going on?

biodim commented 8 years ago

Can you give me an example of the input you use (what the argument for getBED is)? And if you are using a character vector, what is the output of str(your_character)?

ghost commented 8 years ago

For the input to getBED, I took the FF_ontology values from the spreadsheet on our phylify module. I have a vector of length 1012, of class character. The first few are as follows: "FF:10000-101A1" "FF:10001-101A5" "FF:10007-101B4"

For some reason only a few IDs produce the error message. So far there are 5. The most recent one is "FF:10376-105G7".

Update: I'm up to 310 files of 1012
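Until the cause is found, a batch like this could probably run unattended by catching the error per ID and collecting the failures. This is only a sketch: runBatch and fakeFetch are hypothetical names, and fakeFetch stands in for getBED so the example runs on its own.

```r
# Hypothetical wrapper: call a downloader once per ID and keep going on errors.
# `fetch` stands in for getBED; it is passed in so the sketch is self-contained.
runBatch <- function(ids, fetch) {
  failed <- character(0)
  for (id in ids) {
    ok <- tryCatch({
      fetch(id)
      TRUE
    }, error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      FALSE
    })
    if (!ok) failed <- c(failed, id)
  }
  failed  # IDs to inspect or retry later
}

# Stand-in downloader that fails for one ID, mimicking the error reported here.
fakeFetch <- function(id) {
  if (id == "FF:10376-105G7") {
    stop("'url' must be a length-one character vector")
  }
  invisible(NULL)
}

failed <- runBatch(c("FF:10000-101A1", "FF:10376-105G7", "FF:10001-101A5"),
                   fakeFetch)
```

With the real function, the call would be something like runBatch(IDs, getBED), assuming getBED accepts a single ID.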

biodim commented 8 years ago

It's because one FANTOM ID, i.e. FF:10376-105G7, has two entries:

"http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.tissue.hCAGE/medial%2520temporal%2520gyrus%252c%2520adult%252c%2520donor10258%252c%2520tech_rep2.CNhs14552.10376-105G7.hg19.ctss.bed.gz"

"http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.tissue.hCAGE/medial%2520temporal%2520gyrus%252c%2520adult%252c%2520donor10258%252c%2520tech_rep1.CNhs14229.10376-105G7.hg19.ctss.bed.gz"
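A quick way to list every ID with more than one entry is sketched below. The real layout of BED_DB is not shown in this thread, so the toy table, its `id` column and the example.org URLs are assumptions. Only "URL in column 1" comes from the traceback.

```r
# Toy stand-in for BED_DB.
# ASSUMPTION: column 1 holds the URL (as in the traceback) and a column
# named `id` holds the FANTOM ID; the URLs are placeholders.
BED_DB <- data.frame(
  url = c("http://example.org/a.tech_rep1.10376-105G7.bed.gz",
          "http://example.org/a.tech_rep2.10376-105G7.bed.gz",
          "http://example.org/b.10000-101A1.bed.gz"),
  id  = c("FF:10376-105G7", "FF:10376-105G7", "FF:10000-101A1"),
  stringsAsFactors = FALSE
)

# IDs that occur more than once in the table.
counts <- table(BED_DB$id)
dupIDs <- names(counts[counts > 1])

# For such an ID the row index has length > 1, so BED_DB[dl_index, 1] is a
# vector of several URLs, which download.file() rejects.
dl_index <- which(BED_DB$id == "FF:10376-105G7")
urls <- BED_DB[dl_index, 1]
```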

I made a temporary fix (file 1.0.1) to skip the duplicate.

Later I will make a 1.1 to grab the duplicate. You won't have to re-download everything, since I'll also add a file checker: if a file already exists, it will be skipped.
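Roughly, the idea would be one download per URL plus an existence check. The sketch below is not the actual getBED code: downloadAll is a hypothetical name, the <ID>_<n>.bed.gz naming scheme is invented for illustration, and the downloader is passed in so the example can run with a stub instead of hitting the network.

```r
# Hypothetical sketch, not the actual getBED implementation.
# ASSUMPTIONS: `urls` holds every URL matched for one FANTOM ID; the file
# naming scheme (<ID>_<n>.bed.gz) is made up for this example.
downloadAll <- function(id, urls, destDir = ".",
                        download = utils::download.file) {
  fixedID <- gsub(":", "_", id, fixed = TRUE)  # ":" is awkward in file names
  dest <- file.path(destDir,
                    sprintf("%s_%d.bed.gz", fixedID, seq_along(urls)))
  for (i in seq_along(urls)) {
    if (file.exists(dest[i])) next           # file checker: skip existing files
    download(urls[i], dest[i], mode = "wb")  # exactly one URL per call
  }
  dest
}

# Stub downloader that records the URL and creates an empty file.
fetched <- character(0)
stub <- function(url, destfile, ...) {
  fetched <<- c(fetched, url)
  file.create(destfile)
}

tmp <- tempfile("bed")
dir.create(tmp)
out <- downloadAll("FF:10376-105G7",
                   c("http://example.org/rep1.bed.gz",
                     "http://example.org/rep2.bed.gz"),
                   destDir = tmp, download = stub)
```

Calling downloadAll a second time with the same arguments would skip both files, because they already exist in the destination directory.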

biodim commented 8 years ago

I've fixed this bug in getBED 1.2.0; please re-download it and see if it works. I haven't implemented a name checker, because I changed the naming convention (to take care of the duplicates). Here are the IDs that caused errors; make sure to re-download them (1.0.1 would skip these files, but 1.2.0 will grab them):

dup_ids <- c("FF:10063-101H9","FF:10071-101I8","FF:10370-105G1",
             "FF:10372-105G3","FF:10376-105G7","FF:10442-106F1",
             "FF:10444-106F3","FF:11227-116C3","FF:12632-134F4",
             "FF:12828-137A2")

getBED(dup_ids)

Thanks for debugging