Open thejmazz opened 8 years ago
I'm going to need to import the bed graphs from each cell line in the FANTOM data repository. Will need some help with writing code. Perhaps the authors of fantom_import could mod their code to also pull out the bed files? Thoughts?
@biodim @ontoscoper
I tried finding a .bed file for a sample but couldn't. Let me know if you can find the .bed files and we'll quickly write a script to pull them out for you.
What exactly are you looking for in a .bed file? Is it the "Genomic Coordinates"? Take a look at a single "fantomResults" dataset (download the module, transfer all required files, and source the module):
source(file = "fantom_main.R")
fantomDirect("55")
View(fantomResults[1])
Take a look at "X00Annotation"; is anything there useful for you? As far as I know this is the start of a .bed file (chromosome, start, end, and strand). Work is currently under way to separate that one column into four. Do you also need the peaks (the p2, p1, p...)?
If you want, we can create a function, create???(), that will:
1) create a new data frame
2) split the annotation into the first 4 columns (and the peaks if you need them)
3) put the counts for each sample in the next columns (do you need the counts?)
Kind of what fantomSummarize() currently does. Let me know what you want this named as well, as I don't think createBED() would work.
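A minimal sketch of step 2 above, in case it helps the discussion. It assumes the annotation strings look like "chr10:100013403..100013414,-" (chromosome, start..end, strand); that format and the name splitAnnotation() are assumptions for illustration, not part of fantom_import.

```r
# Hypothetical helper (not part of fantom_import): split an annotation
# column into BED-style columns.
# ASSUMPTION: entries look like "chr10:100013403..100013414,-".
splitAnnotation <- function(annotation) {
  annotation <- as.character(annotation)
  pattern <- "^(chr[^:]+):([0-9]+)\\.\\.([0-9]+),([+-])$"
  ok <- grepl(pattern, annotation)

  # Return capture group `group` where the pattern matched, NA elsewhere.
  pick <- function(group) {
    ifelse(ok, sub(pattern, paste0("\\", group), annotation), NA_character_)
  }

  data.frame(
    chrom  = pick(1),
    start  = as.integer(pick(2)),
    end    = as.integer(pick(3)),
    strand = pick(4),
    stringsAsFactors = FALSE
  )
}

# Example with made-up rows; entries that do not match become NA.
example <- splitAnnotation(c("chr10:100013403..100013414,-", "not a range"))
print(example)
```

Per-sample counts could then be attached with cbind() if they are needed.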
Also forgot to mention, we can probably pull any data you want from Fantom's "SStar Resource Browser":
Take a look at a sample page: http://fantom.gsc.riken.jp/5/sstar/FF:10017-101C8
Let me know if there is anything you want from there; IMO the JASPAR motifs could be useful. It will take a bit of time to code because it is a slightly different approach (web scraping vs. direct queries).
I wrote a little web scraper for SStar. "Hard-coded" based on looking for <td class="Jaspar-motif and something-else possibly"> (CSS class names are separated by spaces). Looks like most other "data IDs" on the page live in <td>s too. Could be easily expanded for anything else on the page, and made into a simple CLI tool. Or even a little API that we could then do our own direct queries on.
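To illustrate the class-token matching described above, here is a rough base R sketch (not the JavaScript of the actual scraper). The function name tdByClass() and the sample HTML are made up for illustration, and the real class names on the SStar pages may differ. Regex-based HTML handling is fragile, so a proper HTML parser would be the safer choice for real pages.

```r
# Hypothetical helper: return the text of every <td> whose class attribute
# contains `class_name` as a whole, space-separated token.
tdByClass <- function(html, class_name) {
  # (?s) lets "." span newlines; group 1 = class attribute, group 2 = cell body
  cell_pattern <- "(?s)<td[^>]*class=\"([^\"]*)\"[^>]*>(.*?)</td>"
  cells <- regmatches(html, gregexpr(cell_pattern, html, perl = TRUE))[[1]]
  classes <- sub(cell_pattern, "\\1", cells, perl = TRUE)
  bodies  <- sub(cell_pattern, "\\2", cells, perl = TRUE)

  # CSS class names are separated by whitespace, so compare token by token.
  has_class <- vapply(
    strsplit(classes, "[[:space:]]+"),
    function(tokens) class_name %in% tokens,
    logical(1)
  )

  # Drop any nested tags and surrounding whitespace from the matching cells.
  trimws(gsub("<[^>]+>", "", bodies[has_class]))
}

# Made-up snippet; class names and cell contents are placeholders.
sample_html <- paste0(
  "<tr><td class=\"Jaspar-motif other\"><a href=\"#\">motif_A</a></td>",
  "<td class=\"p-value\">0.01</td></tr>"
)
print(tdByClass(sample_html, "Jaspar-motif"))
print(tdByClass(sample_html, "p-value"))
```

Matching on whole tokens means a cell with class "Jaspar-motif other" is found when asking for "Jaspar-motif", but not when asking for "Jaspar".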
That's pretty neat! Apparently R has an "Embedded JavaScript Engine" (through the V8 package). Maybe I can add a small interface to use your script in the fantom_import module. Do you think you can modify your script to also return the p-values (they are under the p-value class), so the output would be like url1, pvalue1; url2, pvalue2? Then maybe we can sort through these in R.
Wasn't able to get the script running through the V8 CRAN package (weird "undefined is not an object" error), but it works with system + jsonlite. Gives a data frame of motifs and their p-values. See getJaspar.
@cL9821
Basic getBED function implemented; it retrieves the .bed files based on FANTOM ontologies. Let me know if you want "keyword based" retrieval.
@biodim
getBED keeps aborting with this error message:
Error in download.file(as.character(BED_DB[dl_index, 1]), paste0(fixed_ID, : 'url' must be a length-one character vector
3 stop("'url' must be a length-one character vector")
2 download.file(as.character(BED_DB[dl_index, 1]), paste0(fixed_ID, ".bed.gz")) at getBED.R#32
1 getBED(IDs)
For now I just modify my IDs to exclude the files I've already downloaded and re-run the function. But that means I will have to stay by the computer while I extract ~1000 files. Any ideas what's going on?
Can you give me an example of the input you use (the argument you pass to getBED)? And if you are using a character vector, what is the output of str(your_character)?
For the input to getBED, I took the FF_ontology values from the spreadsheet on our phylify module. I have a character vector of length 1012. The first few are as follows: "FF:10000-101A1" "FF:10001-101A5" "FF:10007-101B4"
For some reason only a few give error messages. So far there are 5 where I get the error. The most recent one is for "FF:10376-105G7"
Update: I'm up to 310 files of 1012
It's because a FANTOM ID, e.g. FF:10376-105G7, can have two entries, so the lookup returns two URLs and download.file rejects them.
I made a temp fix (file 1.0.1) to skip the duplicate.
I will later make a 1.1 to grab the duplicate. You won't have to re-download everything, since I'll also add a file checker: if a file already exists, it will be skipped.
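For reference, a rough sketch of how duplicate entries and a file checker could be handled together. This is not the getBED code: bed_db, dupIds(), fetchBeds(), the example.org URLs, and the file naming scheme are all assumptions made for illustration. The download function is passed in as an argument so the loop can be tried without network access.

```r
# Hypothetical stand-in for BED_DB: one row per file, so an ID may repeat.
# URLs are placeholders, not real FANTOM locations.
bed_db <- data.frame(
  url = c("http://example.org/a.bed.gz",
          "http://example.org/b1.bed.gz",
          "http://example.org/b2.bed.gz"),
  id  = c("FF:10000-101A1", "FF:10376-105G7", "FF:10376-105G7"),
  stringsAsFactors = FALSE
)

# IDs that map to more than one file (these break a length-one 'url').
dupIds <- function(db) unique(db$id[duplicated(db$id)])

# Download every file for each ID, one URL at a time, skipping files that
# already exist and recording failures instead of aborting the whole run.
fetchBeds <- function(ids, db, dest_dir, fetch = utils::download.file) {
  status <- character(0)
  for (id in ids) {
    urls <- db$url[db$id == id]
    if (length(urls) == 0) {
      status[id] <- "no entry"
      next
    }
    # ASSUMPTION: ":" is replaced to get a file-system-friendly name.
    fixed_id <- gsub(":", "_", id, fixed = TRUE)
    for (i in seq_along(urls)) {
      # Number the files only when an ID has several entries.
      suffix <- if (length(urls) > 1) paste0("_", i) else ""
      dest <- file.path(dest_dir, paste0(fixed_id, suffix, ".bed.gz"))
      key <- basename(dest)
      if (file.exists(dest)) {
        status[key] <- "skipped"
        next
      }
      status[key] <- tryCatch({
        fetch(urls[i], dest)
        "downloaded"
      }, error = function(e) paste("failed:", conditionMessage(e)))
    }
  }
  status
}

print(dupIds(bed_db))
```

Because failures are recorded rather than raised, a long run over ~1000 IDs can finish unattended and the returned status vector shows which files need another attempt.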
I've fixed this bug in getBED 1.2.0; please re-download it and see if it works. I haven't implemented a name checker, because I changed the naming convention (to take care of the duplicates). Here are the error IDs; make sure to re-download them (1.0.1 would skip these files, but 1.2.0 will grab them):
dup_ids <- c("FF:10063-101H9","FF:10071-101I8","FF:10370-105G1",
"FF:10372-105G3","FF:10376-105G7","FF:10442-106F1",
"FF:10444-106F3","FF:11227-116C3","FF:12632-134F4",
"FF:12828-137A2")
getBED(dup_ids)
Thanks for debugging
Prepare MARA, use ISMARA server