hyginn / Ontoscope

BCB420 class project to identify "identity defining" transcription factors for cell types.

PHYLIFY #4

Open thejmazz opened 8 years ago

thejmazz commented 8 years ago

Notes from discussion with @eugeniabarkova

High-level overview of how GATHER #9 will use this module:

/**
 * @param cellLine The cell line to get a nice background for. Comes as a CL:xx or FF:xx ID?
 * @param bgLower  Lower bound on the background. Can be # of edges, weight
 * @param bgUpper  Upper bound on the background. Can be # of edges, weight
 * @returns <List> of cell IDs to feed FANTOM
 */
function getBackgroundList(cellLine, bgLower, bgUpper) {}

Some questions to investigate:

To produce a nice tree:

Potential prune strategies

if prefix(name) != "FF" and prefix(name) != "CL":
    don't include
    for each child:
        don't include
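
A rough base-R sketch of that prune. The edges data frame here is hypothetical: two character columns, parent and child, holding the is_a relations parsed from the OBO file.

keep_prefix <- function(id) grepl("^(FF|CL):", id)

prune_edges <- function(edges, roots) {
  keep <- character(0)
  visit <- function(node) {
    if (node %in% keep) return(invisible(NULL))            # already visited
    if (!keep_prefix(node)) return(invisible(NULL))        # drop this node and, implicitly, its subtree
    keep <<- c(keep, node)
    for (child in edges$child[edges$parent == node]) visit(child)
  }
  for (r in roots) visit(r)
  edges[edges$parent %in% keep & edges$child %in% keep, ]  # keep only edges between retained nodes
}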

Notes

biodim commented 8 years ago

There are 1829 FF:X-Y IDs (see human phase 2 @ http://biomart.gsc.riken.jp/). Currently I have mapped 1596 of these samples (the fantom_samples data frame in the fantom_import module), that is, 1596 samples have a "cell" - "ontology ID" connection. The ontology file isn't consistent (it uses CD+ for some cells and CD-positive for others), plus it has spelling mistakes in some samples, so my scripts didn't catch all the cases. I'll probably go back and manually curate it.

A potential issue is separating the mouse samples from the human samples. For example, take a look at these two IDs: FF:3632-171A1 and FF:11641-122D3. The formats look identical, but one is a mouse microglial cell and the other is a human macrophage. From what I understand, all human samples have a 5-digit "start ID", and the human start ID always begins with 1. Technically you don't have to separate the two, because the fantom module only returns the human samples, but if you give it a mouse+human ID mix it will output something like "150 / 200 Ontology IDs matched", not necessarily because it is missing the other 50 but because those 50 were the mouse IDs.

As far as I know, the phase 2 data is just a bigger dataset; the phase 1 data is built into it. I somewhat verified this manually (I accidentally built the module with phase 1 data, and when I upgraded it to phase 2 I saw many of the phase 1 samples). The fantom module extracts the queries from a "phase1and2combined" file, so you don't have to worry about phase selection. Also, the length of the "phase1and2combined" file matches the length of phase 2 from BioMart.

I checked and the fantom module can take lists as input, but only if each term is separated by a comma (the spacing doesn't matter). If you subset the data each term won't necessarily be separated by a comma, but you can use a regex to replace all whitespace with "," and it will work (see the sketch below): "FF:X-Y FF:X2-Y2 FF:X3-Y3" -> "FF:X-Y,FF:X2-Y2,FF:X3-Y3". Let me know if you need assistance with this data-prep step.
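
A minimal base-R sketch of that data-prep step:

ids <- "FF:X-Y FF:X2-Y2   FF:X3-Y3"
gsub("\\s+", ",", trimws(ids))
# [1] "FF:X-Y,FF:X2-Y2,FF:X3-Y3"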

thejmazz commented 8 years ago

@biodim thanks for the comments. Going to dedicate today to this; I'll ping you if I have some questions.

thejmazz commented 8 years ago

Notes

list of good FF:A-B

FF:A-B -> celltype ?

FF:001

A -> 001 ??

heuristics -> cell lineage w/ only cells (e.g. no "1h", "sample")

  1. Build from OBO
  2. Query: find node
  3. find all siblings+children
  4. find parents at specified level
  5. count number of descendants
  6. leaf-list, root-list, components (rough igraph sketch below)
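
A rough igraph sketch of steps 1-6, assuming the OBO is_a relations have already been parsed into a hypothetical edges data frame with parent and child columns (edge direction parent -> child):

library(igraph)

g <- graph_from_data_frame(edges[, c("parent", "child")], directed = TRUE)   # 1. build from OBO edges

node   <- "CL:0000084"                                              # 2. query: find a node (example ID)
kids   <- neighbors(g, node, mode = "out")$name                     # 3. direct children...
parent <- neighbors(g, node, mode = "in")[1]
sibs   <- setdiff(neighbors(g, parent, mode = "out")$name, node)    #    ...and siblings
ups    <- ego(g, order = 2, nodes = node, mode = "in")[[1]]$name    # 4. ancestors up to a specified level
ndesc  <- length(subcomponent(g, node, mode = "out")) - 1           # 5. number of descendants
leaves <- V(g)[degree(g, mode = "out") == 0]$name                   # 6. leaf list...
roots  <- V(g)[degree(g, mode = "in") == 0]$name                    #    ...root list
ncomp  <- components(g)$no                                          #    ...number of components
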
burtonlm commented 8 years ago

Hey, here are some guidelines (heuristics) that I would follow to clean up the OBO file before converting to a cell lineage graph. Not sure if this helps you... Let me know if you were wanting something else.

- Keep terms that have 'CL' in their ID
- If you don't want to include cell samples, remove any terms with 'FF' in the ID. But maybe you want to include cell samples so that GATHER can use the tree to obtain expression profiles for cells and their backgrounds? I'm not sure about this...
- Remove anatomical parts (terms which contain 'UBERON' in the ID)
- Remove NCBI Taxons, which have 'NCBITaxon' in the ID
- Remove 'mouse' terms (if any part of the term (ID, is_a, etc.) contains 'mouse')
- Remove diseases ('DOID' in the ID)
- Remove molecules ('CHEBI' in the ID)
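
For what it's worth, those heuristics could look roughly like this in R, on a hypothetical terms data frame (one row per OBO term, with an id column and a text column holding the full stanza); all names here are placeholders:

keep <- grepl("^(CL|FF):", terms$id) &                           # keep CL (and optionally FF) terms
        !grepl("^(UBERON|NCBITaxon|DOID|CHEBI):", terms$id) &    # drop anatomy, taxa, diseases, molecules
        !grepl("mouse", terms$text, ignore.case = TRUE)          # drop anything mentioning mouse
cell_terms <- terms[keep, ]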

hyginn commented 8 years ago

Could you post these on your Student Wiki page? Thanks!

Boris


ghost commented 8 years ago

http://fantom.gsc.riken.jp/5/datafiles/phase2.0/basic/HumanSamples2.0.sdrf.xlsx

Take a look at this file. We can turn it into an R data frame and keep only the rows that are not 'time course' or 'fractionation', and you will have a data frame with the FF ontology IDs for only primary cells, tissues, or cell lines.
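
A minimal sketch of that subsetting (assuming the xlsx has been exported to CSV; the Category column name and its levels are taken from the output further down this thread):

humanSamples <- read.csv("HumanSamples2.0.sdrf.csv", stringsAsFactors = TRUE)
drop <- humanSamples$Category %in% c("time courses", "fractionations and perturbations")
backgroundPool <- humanSamples[!drop, ]          # primary cells, tissues, cell lines only
unique(backgroundPool$Category)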

thejmazz commented 8 years ago

@cL9821 Bless up. Nice find. I've added this to the checklist.

PS. I edited your comment so it's not just a link to "url".

ghost commented 8 years ago

Does phase 2.0 include all the cell lines and data from phase 1.0 and up?

thejmazz commented 8 years ago

I believe so; @biodim mentioned above that the combined phase 1 + 2 file matches the length of phase 2 from BioMart.

thejmazz commented 8 years ago

I saved HumanSamples2.0.sdrf.xlsx as CSV, read.csv()'d it, and cleaned up the colnames(). Commit soon.

Anyone have any ideas why this newline is introduced? It happens when subsetting too. Not really a problem since I can just gsub it out, but I'm curious as it seems to come out of nowhere lol (and it obscures the intent of the code):

> unique(humanSamples$Category)
[1] tissues                          cell lines                       primary cells                   
[4] time courses                     fractionations and perturbations
Levels: cell lines fractionations and perturbations primary cells time courses tissues
> gsub("\n*$", "", unique(humanSamples$Category))
[1] "tissues"                          "cell lines"                       "primary cells"                   
[4] "time courses"                     "fractionations and perturbations"

EDIT: I discovered levels():

> levels(humanSamples$Category)
[1] "cell lines"                       "fractionations and perturbations" "primary cells"                   
[4] "time courses"                     "tissues"
thejmazz commented 8 years ago

@cL9821 can you try to find the same thing for mouse? So we can remove all mouse IDs. This would solve what @biodim brought up:

A potential issue is separating the mouse samples from the human samples.

We can also just invert the IDs in the HumanSamples2.0 file, but getting the mouse IDs feels safer.

Any more "meta info" that seems useful for these FF:num and FF:A-B IDs would be great too. I can combine it all into one module so we can effortlessly play with different subsetting strategies and heuristics.

ghost commented 8 years ago

http://fantom.gsc.riken.jp/5/datafiles/latest/basic/MouseSamples2.0.sdrf.xlsx

Mouse IDs

biodim commented 8 years ago

Yep, phase 2 includes phase 1. We are pulling data from a file called "phase1and2combined", and its column count (minus the "description" columns) matches the phase 2 sample number under http://biomart.gsc.riken.jp/

I've looked into the FF:XXX vs FF:A-B IDs and I am 100% sure that the A-B forms are not cell-to-cell conversion identifiers:

1) the FANTOM data wasn't collected to facilitate cell-cell conversions, so it would be surprising if they identified their data that way

2) the parent "human monocyte immature derived DC" FF:0000044 has 4 children: FF:11227-116C3, FF:11308-117C3, FF:11384-118B7, FF:11228-116C4. The IDs are somewhat random. I think the A part and the B part are sequentially generated (i.e. samples that were processed/analyzed earlier got earlier IDs), so biologically related samples processed at a later time end up with different IDs. I think there might be something to the "LETTER-NUMBER" indicator at the end; I initially thought they were replicates/donors, but it is not that. The first 3 of the above FF IDs are the same cell, but from 3 different donors.

biodim commented 8 years ago

I think Dan's reverse engineering idea is pretty good. Take a look at a simple Mogrify query:

www.mogrify.net/joint_reprogrammers?source_ont=FF:0000004&target_ont=FF:0010019

We can see that the source and the target are two FANTOM parent nodes (FF:0000004 & FF:0010019; the "ont" stands for ontology).

Now take a look at the source code of the query, and at the /reprogrammers? section (a huge block of text).

Each "CNhs1423,CNhs11328" is actually a fantom ID. We can pull data from the fantom database using these values. For example

CNhs11328 is actually FF:10717-109I6

so the source: "FF:0000004" has a whole bunch of IDs associated with them (the CNhs1423,CNhs11328, etc)

the target: "FF:0010019" only has two (CNhs10625,CNhs11786)

if we cycle through their entire database and scrape the source we can associate each parent ID (FF no dash) with each child (FF:A-B)

Also if you check the source of view-source:http://www.mogrify.net/

they actually tell you their target/source fantom IDs so it'd be pretty easy to create an algorithm to scrape (you don't have to guess whether the parent IDs exist or not)

The big problems are:

1) Does it matter how the parents connect? (We can use the ontology to generate the connections, even if we do it manually.)

2) If it doesn't matter, then it would be perfect. We could use Mogrify's source-target mechanism (I think).

3) It's not complete. I estimate there are only around 100 source/target cells, while the entire FANTOM database is ~2000.

If we decide to go this route, let me know and I'll create a "CNhs11328 to FF:A-B" converter to get the proper IDs for retrieval.
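
A very rough base-R sketch of that scrape; it assumes the /joint_reprogrammers response embeds comma-separated CNhs accessions as described above:

url  <- "http://www.mogrify.net/joint_reprogrammers?source_ont=FF:0000004&target_ont=FF:0010019"
page <- paste(readLines(url, warn = FALSE), collapse = "\n")
cnhs <- unique(unlist(regmatches(page, gregexpr("CNhs[0-9]+", page))))
head(cnhs)    # these CNhs accessions would then be mapped back to FF:A-B IDs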

thejmazz commented 8 years ago

@cL9821 thanks, lol, it was literally just gsub("Human", "Mouse", url)

@biodim thanks for the insight into FF:XXX vs FF:A-B. I have 100% faith in your 100% belief :+1:

I also like @ontoscoper's reverse engineering approach. It is a great way to compare our subset to theirs, but I think we should use it as an addition, or for verification of our approach; it can probably fall under the scope of the VALIDATE module. That being said, it should definitely be used as part of the phylify module too, e.g. which nodes did we take that Mogrify didn't?

Nice source digging. Getting source and target IDs is definitely doable, as well as parsing the /reprogrammers? query.

I can modify my other scraper a bit to get those easily. You would just change the URL and the HTML parser a bit; onopentag, ontext, and onclosetag are pretty self-explanatory, you essentially just modify some flag variables through closures and build results when you're in the right tag.

If you can write the "CNhs11328 to FF:A-B" converter that would be great! i.e. I think it is totally worthwhile coding the reverse engineering approach.

Unfortunately I am busy tonight trying to get a proposal for something else done by tomorrow, but anyone should feel free to contribute. My goal is to create a set of functions that can be used to explore filtering techniques on the ontology. My next steps were to

The main idea is that "the filtering" will occur through easy-to-use, nicely named functions, so that you can jump in, script a few lines, check out the visualization, and improve the filtering.

Then Gather #9 will take the filtered graph (it should be an igraph object, but that's fine, there is the data frame to igraph function as mentioned above) and, given the threshold params, generate a list of IDs; a rough sketch is below.
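
Roughly what I have in mind for that helper, working on the filtered igraph object; treat the distance-based interpretation of the bounds (path length in edges) and the default values as placeholders, not a final API:

library(igraph)

getBackgroundList <- function(g, cellLine, bgLower = 1, bgUpper = 3) {
  d   <- distances(g, v = cellLine, mode = "all")[1, ]   # hops from the query node to every other node
  ids <- names(d)[d >= bgLower & d <= bgUpper]           # nodes within the edge-count bounds
  grep("^FF:", ids, value = TRUE)                        # only FF IDs get fed to FANTOM
}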

@everyone I have read through all your OBO heuristic suggestions. They look good, thanks. Will make a summary and continue work on this soon! Check here to see everyone's on one page.

thejmazz commented 8 years ago

@biodim how do I get that list of 1596 good FF:A-B IDs? I have done:

# Load up the fantom_import module
source("../fantom_import/fantom_main.R")

# How many FF IDs?
length(fantom_samples$FANTOM.5.Ontology.ID) # 1829
length(grep("FF", as.character(fantom_samples$FANTOM.5.Ontology.ID))) # 1596
# Building a comma separated string
fantom_ids <- gsub(" NA,", "", paste(fantom_samples$FANTOM.5.Ontology.ID, collapse=", "))
length(unlist(strsplit(fantom_ids, ", "))) # 1596

# This is taking a little while..
FO <- fantomOntology(fantom_ids)

Not sure which RData Sample to load (if any), or which variables those create. I ran fantomSummary() after loading one and it said it didn't have stuff. So maybe I load one of them and then do FO <- fantomOntology(fantom_ids)?

I just need the list though, not the counts. The output says

Returning RAW COUNTS
MATCHED: 1593 of 1593
1593 Search Result(s) Were Found. Loading...

so it looks like it should be simple enough to have a helper function for that part (maybe there already is one) and use it on its own? Let me know how best to proceed.

Essentially, I would just like a list of FF IDs that you have defined as "good" (i.e. ones we can actually get data for, to feed DESeq).

Also, let me know if you have a nicer way to build the comma separated list.
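
(One possibly cleaner option would be to drop the NAs before pasting, something like the base-R snippet below; untested against fantomOntology:)

ids        <- na.omit(as.character(fantom_samples$FANTOM.5.Ontology.ID))
fantom_ids <- paste(ids, collapse = ",")
length(ids)   # should be 1596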

biodim commented 8 years ago

All those 1596 are good, and we can get the count data for all of them for DESeq.

You don't need to do the comma separation anymore, Dan fixed it. So to get the "good" IDs:

#Get "Good" IDs
good_ids <- as.character(fantom_samples[!is.na(fantom_samples[,2]),2])

#Get Counts
fantomOntology(good_ids)

#Get Summary, i.e. the "Gather" Output
fantomSummarize(2)

Note the way the functions are used: they are kind of like commands rather than the standard R functions that you have to assign.

fantomOntology/fantomKeyword will generate a fantomResults dataframe, no need to assign it.

so even if you do:

FO <- fantomOntology("FF:13552-145I6")

Your FO variable will be NULL and all your data will be in your fantomResults file.

Same thing with fantomSummarize(), it will automatically generate a fantomCounts file (no need to assign, think "command to a module").

I did it this way to simplify things; there is quite a bit of work done in the background that would be prone to error, so it is done automatically for you.

This way, the entire "Gather" Process is simplified to this:

#Get Counts
fantomOntology(ids)

#Get Summary, i.e. the "Gather" Output
fantomSummarize(2)

This generates a fantomCounts data frame, and assuming the IDs are in a "cell of interest, background1, background2, ..., backgroundn" format, it is exactly what DESeq needs.

If you want to do it for two cells (cell a and cell b):

##Cell A
fantomOntology(ids_a)
fantomSummarize(2)
CBX_a <- fantomCounts

##Cell B
fantomOntology(ids_b)
fantomSummarize(2)
CBX_b <- fantomCounts

Loading 1600 IDs would be a nightmare though. It's roughly 2 minutes per ID.

thejmazz commented 8 years ago

Cool, thanks. Yeah, I killed it at 131 lol. From what I can see in your code, you are essentially looping through each ID, creating a URL, and then fread()ing everything.

I took one example access number and curled it:

~/Desktop$ curl "http://fantom.gsc.riken.jp/5/tet/search/?c=0&c=1&c=4&c=5&c=6&c=147&filename=hg19.cage_peak_phase1and2combined_counts_ann_decoded.osc.txt.gz" -O
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.9M    0 13.9M    0     0   386k      0 --:--:--  0:00:36 --:--:--  382k

-O writes the output to a file named after the URL, which will be a little messy in this case since the name will be ?c=0&....

So assuming each is ~15 MB, that leaves us with about 24 GB. Which is not that bad, Steam games can be like 60 GB these days lol. But we are limited by the FANTOM file server's transfer speed, which is a measly 400 kB/s or so.
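
Back-of-envelope behind those numbers (assuming ~1,600 files at ~15 MB each and ~400 kB/s):

n_files  <- 1600
mb_each  <- 15
total_gb <- n_files * mb_each / 1024          # ~23.4 GB
hours    <- n_files * mb_each / 0.4 / 3600    # ~16.7 hours of transfer time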

biodim commented 8 years ago

It's bearable when there are 10-20 IDs; let me know if you plan on having anything significantly more (50+ IDs) and I'll add a "fantomOffline" function. I would've done it sooner, but I don't want to hit their servers too much and I wasn't sure how to transfer all those FANTOM files (maybe a USB key).

thejmazz commented 8 years ago

I actually wrote up a quick bash script to loop through the access IDs and curl each one (I injected a return statement into one of your functions to print out the access IDs, then reset the changes). I can bring it to class on a hard drive on Tuesday, or maybe upload it to Google Drive. It's going to be around 20 GB for raw and 24 GB for RLE. At 785 for RLE and 1091 for raw atm. Raws should finish tonight at 2am, and RLEs at 6pm Monday by my estimations :p

Then it would just be a matter of pointing your fread() to a local server / local folder.

Another note: a lot of the file contents look to be identical, there are just some numbers that differ in some places. I wonder how one could economize on that and pipe out less data for the same information...

biodim commented 8 years ago

Yep, the only thing that is different is the counts (the last column); the first 5 columns are the same.

It was either grab the first five columns once and then, with a loop or something, slap them onto each individual count column, or grab all 6 columns at once.

You don't have to waste your bandwidth on the RLE counts; I don't think we will be using them for this project.

It would also be a bit more efficient if you only grab using this URL:

http://fantom.gsc.riken.jp/5/tet/search/?c=NUMBER&filename=hg19.cage_peak_phase1and2combined_counts_ann_decoded.osc.txt.gz

and replace NUMBER with 7 through 1835. This way, there will be no need to process the other columns. The other columns are pretty useful (UniProt ID, HGNC, etc.), but they won't come into use for this project.
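
A sketch of that single-column grab in R (the column range 7-1835 is from the comment above; the destination folder and file naming are arbitrary):

base <- "http://fantom.gsc.riken.jp/5/tet/search/?c=%d&filename=hg19.cage_peak_phase1and2combined_counts_ann_decoded.osc.txt.gz"
dir.create("fantom_counts", showWarnings = FALSE)
for (i in 7:1835) {
  dest <- file.path("fantom_counts", sprintf("column_%04d.txt", i))
  if (!file.exists(dest)) download.file(sprintf(base, i), destfile = dest, quiet = TRUE)
  Sys.sleep(1)   # be gentle on their server
}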

But you're almost done, so no need to start from scratch. I can write a script that will go through the files and drop the first 5 columns.

I'll then write a function that grabs column 2 once, appends the relevant counts to it, and then uses the existing scripts to process the file into a "CBX" format; a rough sketch is below.
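
Roughly, that merge could look like this, assuming each downloaded file holds a single count column, one file (grabbed once) holds the annotation/ID column, and everything has been decompressed; all names here are placeholders:

library(data.table)

anno   <- fread("fantom_counts/annotation_column.txt")                 # ID/annotation column, grabbed once
files  <- list.files("fantom_counts", pattern = "^column_", full.names = TRUE)
counts <- do.call(cbind, c(list(anno), lapply(files, fread)))          # append each count column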

Are the file names roughly in:

"293SLAM rinderpest infection, 12hr, biol_rep3.CNhs14415.13549-145I3" format? Do they end on the fantom ID?

thejmazz commented 8 years ago

Here's the first one (c=7 for raw). I believe it follows the format you described; I don't see it ending with the FANTOM ID.

(screenshot of the first downloaded file, 2016-03-27; open the image separately for full res)

I believe thats the URL I'm using for raw actually, but with some ?c=. Was wondering why those extras were there? I think param queries get overwritten if they are repeated. Oh, theres no spec lol. So yeah lol.