GDSC1000 cell line data

chapmandu2 commented 8 years ago

Just wanted to highlight (in case you weren't already aware) the recent publication in Cell (http://www.cell.com/cell/fulltext/S0092-8674(16)30746-2) of the Genomics of Drug Sensitivity in Cancer 1000 (GDSC1000) data set. What is interesting from our perspective is that all of the raw data has been released under the GPL, which means that there are no issues around redistribution. See below: http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources/Home.html

vjcitn commented 8 years ago

Thanks for the tip Phil

Any sense of how best to import/preprocess, e.g., the expression data? We could use affy::rma with the hgu219cdf.

I have no experience with that array.

chapmandu2 commented 8 years ago

Must admit I think it's great that they've released it all, but also feel a bit overwhelmed by it!! I also have no experience of that array but I was thinking along the same lines. There is a BrainArray custom cdf which I would be inclined to use since it gives a single value per gene mapped to an ENSEMBL identifier: http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/20.0.0/ensg.asp

They have also provided their own processed data so that may be worth investigating although perhaps better to start from the cel files?

It's a fantastic dataset, although quite how best to integrate this with the CCLE and various other datasets (including RNAseq) is an open question! But it's a nice one to work with, I think, because it's hot off the press; the CCLE data was published 4 years ago and was generated well before that, so it is looking a bit long in the tooth.

vjcitn commented 8 years ago

Working from the mature data as used in the manuscript makes sense and seems a good test case for MAE and, I would assume, a disk-backed approach. We should divide and conquer. Possible plan:

0) check whether any of the cancer cloud providers have already processed the data for use, in which case we could focus on interface to these and perhaps skip to part 2) below

1) identify the specific resources at http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources/Home.html that we want to represent

2) identify some use cases, e.g. replicating key findings of the paper as special cases of generic operations

3) define the API for our GDSC1K resource

4) assign data components to various team members and set some milestones

5) benchmark solutions to the use cases

chapmandu2 commented 8 years ago

Thanks Vince, your plan sounds like a good one and I'd be happy to contribute. I would suggest focussing in the first instance on the affy data and the mutation data, since those are probably the most frequently used, then broadening out to the other data types (such as methylation and copy number).

Following on from our previous discussions about actual compound screening data in CCLE, I would perhaps suggest leaving that out of scope in the first instance, or considering it as a phase 2. Let's not try to eat the elephant all in one go.

chapmandu2 commented 8 years ago

Having looked at the data in some more detail, here are my thoughts on what needs doing:

Affy

RMA values are provided here - http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip
But these only provide gene symbols against an unknown genome version, and would need to be turned into a SummarizedExperiment or ExpressionSet
So my recommendation would be to just normalise the whole lot against the version 20 CDF for the HG_U219 array here - http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/20.0.0/ensg.asp Not sure which version of Ensembl this is, but we should be able to find out.
The CEL files, and the sample info mapping CEL file names to cell line ids, are on ArrayExpress here - https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3610/files/
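
The relabelling step implied above (CEL file names to cell line ids) could be sketched roughly as follows. This is only an illustration in Python; the thread itself proposes R/Bioconductor for the actual normalisation. The column names `Array Data File` and `Characteristics[cell line]` are assumptions about the sample table and should be checked against the real ArrayExpress file, and the toy rows are made up.

```python
import csv
import io

# Hypothetical column names; check them against the real sample table.
CEL_COLUMN = "Array Data File"
CELL_LINE_COLUMN = "Characteristics[cell line]"


def read_cel_to_cell_line(handle):
    """Return {CEL file name: cell line id} from a tab-delimited sample table."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        cel = row[CEL_COLUMN].strip()
        cell_line = row[CELL_LINE_COLUMN].strip()
        # The same CEL file annotated with two different cell lines is an error.
        if cel in mapping and mapping[cel] != cell_line:
            raise ValueError(f"conflicting cell lines for {cel}")
        mapping[cel] = cell_line
    return mapping


def relabel_columns(cel_names, mapping):
    """Translate expression-matrix column names (CEL files) to cell line ids.

    Fails loudly on unmapped CEL files instead of dropping them silently.
    """
    missing = [name for name in cel_names if name not in mapping]
    if missing:
        raise KeyError(f"no cell line annotation for: {missing}")
    return [mapping[name] for name in cel_names]


# Toy data standing in for the real sample table (values are made up).
toy_table = io.StringIO(
    "Array Data File\tCharacteristics[cell line]\n"
    "example_A.cel\tCELL-LINE-1\n"
    "example_B.cel\tCELL-LINE-2\n"
)
cel_to_line = read_cel_to_cell_line(toy_table)
relabelled = relabel_columns(["example_B.cel", "example_A.cel"], cel_to_line)
print(relabelled)
```

Raising on unmapped or conflicting entries is a deliberate choice here: with around a thousand arrays, a silently dropped or mislabelled column would be hard to spot later.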

Mutation data:

This is provided in Table S2C here - http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/suppData/TableS2C.xlsx
It provides Ensembl transcript ids and gene symbols, but against version 56 of Ensembl (!), so it would need updating to match the Ensembl version used for the affy data
It doesn't have genomic coordinates as far as I could see
We would want to include mutation information - silent, missense, nonsense, truncating, effect on protein etc
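
As a rough illustration of the mutation categories listed above, here is a minimal Python sketch that derives a coarse consequence label from a protein-level change. It assumes the change is written in a short HGVS-like form such as `p.V600E`; whether Table S2C uses exactly this notation has not been checked here, and the function names are hypothetical.

```python
import re

# Matches single-residue changes such as "p.V600E" or "p.R213*".
_SUBSTITUTION = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def classify_protein_change(change):
    """Return a coarse consequence label for an HGVS-like protein change."""
    if "fs" in change:
        return "frameshift"
    match = _SUBSTITUTION.match(change)
    if match is None:
        return "other"  # anything this sketch does not recognise
    ref, _position, alt = match.groups()
    if alt == ref:
        return "silent"
    if ref == "*":
        return "stop_lost"
    if alt == "*":
        return "nonsense"
    return "missense"


def is_truncating(change):
    """Treat nonsense and frameshift changes as truncating."""
    return classify_protein_change(change) in {"nonsense", "frameshift"}


for example in ["p.V600E", "p.R213*", "p.L125L", "p.K100fs*5"]:
    print(example, classify_protein_change(example), is_truncating(example))
```

In practice the labels would more likely come from re-annotating the variants against the chosen Ensembl version than from string parsing, so this is best read as a description of the target categories.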

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

waldronlab / MultiAssayExperiment

GDSC1000 cell line data #154