hdng / clonevol

Inferring and visualizing clonal evolution in multi-sample cancer sequencing
GNU General Public License v3.0

cluster.counts #25

Open amanda-fitz opened 5 years ago

amanda-fitz commented 5 years ago

Hi there, can anyone help? I have two things I'd like to add to my cluster.counts data file.

  1. The cluster number assigned by ClonEvol (1, 2, 3, 4, etc.), added to the cluster.counts data table. Currently my cluster.counts table only has the PyClone cluster ID.

  2. The median CCF per sample (currently there is just a single median.ccf for each cluster).

Attached is an example of my cluster.counts file: LM002_cluster.counts.xlsx (https://github.com/hdng/clonevol/files/2533239/LM002_cluster.counts.xlsx)

Here is my script:

library(dplyr)  # %>%, mutate, select, group_by, summarize, filter, arrange
library(tidyr)  # spread

pyclone.directory <- '/Users/amandafitzpatrick/Library/Mobile Documents/com~apple~CloudDocs/DOCUMENTS/E57 exome sequencing/2018-08-30_results_ascat_pyclone/pyclone'
output.directory <- '/Users/amandafitzpatrick/Library/Mobile Documents/com~apple~CloudDocs/DOCUMENTS/E57 exome sequencing/2018-08-30_results_ascat_pyclone'
sample.sheet.file <- 'sample_annotation.txt'

min.mutation.count <- 30
cancer.genes <- scan('/Users/amandafitzpatrick/Library/Mobile Documents/com~apple~CloudDocs/DOCUMENTS/E57 exome sequencing/2018-08-30_results/Exome Sequencing/COMBINED list Stratton plus Caldas.txt', what = character())
patient.id <- 'LM002'

loci.file <- file.path(pyclone.directory, patient.id, 'output', 'tables', 'annotated_loci.tsv')
loci <- read.table(loci.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)

sample.sheet <- read.table(sample.sheet.file, header = TRUE, sep = '\t', stringsAsFactors = FALSE)

clonevol.data <- loci %>%
  mutate(
    vaf = 100 * cellular_prevalence / 2,
    is.driver = symbol %in% cancer.genes & 'exonic' == func & 'synonymous_SNV' != exonic_func
  ) %>%
  select(mutation_id, cluster_id, sample_id, vaf, symbol, is.driver) %>%
  spread(sample_id, vaf)

n.samples <- length(unique(loci$sample_id))
if (1 == n.samples) stop('Need more than one sample for ClonEvol!')

cluster.counts <- loci %>%
  group_by(cluster_id) %>%
  summarize(
    count = n() / n.samples,
    min.ccf = min(cellular_prevalence),
    median.ccf = median(cellular_prevalence),
    mean.ccf = mean(cellular_prevalence)
  ) %>%
  ungroup() %>%
  filter(count >= min.mutation.count) %>%
  arrange(-median.ccf)

recode.values <- 1:nrow(cluster.counts)
names(recode.values) <- as.character(cluster.counts$cluster_id)

clonevol.data <- clonevol.data %>%
  select(-mutation_id) %>%
  filter(cluster_id %in% cluster.counts$cluster_id) %>%
  mutate(cluster = recode.values[as.character(cluster_id)])

hdng commented 5 years ago

Hi @amanda-fitz,

I am not sure if I understand your question completely, but:

(1) clonevol doesn't perform clustering. It takes the clustering from PyClone, reconstructs the consensus clonal evolution tree, and estimates the clonal admixture for individual samples.

(2) clonevol can use/estimate either the median or the mean CCF. The infer.clonal.models function has a parameter called cluster.center that takes the string "mean" or "median".
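To see why the choice between "mean" and "median" in point (2) can matter, here is a minimal base R sketch. It does not call ClonEvol at all; the CCF values are invented, and it only illustrates how the two kinds of cluster center react to a single outlying variant.

```r
# Invented CCF values for the variants of one cluster in one sample;
# the last variant is an outlier.
ccf <- c(0.48, 0.50, 0.52, 0.51, 0.95)

center.mean   <- mean(ccf)    # pulled upward by the outlier
center.median <- median(ccf)  # stays close to the bulk of the variants

print(c(mean = center.mean, median = center.median))
```

With noisy clusters, the median is usually the more robust summary, which may be a reason to prefer cluster.center = "median" in such cases.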

amanda-fitz commented 5 years ago

Hi, thanks for your reply and explanations.

My question is actually very simple, but perhaps I didn't explain it well.

I would like a numerical output for the variant cluster plot. From that plot, I would like the cluster number (i.e. the number assigned by ClonEvol, here 1, 2, 3, etc., which I understand is the PyClone cluster with a new ID assigned) and, for each cluster, the median CCF by sample type. My script generates a 'cluster.counts' data file, but it contains only a single median CCF value per cluster and the PyClone cluster ID. I imagine it would be straightforward to obtain such a data file, given that this data is used to make the cluster plot?
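A minimal sketch of the table being asked for, written in base R so it runs without extra packages. It assumes a long-format table with the same column names as the loci table in the script above (cluster_id, sample_id, cellular_prevalence); the data values, the sample names and the output column names are invented for illustration. This is not ClonEvol's own output, only one way to summarize the PyClone table directly.

```r
# Toy long-format table standing in for PyClone's loci output
# (column names follow the script above; all values are invented).
loci <- data.frame(
  mutation_id = rep(paste0("m", 1:6), each = 2),
  sample_id   = rep(c("primary", "relapse"), times = 6),
  cluster_id  = rep(c(0, 0, 0, 3, 3, 3), each = 2),
  cellular_prevalence = c(0.95, 0.90, 0.98, 0.92, 0.99, 0.94,
                          0.40, 0.10, 0.42, 0.12, 0.38, 0.08),
  stringsAsFactors = FALSE
)

# Median CCF for every cluster/sample pair (rows = clusters, columns = samples).
median.ccf <- tapply(loci$cellular_prevalence,
                     list(loci$cluster_id, loci$sample_id),
                     median)

# One row per cluster, one median column per sample.
per.sample <- data.frame(cluster_id = as.numeric(rownames(median.ccf)))
for (s in colnames(median.ccf)) {
  per.sample[[paste0("median.ccf.", s)]] <- unname(median.ccf[, s])
}

# Overall median per cluster, used to order the clusters as the script does.
overall <- tapply(loci$cellular_prevalence, loci$cluster_id, median)
per.sample$overall.median.ccf <- unname(overall[as.character(per.sample$cluster_id)])

# New contiguous cluster number: 1 = highest overall median CCF.
per.sample <- per.sample[order(-per.sample$overall.median.ccf), ]
per.sample$cluster <- seq_len(nrow(per.sample))
rownames(per.sample) <- NULL

print(per.sample)
```

The resulting data frame keeps the original PyClone cluster_id next to the renumbered cluster column, so it can be written out with write.table or joined back onto cluster.counts by cluster_id.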


hdng commented 5 years ago

1) ClonEvol doesn't reassign cluster IDs, so the IDs in all plots should match those from PyClone. One thing I should note is that ClonEvol requires contiguous integers as cluster IDs, with "1" set as the founding cluster.

2) If you are looking for the CCF estimates (and their confidence intervals) of the clones within individual samples, these are encoded in the output data frame for the model, e.g.

y = infer.clonal.models(...)

# CCF of clones in the first tree,
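Regarding the requirement in point 1) that cluster IDs be contiguous integers with 1 as the founding cluster: PyClone IDs start at 0 and can have gaps after filtering, so they usually need recoding first. A minimal base R sketch follows, mirroring the recode.values step in the script earlier in the thread. The data are invented, and treating the cluster with the highest median CCF as the founding cluster is an assumption that should be checked against the actual data.

```r
# Hypothetical PyClone cluster IDs (0-based, with gaps after filtering)
# and the median CCF of each cluster; all values are invented.
clusters <- data.frame(
  cluster_id = c(0, 2, 5, 7),
  median.ccf = c(0.35, 0.97, 0.10, 0.60)
)

# Assumption: the cluster with the highest median CCF is the founding clone,
# so it comes first and will receive ID 1.
ordered.ids <- clusters$cluster_id[order(-clusters$median.ccf)]

# Per-variant PyClone cluster labels (invented).
variant.cluster_id <- c(2, 2, 0, 7, 5, 0, 7)

# match() returns the position of each label in ordered.ids,
# which is exactly the new contiguous ID 1..K.
variant.cluster <- match(variant.cluster_id, ordered.ids)

print(data.frame(cluster_id = variant.cluster_id, cluster = variant.cluster))
```

Keeping both the original and the recoded column makes it easy to translate the IDs shown in plots back to the PyClone clusters.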

guoxueyu commented 4 years ago

Hi @amanda-fitz @hdng, when I use ClonEvol to visualize my data (two samples), I encounter the problem @amanda-fitz described above. Why is the cluster number in ClonEvol's aml1$variant the same in P and R? And how can I build the input data from the PyClone clustering results of the two different samples, given that they have different cluster names?


Here are the results for the two samples:

(1) Cluster results for sample_one:

Gene_site CCF CCF_id Cluster_id
SPTBN4_19:40993610 0.000577717970064617 0.004905701953290807 0
SSSCA1_11:65339085 0.000606208481679261 0.007941526831518725 0
STARD10_11:72466059 0.0005419499192131997 0.00166324568158397 0
STRIP2_7:129098248 0.0005328553104007415 0.0009215148845033685 0
TLN2_15:63054620 0.0005334327589028977 0.0010294553920328 0
TLR3;FAM149A_4:187038619 0.0006052806746053058 0.007838195365478317 0
TMEM200A_6:130762164 0.0005869335930044629 0.005927266391566769 0
TMTC2_12:83358854 0.0006003953976641044 0.007264025382633757 0
TOP2B_3:25671580 0.0005748332535731366 0.0047982396078397925 0
TRPM2_21:45795753 0.0006065139423596613 0.007957071492527208 0
TTN_2:179664351 0.0005664532000250138 0.003969393376836105 0
TUBB4A_19:6495416 0.0005567039496530104 0.003026979374553545 0
ZBTB12_6:31868133 0.0006164475427949428 0.008866946468675582 0
ZIC4_3:147109962 0.0005951498207536907 0.006814823106293941 0
ZKSCAN3_6:28327605 0.0005654365502695898 0.0036983320975591282 0
ZNF343_20:2464396 0.0005499692553414132 0.002405898570181837 0
ZNF701_19:53086167 0.0005536185073358852 0.002725997394167213 0
ZNF749_19:57956145 0.0006049772129360563 0.007697991223585969 0
ZNF841_19:52569832 0.0006196443033787964 0.00921261354807395 0
ZNF843_16:31447342 0.0005366334034009819 0.0012061584107274804 0
GPR83_11:94129586 0.0006347704608802662 0.006913299009497001 1

(2) Cluster results for sample_two:

Gene_site CCF CCF_id Cluster_id
TOP2B_3:25671580 0.2679976638493413 0.08677916938910878 0
TRPC3;KIAA1109_4:123075318 0.2700058093836503 0.08390908294681121 0
TUBB4A_19:6495416 0.249111904572308 0.06286147278161541 0
ZIC4_3:147109962 0.2595193323398667 0.07319550876698379 0
ZNF343_20:2464396 0.2904487190531294 0.11464440159241955 0
ZNF701_19:53086167 0.2564645807826187 0.07008302067765426 0
ZNF749_19:57956145 0.2824362473561567 0.07832321938377844 0
ZNF841_19:52569832 0.26211001388249444 0.07244177517264004 0
MSR1_8:16001067 0.4011107153840252 0.10934799632753751 1
SCNN1G_16:23226531 0.4166061871271859 0.11698129469678656 1
IGF2R_6:160467624 0.294294450939142 0.07744067217978816 2
ADGRG7_3:100373931 0.29995597026745097 0.07158426019495136 3
FBXL19_16:30941644 0.3963851144129694 0.11279595292023216 4
PTPRT_20:40827887 0.39031419763822717 0.1013499009563431 4
ZNF843_16:31447342 0.41546292701815574 0.12776067211779962 4
OR52A1_11:5172907 0.3001542295655507 0.07641573409434724 5
MAT1A_10:82040067 0.29114574474311355 0.07695134578642827 6
DLG1_3:196910782 0.3503502187183945 0.14386002892864064 7
SERPINB12_18:61231325 0.3544039453594735 0.12744488185490602 8
SMAD4_18:48593504 0.3751574425813409 0.11794786469061391 9
POLN_4:2172442 0.2941529871391146 0.07363903851906502 10
NAPG;LINC01887_18:10605605 0.3134554089627867 0.10826041109866213 11
ZBTB12_6:31868133 0.6114864929361972 0.02881686518027521 12
C17orf99_17:76161546 0.2950922376368506 0.0773892683946796 13
MYLK4_6:2683370 0.30251509465459386 0.0671273900328978 14
PHF20L1_8:133824901 0.27898538053218513 0.06688004361287586 15
GPR83_11:94129586 0.28794296706965705 0.07382748745568672 16
ZKSCAN3_6:28327605 0.5372857320231438 0.06788054734282246 17
GLDN_15:51696672 0.06772532068056986 0.030602432692736777 18
HLA-C_6:31239613 0.06388765142821265 0.02212462084329112 18
MYOD1_11:17741386 0.06654192335768135 0.027226858496019704 18
LRP1B_2:141291599 0.29400052351621403 0.07200863303482705 19
SOX4_6:21595694 0.29829178384604516 0.06795922586380174 20
SNHG28_1:159805417 0.29787811581751417 0.0782328817670741 21
MPPED2_11:30432357 0.28472389887067157 0.06739179734110781 22
KLHL40_3:42727844 0.25080231841198364 0.11814284530406437 23
TP53_17:7578212 0.32269285539730724 0.12890300392747847 24
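As noted earlier in the thread, ClonEvol takes the clustering from PyClone rather than clustering itself, so it needs one cluster label per variant that means the same thing in every sample. The two tables above were clustered separately, so their Cluster_id values are not comparable. A minimal base R sketch that makes this visible by joining a few rows copied from the two tables (CCF values rounded) on Gene_site; the data frame and column-suffix names are invented for illustration.

```r
# A few rows copied from the two tables above (CCF values rounded).
sample.one <- data.frame(
  Gene_site  = c("TOP2B_3:25671580", "ZBTB12_6:31868133",
                 "ZNF843_16:31447342", "GPR83_11:94129586"),
  CCF        = c(0.00057, 0.00062, 0.00054, 0.00063),
  Cluster_id = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)
sample.two <- data.frame(
  Gene_site  = c("TOP2B_3:25671580", "ZBTB12_6:31868133",
                 "ZNF843_16:31447342", "GPR83_11:94129586",
                 "TP53_17:7578212"),
  CCF        = c(0.268, 0.611, 0.415, 0.288, 0.323),
  Cluster_id = c(0, 12, 4, 16, 24),
  stringsAsFactors = FALSE
)

# Full outer join: one row per variant, one CCF/Cluster_id pair per sample.
joined <- merge(sample.one, sample.two, by = "Gene_site", all = TRUE,
                suffixes = c(".one", ".two"))

# Variants clustered in both samples.
shared <- joined[!is.na(joined$Cluster_id.one) & !is.na(joined$Cluster_id.two), ]

# Cross-tabulate the two labelings to see how they disagree.
print(table(sample_one = shared$Cluster_id.one,
            sample_two = shared$Cluster_id.two))

# Sample-one cluster 0 is spread over several sample-two clusters.
split.of.zero <- sort(unique(shared$Cluster_id.two[shared$Cluster_id.one == 0]))
print(split.of.zero)
```

Because one sample_one cluster maps onto several sample_two clusters, there is no simple renaming that reconciles the two labelings. The usual way around this, stated here as an assumption about the intended workflow rather than something confirmed in this thread, is to run PyClone jointly on both samples so that every variant gets a single shared cluster ID, and then build the ClonEvol input from that joint result.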