morrislab / phylowgs

Application for inferring subclonal composition and evolution from whole-genome sequencing data.
GNU General Public License v3.0

Extracting clonal SSMs #75

Closed underbais closed 6 years ago

underbais commented 6 years ago

Hello Jeff,

I was able to visualize the tree and the CCF plots locally. Now my question is how to extract the actual SSMs from those JSON files. For example, I want to see which SSMs a given clone consists of.

https://github.com/morrislab/smchet-challenge cannot work with multi-sample data, as stated in its README.

So, what would be the best way to parse clonal composition?

Thanks Chingiz

underbais commented 6 years ago

Any hope here?

quaidmorris commented 6 years ago

Hi Chingiz, This is something that is a priority feature for us. Unfortunately, I am not sure when we will be able to do it. The information that you want is available in the JSON file -- you could try parsing it yourself if you are in a hurry.

AmitDeshwar commented 6 years ago

Hi Chingiz, If you look at the contents of mutass.zip it will have the information you're looking for.

Amit

underbais commented 6 years ago

Hi Amit,

My mutass folder has 2499 JSON files. So my questions are:

  1. What do those files represent? (1 file = 1 SSM?)
  2. What is the structure of a file? I found only 2 clusters in each file, whereas the tree in the browser shows 5 clones.
  3. Are there SSM-to-clone assignments in those files?

Overall, the package is not very user-friendly, and it is clearly not meant for reproducible research. Sorry.

Thanks Chingiz

AmitDeshwar commented 6 years ago

Hi Chingiz,

I'm sorry you find JSON hard to understand. Each JSON file represents one sampled tree; there should be 2500 of them if PhyloWGS was run with default parameters. Each JSON file contains a nested dictionary. You're looking for 'mut_assignments', which is a dictionary where the keys are cluster numbers and the values are two lists containing the mutations assigned to that cluster for that sampled tree.

underbais commented 6 years ago

Hi Amit,

Thanks for your reply. Maybe a picture can help here: [image: phylowgs-questions]. Could you please answer the questions in the screenshot?

Also, I managed to parse the JSON files. So why exactly 2 lists of SSMs per cluster? And why only 2 clusters?

Thanks Chingiz

underbais commented 6 years ago

Hope to get answers soon. Thanks

jarrybarber commented 6 years ago

Hi Chingiz,

To answer the questions from your picture:

  1. You can find the SSMs assigned to that node by opening up the mutass.zip file associated with this run (pd15) and opening the json file associated with this tree, named after the index given to this tree, in this case 113.
  2. Yes, the json files are named after the tree indexes.
  3. The "Nodes" column corresponds to the total number of subpopulations that make up the sample in this particular tree, including the normal cells that make up the sample. The "0" node represents this normal cell population. So there is 1 normal cell population + 5 distinct cancerous cell populations = 6.

For each cancerous population we assign both SSMs and CNVs and so there are two lists associated with each cluster. I am unsure as to why you are only seeing 2 clusters as the number of clusters depends on the tree we are looking at. Perhaps if you sent us the output from write_results we can look into this.
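Putting the answers in this thread together, reading the SSMs for one tree could be sketched roughly as follows. This is a minimal sketch, not part of PhyloWGS: the helper names are made up for illustration, the archive member is assumed to be named `<tree index>.json`, and each value in 'mut_assignments' is assumed to be a mapping with an "ssms" list and a "cnvs" list. Check one file by hand before relying on any of these assumptions.

```python
import json
import zipfile


def load_mut_assignments(zip_path, tree_index):
    """Return the 'mut_assignments' dictionary for one sampled tree."""
    # Assumption: each archive member is named after its tree index.
    member = f"{tree_index}.json"
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            tree = json.load(handle)
    return tree["mut_assignments"]


def ssms_per_cluster(mut_assignments):
    """Map each cluster number to a sorted list of its SSM identifiers."""
    # Assumption: each value holds two lists under the keys "ssms" and "cnvs".
    return {
        cluster: sorted(lists["ssms"])
        for cluster, lists in mut_assignments.items()
    }
```

For the tree discussed above, a call such as `ssms_per_cluster(load_mut_assignments("pd15.mutass.zip", 113))` (the archive path is hypothetical) would return one entry per cluster in that tree.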

Regards, Jarry

underbais commented 6 years ago

Hi Jarry,

Thanks for the clarification. I reran evolve.py and now see all the clusters in the trees. So finally, how do I pick the best tree? And what do nlgLH, LI and BI mean?

Chingiz

jarrybarber commented 6 years ago

Hi Chingiz,

The most direct way to pick the "best tree" is by nlgLH (normalized log likelihood): the lower the nlgLH, the better that tree describes the given data. Keep in mind, though, that it is important to consider the other trees reported by PhyloWGS: they may not be the "best", but they could still explain the true phylogeny. We're working on new methods, which we anticipate finishing in the next month, to better summarize the different types of trees reported by PhyloWGS.
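As a rough illustration of that advice, the sketch below picks the tree with the lowest nlgLH and also lists the trees that score almost as well. It assumes you have already collected the nlgLH values yourself (for example, from the tree viewer). The function name, the `tolerance` cutoff, and the example scores are all invented for illustration and do not come from PhyloWGS.

```python
def rank_trees(nlglh_by_tree, tolerance=0.05):
    """Return (best_index, near_best_indices) given {tree index: nlgLH}."""
    if not nlglh_by_tree:
        raise ValueError("no trees supplied")
    # Lower nlgLH means the tree describes the data better.
    best = min(nlglh_by_tree, key=nlglh_by_tree.get)
    # Arbitrary cutoff (an assumption): keep trees within `tolerance` of the best.
    cutoff = nlglh_by_tree[best] + tolerance
    near_best = sorted(
        (index for index, score in nlglh_by_tree.items() if score <= cutoff),
        key=nlglh_by_tree.get,
    )
    return best, near_best


# Made-up scores, for illustration only.
scores = {113: 4.21, 87: 4.23, 1502: 4.90}
best, near_best = rank_trees(scores)
```

Inspecting everything in `near_best`, rather than only `best`, is one simple way to keep the alternative trees in view.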

LI (Linearity Index), BI (Branching Index), and CCI (Coclustering Index) are indices which are determined by the structure of a tree. As their names suggest, these can be used to see how linear, branched or clustered a tree is. LI shows what proportion of mutation pairs are in linear relations -- i.e., given mutations A and B, they're linear if A is in a population ancestral to B, or vice versa. BI shows what proportion of mutation pairs occur in different branches of the tree. CCI shows what proportion of mutation pairs are placed in the same cluster (i.e., population).
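To make these definitions concrete, here is a small sketch that counts mutation pairs in a toy tree. It follows the verbal definitions above and is not PhyloWGS's own implementation, whose normalization may differ. The inputs are hypothetical: `parent` maps each node to its parent, and `ssms_by_node` maps each cancerous node to its list of SSMs.

```python
from itertools import combinations


def ancestors(node, parent):
    """Return the set of all ancestors of `node`, following `parent` upward."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def tree_indices(parent, ssms_by_node):
    """Return LI, BI and CCI as proportions of all mutation pairs."""
    counts = {node: len(ssms) for node, ssms in ssms_by_node.items()}
    total = sum(counts.values())
    n_pairs = total * (total - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two mutations")
    # Pairs of mutations that sit in the same node are coclustered.
    coclustered = sum(c * (c - 1) // 2 for c in counts.values())
    linear = branching = 0
    for a, b in combinations(counts, 2):
        pairs = counts[a] * counts[b]
        if a in ancestors(b, parent) or b in ancestors(a, parent):
            linear += pairs  # one population is ancestral to the other
        else:
            branching += pairs  # the populations lie on different branches
    return {
        "LI": linear / n_pairs,
        "BI": branching / n_pairs,
        "CCI": coclustered / n_pairs,
    }


# Toy tree: normal node 0 -> node 1, which has two children, 2 and 3.
toy_parent = {1: 0, 2: 1, 3: 1}
toy_ssms = {1: ["s0", "s1"], 2: ["s2"], 3: ["s3"]}
indices = tree_indices(toy_parent, toy_ssms)
```

In this toy tree there are six mutation pairs: four are linear (each SSM in node 1 paired with the SSM in node 2 or node 3), one is branching (node 2 versus node 3), and one is coclustered (the two SSMs in node 1). Under these definitions the three proportions sum to 1.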

Cheers, Jarry