greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Add script to do exploratory data analysis on TCGA data #17

Closed: jjc2718 closed this PR 3 years ago

jjc2718 commented 3 years ago

Closes #9.

This PR adds a notebook to answer some questions about the TCGA data before fitting any models.

The answers are pretty much what we expected, but it will be good to have this info going forward.

jjc2718 commented 3 years ago

Thanks for the detailed feedback! Responses below:

Just wanted to clarify: when you say >5% mutated, >50 samples mutated, does this mean you are looking for cancer datasets where all genes are mutated in at least 50 samples? So is 5% mutated equivalent to 50 samples being mutated? This is mutated with respect to some reference, I assume. This is probably my lack of cancer background, but is there a reason you are using this filter?

No worries, this is a bit confusing if you're not familiar with the project. We're filtering gene/cancer type combinations. In other words, say we're interested in TP53 mutations: we're looking for cancer types where at least 5% of samples and at least 50 samples have a mutation in TP53. The 5% and 50 samples are two separate filters; if a cancer type fails either one, we drop that combination.

Then we repeat this for each gene, in this case the top 50 most mutated genes across all of TCGA (although I'll likely generalize this to more genes in the future).

The reason for this is more of an ML/prediction-related reason than a cancer reason: we think it will be hard to build a good classifier if we have very few positively labeled samples, so we filter out those cases as a preprocessing step. In the future, we could do something special to handle these cases (like applying outlier detection methods or classifiers specifically designed for imbalanced labels), but for now we're just ignoring them and focusing on the cases where we have more balanced labels.
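For concreteness, here's a minimal sketch of that two-part filter. The function name and the binary mutation-status input are hypothetical, not the repo's actual API:

```python
import pandas as pd

def passes_filters(mutation_status, min_prop=0.05, min_count=50):
    """Two separate filters for one gene/cancer type combination.

    mutation_status: binary pd.Series, 1 if the sample carries a point
    mutation in the gene of interest. Failing either filter drops the
    gene/cancer type combination.
    """
    n_mutated = mutation_status.sum()
    return (n_mutated / len(mutation_status) >= min_prop) and (n_mutated >= min_count)
```

For example, a cancer type with 40 mutated samples out of 400 passes the 5% filter (10% mutated) but fails the 50-sample filter, so it would still be dropped.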

pancancer_data is a df of dfs?

It's just a tuple of dfs. I'll document this in the script; I agree that it is a bit confusing.

I assume pancancer_data contains the metadata associated with rnaseq_df?

Yep, exactly (info about mutations, mutation burden, cancer type, etc. for the samples in rnaseq_df).
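To make that structure explicit, the notebook could unpack the tuple with descriptive names. This is illustrative only; the real tuple's length and order are defined by the data-loading function, so check its docstring before unpacking:

```python
# Hypothetical element names; the actual tuple contents are documented
# in the data-loading script.
sample_info_df, mutation_df, mut_burden_df = pancancer_data
```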

Not sure if you want to update this print statement to refer to other genetic variation types in addition to mutations: print('Genes with mutation information: {}'.format(len(overlap_genes)))

At the moment, I'm just looking at point mutations. Integrating copy number information gets a bit complicated (we need to know whether each gene is an oncogene or tumor suppressor or neither to integrate that information), so I'm not using it here but I'd definitely like to add it in the future.
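When we do add it, the logic will probably look something like this sketch. The dataframe names and role labels here are assumptions for illustration, not the repo's actual API:

```python
# Copy number only counts as a "hit" in the direction that matches
# the gene's role; point mutations always count.
def combined_mutation_status(gene, role, mutation_df, copy_gain_df, copy_loss_df):
    status = mutation_df[gene].astype(bool)
    if role == 'oncogene':
        # amplifications can activate oncogenes
        status |= copy_gain_df[gene].astype(bool)
    elif role == 'tumor suppressor':
        # deep deletions can inactivate tumor suppressors
        status |= copy_loss_df[gene].astype(bool)
    return status.astype(int)
```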

I assume the numbers in cancer_types_df refer to the number of samples per cancer type

Yep, exactly!
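(For the record, that table is essentially a one-liner; the column name here is an assumption:)

```python
# assuming a per-sample cancer type column named 'cancer_type'
cancer_types_df = sample_info_df['cancer_type'].value_counts()
```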

Still learning about the best way to define paths, so perhaps this is obvious. But is there a reason you're using .resolve() instead of os.path.join, or defining this path in your cfg?

I've been using pathlib everywhere in this repo (I think Milton suggested it in one of my past reviews), so I'm just using it here (rather than os.path) for consistency. I could define it in the config, but I don't use this file much elsewhere, so I don't think that's necessary for now.
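For comparison, the two idioms side by side; the paths are just examples (and note that `__file__` exists in scripts, not notebooks):

```python
import os
from pathlib import Path

# pathlib: compose paths with '/', make them absolute with resolve()
data_dir = Path(__file__).resolve().parent / 'data'

# os.path equivalent
data_dir_os = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data')
```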

Your test_df is your filtered dataset.

I meant to change this name to something more informative; thanks for reminding me (and yes, it's the filtered dataset).

Not sure if this is a worthwhile validation, but would you consider looking at the genes that passed your filter and doing a quick sanity check that they correspond to those found to be highly mutated in publications?

What is your takeaway from the mutation plot trend? I can see that the diseases you excluded are clustered by origin, as expected. There are some points that fall below the 50-sample threshold, though; do you know why that is?

Looking at publications would definitely be smart! Created #18 to look into this.

My takeaways (I'll add these to the script too):

Nice bar plot! I assume you are comparing this distribution to some publication, but it might be nice to make a note for your future self.

It's mostly just a general sanity check: mutation frequency distributions in many cancer types have been described in previous publications as "heavy tailed" or "hockey stick-shaped" (a few very common mutations and lots of less common ones). We see that here too for many cancer types, which makes sense.
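A sketch of that kind of plot, assuming a binary mutation_df (samples x genes) already subset to one cancer type; the names are illustrative:

```python
import matplotlib.pyplot as plt

# fraction of samples mutated, per gene, for one cancer type;
# sorting descending makes the "hockey stick" shape visible
freqs = mutation_df.mean().sort_values(ascending=False).head(50)
freqs.plot.bar(figsize=(10, 3))
plt.ylabel('fraction of samples mutated')
plt.tight_layout()
plt.show()
```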

Looks like you need fewer than 20 PCs to explain more than 0.99 of the variance in the rnaseq data. Do you think this is the same number of PCs you'd need if you looked at different cancer types individually? Just me being curious; I wonder if some have more complex structure than others. Do you know which genes are highly weighted in the components that explain the variance? Perhaps that is another avenue for distinguishing cancer types.

Good question! I haven't looked at cancer types individually. Created #19 to remind me about this.
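A minimal sketch of that per-cancer-type check, assuming rnaseq_df is a samples x genes matrix and sample_info_df has a cancer_type column with a matching sample index (both names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def n_pcs_for_variance(X, threshold=0.99):
    """Number of PCs needed to explain `threshold` of the variance."""
    pca = PCA().fit(X)
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    return int(np.argmax(cum_var >= threshold)) + 1

# compare cancer types individually (assumes shared sample indices)
for cancer_type, samples in sample_info_df.groupby('cancer_type').groups.items():
    print(cancer_type, n_pcs_for_variance(rnaseq_df.loc[samples]))
```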

A little hard to tell because we're looking at data compressed onto 2 axes, but 100 PCs looks like it might give slightly better separation.

Yeah, it's hard to tell. The plot does look a bit different than for higher numbers of genes, though, which tells me that at least some of the signal is being removed or altered at that point.

I understand the logic of using the MAD genes. This reminds me of my generic-expression-pattern repo, where we find that some genes are commonly DE (differentially expressed). We haven't looked into the reason, but I wonder if these MAD genes are common across all these cancers because they're hyper-responsive. I'm not sure if you're planning to use this moving forward, but if you are trying to find a signature that distinguishes between cancer types, you could try using a supervised learning approach on your PCA-compressed gene expression data?

Yeah, this is definitely something we're thinking about. It's a very similar idea to what Greg did in his BioBombe paper, but the cross-validation he did was different (his train/test sets were both stratified by cancer type, but we're holding out single cancer types and comparing performance). It would be interesting to try his approach here as well.
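For reference, the hold-out-one-cancer-type scheme maps naturally onto scikit-learn's LeaveOneGroupOut. This is just a sketch assuming X, y, and a per-sample cancer_types array, not the repo's actual evaluation code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# hold out one cancer type at a time, train on all the others
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=cancer_types):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds = clf.predict_proba(X[test_idx])[:, 1]
    print(roc_auc_score(y[test_idx], preds))
```

(This assumes each held-out cancer type contains both mutated and non-mutated samples, which is part of what the 5%/50-sample filter above helps guarantee.)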

Generally, his experiments showed that using raw gene expression data performed better for the mutation prediction problem than any method of compressing the expression data (e.g. PCA, NMF, VAE), but we're not sure if that would be the case here as well.