greenelab / core-accessory-interactome

Investigating the functional relationship between P. aeruginosa core and accessory genes.
BSD 3-Clause "New" or "Revised" License

Create expression compendia based on mapping rates #19

Closed ajlee21 closed 3 years ago

ajlee21 commented 3 years ago

This PR adds a new directory to process data for use in downstream analysis. Specifically:

  1. 0_decide_thresholds.ipynb plots the distribution of mapping rates to determine which thresholds to use for binning samples.
  2. 1_create_compendia.ipynb creates compendia based on mapping rates.
  3. 2_validate_compendia.ipynb visualizes the binned data to confirm that our method works as expected.

The plots show the median expression of PAO1-only genes (PAO1 accessory genes) on the x-axis and the median expression of PA14-only genes (PA14 accessory genes) on the y-axis. Each point is a sample.
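As a rough illustration of how each point in these plots could be computed, here is a minimal sketch in plain Python. The input layout (one dict of gene values per sample), the function name, and the gene IDs are all hypothetical; the notebooks in this PR may organize the data differently.

```python
from statistics import median


def median_accessory_expression(expression, pao1_only_genes, pa14_only_genes):
    """Return {sample_id: (x, y)} for the validation scatter plot.

    x is the median expression of the PAO1-only genes in that sample,
    y is the median expression of the PA14-only genes.
    `expression` is assumed to map sample_id -> {gene_id: expression value}.
    """
    coords = {}
    for sample_id, gene_values in expression.items():
        x = median(gene_values[g] for g in pao1_only_genes)
        y = median(gene_values[g] for g in pa14_only_genes)
        coords[sample_id] = (x, y)
    return coords


# Toy data with made-up gene IDs and values, for illustration only.
toy_expression = {
    "sample_A": {"pao1_g1": 10, "pao1_g2": 20, "pao1_g3": 30, "pa14_g1": 0, "pa14_g2": 2},
    "sample_B": {"pao1_g1": 1, "pao1_g2": 0, "pao1_g3": 2, "pa14_g1": 40, "pa14_g2": 60},
}
coords = median_accessory_expression(
    toy_expression,
    pao1_only_genes=["pao1_g1", "pao1_g2", "pao1_g3"],
    pa14_only_genes=["pa14_g1", "pa14_g2"],
)
```

In this toy case `sample_A` lands near the x-axis (PAO1-like) and `sample_B` near the y-axis (PA14-like).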

Here is the plot using all the samples: image

If we binned our samples accurately, then we expect samples within our binned PAO1 compendium to align along the PAO1-only axis. Similarly, we expect samples within our binned PA14 compendium to align along the PA14-only axis.

Here is the result using a mapping rate threshold of 30% and a difference in mapping rate of 2%: image

image

If we increase the mapping rate threshold and the difference-in-mapping-rate threshold, we start to get a clearer separation between PAO1 and PA14 samples, but the compendia become much smaller.
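To make the two thresholds concrete, here is a minimal sketch of one way the binning rule could be written. The exact rule is an assumption: a sample is assigned to a strain when its mapping rate to that strain's reference is at least `min_rate` and exceeds its mapping rate to the other reference by at least `min_diff`. The function name and the toy rates are hypothetical, not taken from 1_create_compendia.ipynb.

```python
def bin_by_mapping_rate(rates, min_rate=30.0, min_diff=2.0):
    """Bin samples into PAO1, PA14, or unassigned.

    `rates` is assumed to map sample_id -> (pao1_rate, pa14_rate), in percent.
    The rule below is an assumed reading of the thresholds described above.
    """
    bins = {"PAO1": [], "PA14": [], "unassigned": []}
    for sample_id, (pao1_rate, pa14_rate) in rates.items():
        if pao1_rate >= min_rate and pao1_rate - pa14_rate >= min_diff:
            bins["PAO1"].append(sample_id)
        elif pa14_rate >= min_rate and pa14_rate - pao1_rate >= min_diff:
            bins["PA14"].append(sample_id)
        else:
            bins["unassigned"].append(sample_id)
    return bins


# Made-up mapping rates (percent), for illustration only.
toy_rates = {
    "s1": (80.0, 60.0),
    "s2": (55.0, 70.0),
    "s3": (40.0, 39.5),
    "s4": (20.0, 5.0),
}
loose = bin_by_mapping_rate(toy_rates)  # 30% / 2% thresholds
strict = bin_by_mapping_rate(toy_rates, min_rate=75.0, min_diff=10.0)
```

With the toy rates, the stricter call keeps only `s1` in the PAO1 bin and leaves the PA14 bin empty, which mirrors the trade-off between cleaner separation and compendium size.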

One alternative solution would be to use the median accessory expression to bin samples instead.

Note: Why is the number of PA14 samples so much lower than the number of PAO1 samples? I would have thought that the mapping rates using the PA14 reference would be skewed toward 0, but the distributions of mapping rates are similar between PAO1 and PA14.

ajlee21 commented 3 years ago

Looks like there are fewer PA14 samples with high PA14 mapping rates, which explains why we see such a reduced number of PA14 binned samples. We may need to use different thresholds for PAO1 and PA14.

image

Looks like the misclassified samples have fairly high mapping rates to both the PAO1 and PA14 references. They are not just clustered around the thresholds we set. image

image

Looks like mapping rate may not be the best method to bin samples. Let's try using median accessory expression instead.
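A minimal sketch of what binning on median accessory expression might look like is below. Everything here is an assumption: the rule (high median expression of one strain's accessory genes and low median expression of the other's), the function name, and the cutoff values, which are placeholders that would need to be chosen from the data.

```python
def bin_by_accessory_expression(medians, *, high, low):
    """Label each sample PAO1, PA14, or unassigned.

    `medians` is assumed to map sample_id -> (pao1_only_median, pa14_only_median).
    `high` and `low` are placeholder cutoffs; this PR does not specify values.
    """
    labels = {}
    for sample_id, (pao1_only, pa14_only) in medians.items():
        if pao1_only >= high and pa14_only <= low:
            labels[sample_id] = "PAO1"
        elif pa14_only >= high and pao1_only <= low:
            labels[sample_id] = "PA14"
        else:
            labels[sample_id] = "unassigned"
    return labels


# Made-up median expression values, for illustration only.
toy_medians = {
    "s1": (80.0, 2.0),
    "s2": (3.0, 75.0),
    "s3": (40.0, 35.0),
}
labels = bin_by_accessory_expression(toy_medians, high=50.0, low=10.0)
```

Samples that express both accessory gene sets at intermediate levels, like `s3` here, stay unassigned rather than being forced into a compendium.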