malariagen / ag1000g-phase3-data-paper

Other
1 stars 2 forks source link

q13PCA figure: Includes creation of population groups. #37

Closed hardingnj closed 3 years ago

hardingnj commented 3 years ago

Just moved some of Lee's code into this repo

review-notebook-app[bot] commented 3 years ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

leehart commented 3 years ago

I'm in the process of combining the three identical PCA-data-generating notebooks from https://github.com/malariagen/vector-ops/pull/1255 into one notebook, and adapting that to use the allele counts generated by https://github.com/malariagen/ag1000g-phase3-data-paper/pull/34 -- rather than generating the allele counts on the fly.

Those three notebooks currently only contain rudimentary PCA plots, just to check the data before storing it, and there is one other notebook that started to experiment with triangles on maps.

I just want to check that's not duplicating any work in this PR. This PR is only concerned with plotting, after the PCA data has been generated, right?

hardingnj commented 3 years ago

Yes- correct- I just wanted to play around with the colouring

hardingnj commented 3 years ago

Made some changes. Essentially I think it's potentially a huge time sink to get the triangle properly working. As a first pass I have implemented a PCA based approach that puts the 2D lat/lon onto a single plane, which I represent using the matplotlib jet colormap.

Let's run the PCA with this colouring, then decide whether the investment is worth it to go deeper.

hardingnj commented 3 years ago

Just wanted to surface this: https://storage.googleapis.com/jhub/pca_plots.html

The colours are defined by taking a decomposition of latitude/longitude, and I think it does a reasonable job of assigning colours to different goegraphical regions. I think it could be improved by a more granular scheme, but initial thoughts?

leehart commented 3 years ago

Just wanted to surface this: https://storage.googleapis.com/jhub/pca_plots.html

The colours are defined by taking a decomposition of latitude/longitude, and I think it does a reasonable job of assigning colours to different goegraphical regions. I think it could be improved by a more granular scheme, but initial thoughts?

Interesting (thought provoking) ! I think I'd want to see a colour reference/legend or something, to relate colours with corresponding geographical location/orientation. Or is it not as simple as east versus west, north versus south - the decomposition doesn't translate?

btw, the (0.00%) looks like an old bug where the explained variances weren't wired up properly.

First time I've noticed the jhub bucket.

hardingnj commented 3 years ago

I think I'd want to see a colour reference/legend or something, to relate colours with corresponding geographical location/orientation. Or is it not as simple as east versus west, north versus south - the decomposition doesn't translate?

Yes- that's the idea. I was thinking a map with colours at each collection location and surrounding region. It's broadly EAST-WEST with a small component of NORTH-SOUTH. If you plot the decomposition it's broadly a line from 9.30pm to 3.30pm if that makes sense.

btw, the (0.00%) looks like an old bug where the explained variances weren't wired up properly.

Just not implemented- was doing late-ish on Friday... wanted to get somethin to show.

First time I've noticed the jhub bucket.

Not sure if this is an appropriate use for it!

Anyway- we can discuss more at 3pm

hardingnj commented 3 years ago

Hi @leehart in this PR is a start at defining some populations in content/population_definitions.yml.

Subject to change, but should be a reasonable starting point for now

leehart commented 3 years ago

Great, thanks @hardingnj

leehart commented 3 years ago
hardingnj commented 3 years ago
* Naming the location_colours "population_colours" might cause confusion downstream, if locations aren't always synonymous with populations, and the source file is still named "location_colours.yaml". What was the reason?

This is a good point. Don't always assume I have a good reason for doing things!

* I can't see how `population_colours` is a property/attribute of the `release_data` class? It seems very separate to me.

I wanted to avoid stuff floating around in the global namespace. I think it pertains to the release data, so makes sense to include it there. It doesn't make sense outside this context?

* Isn't `population_colours` more like a class variable than an instance variable? Since we don't expect/want it to change between instances, and it's not getting declared in `__init__`. Makes sense to make it read-only (with no setter or deleter) though.

I think that's what @property does? I'm not too hot on decorators, but it should be read-only.

leehart commented 3 years ago

@hardingnj wrt my 3rd query, yes it is read-only, but...

Consider a mymodule.py like this:

my_global_var = 'default_global_var_value'

class MyClass:

  __my_internal_class_var = 'default_class_var_value'
  writable_class_var = 'default_writable_class_var_val'

  def __init__(self):
    self.__my_internal_instance_var = 'default_instance_var_value'
    self.__my_internal_instance_var2 = 'default_instance_var2_value'

  @property
  def my_instance_var(self):
    return self.__my_internal_instance_var

  @property
  def my_instance_var2(self):
    return self.__my_internal_instance_var2

  @my_instance_var2.setter
  def my_instance_var2(self, value):
    # maybe some logic here
    self.__my_internal_instance_var2 = value

  @property
  def my_class_var(self):
    return self.__class__.__my_internal_class_var

  @property
  def location_colours(self):
    return 'location_colours_value'

^ In which case my_instance_var, my_class_var and location_colours are read-only because they don't have any setters (or deleters), unlike my_instance_var2, which can be changed via its setter. As you probably know, the user will see the AttributeError: can't set attribute error if they try to set a @property that doesn't have a setter.

Even trying to hack the internal variables won't work. (Except with the writable_class_var example.) For example:

import mymodule
my_instance = mymodule.MyClass()
my_instance.__my_internal_class_var = 'new_class_var_value'
my_instance.__my_internal_instance_var = 'new_instance_var_value'
my_instance.writable_class_var = 'new_value'
print(my_instance.my_class_var)
print(my_instance.my_instance_var)
print(my_instance.writable_class_var)
default_class_var_value
default_instance_var_value
new_value

The location_colours are not being initialized in the __init__ constructor, but then that step anticipates instance variables who's values might change, which is used to set those variables to defaults or instantiation parameters, etc.

This lends itself to my question about this location_colours constant being implemented as an instance variable, instead of a class variable, because we don't expect/want it to change. Although it can't be accidentally changed in the current implementation as an instance variable, it didn't seem like a natural fit to me, probably because of how I'm conceptualizing classes versus instances.

The way the ag3 module is coded seems a bit odd to me on this. Maybe because it's confusing class variables and instance variables? (Or maybe I'm just a bit confused by this coding pattern. For example:

import ag3
my_instance = ag3.release_data()
my_instance.release_dir = 'foobar'
my_instance._all_sample_sets = 'bazbar'
print(my_instance.all_sample_sets)
print(my_instance.all_wild_sample_sets)
bazbar
['b', 'a', 'z', 'b', 'a', 'r']
hardingnj commented 3 years ago

Thanks Lee. Those comments very useful. I think you have addressed all this in your PR that was just merged #45 .

I'll remove my commits to ag3.py from this branch, and change code to work with the new version

hardingnj commented 3 years ago

Ready for review

leehart commented 3 years ago

Thanks Lee. Those comments very useful. I think you have addressed all this in your PR that was just merged #45 .

To be clear, merged #45 happened to remedy my question of location_colours in ag3.py, which related to this PR. But it didn't address all of the behaviour in ag3.py that I mentioned at the end of my comment, e.g. release_dir.

I can make another PR that tries to address those more tangential questions about ag3.py too, if that helps? Although there may be something that I'm missing.

hardingnj commented 3 years ago

Switched to WIP while I make changes to population names.

hardingnj commented 3 years ago

Ready for review. @jonbrenas these population definitions can be considered usable!

hardingnj commented 3 years ago

I'll keep this open until PR #47 is merged- then I'll update the colours

jonbrenas commented 3 years ago

It looks good to me. I am not sure I quite followed all the comments. Are there some populations that come from more than 1 sample set?

hardingnj commented 3 years ago

Thanks Jon.

It looks good to me. I am not sure I quite followed all the comments. Are there some populations that come from more than 1 sample set?

No, but I needed to account for the possibility that they could.

jonbrenas commented 3 years ago

No, but I needed to account for the possibility that they could.

I agree. The reason for my question mostly was that I use intake to access the various data sets and I thus wanted to check if I could open them 1 by 1 (which I think is easier with intake) or if I needed to merge all datasets into own before being able to use it properly. As you say, it might be worth building the aggregate one just to be more future proof.

hardingnj commented 3 years ago

superseded by #49