morganjwilliams / pyrolite

A set of tools for getting the most from your geochemical data.
https://pyrolite.readthedocs.io
Other
132 stars 36 forks source link

Making custom (ternary) `util.classification` #49

Closed mtb-za closed 3 years ago

mtb-za commented 3 years ago

Is your feature request related to a problem? Please describe. Is it possible to create a ternary classification (such as the USDA's sand/silt/clay, or QAP(F) one)? Looking at the source code, it seems to only have the x and y axes. Being able to add such classifier polygons to a plot would be quite nice for visualisation, as well as allowing for 'easy' classification of data that is often on a ternary plot.

Describe alternatives you've considered This might be possible using some of the sklearn bits, but the current TAS example is very nice and clean, and it would be nice to be able to do something similar.

I started with the following to create the USDA soil classification scheme (at the bottom of this page), modelled on the tas .json file.

usda_poly = {"name":"USDA",
 "axes":{"x":"Clay", "y":"Sand", "z":"Silt"},
 "fields":
     # just doing the one field as a test.
     {'clay':{"name":'clay'},"poly":[[100,0,0], [55,45,0], [40,45,15], [40,20,40], [60,0,40]]}
}

This should be called using something like usda = PolygonClassifier(name=usda_poly['name'], axes=usda_poly['axes'], fields=usda_poly['fields']['clay']), from my understanding of the PolygonClassifier constructor, but obviously this does not work, because I have three axes, not two.

This would possibly link quite nicely with #26?

morganjwilliams commented 3 years ago

Hey @mtb-za! I think this would be a good feature to add, it's currently not possible off-the-shelf based on what's there at the moment.

There's a little bit of complexity in it, as the current implementation is based on some matplotlib 2D-only components (as you note). To make this possible, there are a few options:

1) re-implement matplotlib.patches.Polygon and make a Polygon (or even TernaryPolygon) which is able to be added to a ternary axis. Polygons haven't been implemented explicitly for mpltern (which pyrolite uses for ternary plots), but there is the ability to 'fill within point bounds'. If we did implement something, I'd suspect Yuji (the author of mpltern) would b interested in integrating it there. 2) do some internal ternary-to-xy conversions and use 2D polygons which are anchored in axes coordinates rather than data coordinates. This might limit the use of some of the available mpltern features (axes re-limiting, for example). 3) sacrifice some generalisability and added functionality which comes with matplotlib out of the box, and instead just plot some ternary lines (maybe together with a ternary fill, after dealing with formatting/configuration separately) - in which case it's relatively straightforward. As this would be in data coordinates, I think the relimiting etc should still work as expected.

My preference would generally be for the first, but I'll need to look into what dots need to be connected in order to make it work. If it's going to be fairly involved, we could implement the last idea in the interim.

morganjwilliams commented 3 years ago

Hey @mtb-za! I've added an implementation for these ternary classification diagrams and added the USDA soil texture classification as an example. There's an issue with the docs builds at the moment but will also look to add an example diagram there for this, and think about adding a few more (e.g. QAP, potentially with a second upside-down axis for AFP..).

For the moment this uses the second solution above; there will likely be a few limitations but it's working for now and allows some generalization of these diagrams regardless of whether they're x-y or ternary.

mtb-za commented 3 years ago

Sounds really promising! I am happy to add a couple of standard examples (QAP, AFP, maybe the USDA soil plot although that is not geochemical so let me know if you would rather not have it) to the docs once I know how the basics works. Is it pretty much the same process as for the 2D case?

morganjwilliams commented 3 years ago

I added the USDA soil one as an example on develop - you can check out the config file here or the classifier here (note how it's much less work than TAS!), but it's basically the same as the 2D case with a few modifications to axes labels etc.

As for the docs - there's a third party package issue for the moment; it might take some debugging to sort out (or it might sort out when the next version of that package is released..).

morganjwilliams commented 3 years ago

@mtb-za I've addressed the build problem with the docs for now; let me know if you'd like to add some classifiers and/or examples before I release v0.3.1.

mtb-za commented 3 years ago

Just noticed that you already implemented the USDA one. 😅

It looks like it is much less work because you are not adding the fields to the plot? The add_to_axes() method is the bulk of the TAS code.

Thanks for this though, I am sure that it will be very useful.

I think it is worth adding the QAP ones (both for volcanic and plutonic rocks). I will look at doing a PR with those tomorrow.

mtb-za commented 3 years ago

Been working on this, and I have a possible concern that should not hold up getting the functionality released if you are close, because it might be a pain to fix properly.

The QAP diagram has both the left and right sides of the triangle going up to 100 at the top. This is not a standard ternary plot though.

cf, amongst many examples: and

This can be compared to the one that I have started creating: Figure_1

I think that this is different enough that it will not actually plot things correctly given how I am currently making the boundaries, unless I have missed an option to create a base plot like this?

morganjwilliams commented 3 years ago

Hey @mtb-za! The notation for the QAP plot is nonstandard and likely arranged this way to some degree because of the common QAPF format, but doesn't affect the actual mineralogical compositions - so overall it should be fine. The main thing to watch out for is that the lines projecting from the quartz axis represent plag/alkali-feldspar ratios, and aren't constant proportions of either - so the line on the left from (0, 0.9, 0.1) to quartz at (1.0, 0, 0) represents A/P = 9, and passes through (0.6, 0.36, 0.04) rather than (0.6, 0.3, 0.1). The only other issue is that the left axis is alkali feldspar rather than Anorthosite. Hope that makes sense, let me know if you run into any other issues! The past week has been pretty busy so the release is a little while off yet.

mtb-za commented 3 years ago

The only other issue is that the left axis is alkali feldspar rather than Anorthosite.

Whoops! Not sure how I did that....

I am taking a look at the other stuff now. Certainly more complex than I anticipated, but I think that I have the hang of it now.

Do you have a preference for how to handle the granite/syenogranite/monzogranite field? I assume (but have yet to test) that fields can not overlap, but syenogranite and monzogranite combine to be the granite area.

morganjwilliams commented 3 years ago

Thanks for starting a PR - I'll add some comments over there.

For the granite/syenogranite/monzogranite field, you're correct that it won't work as expected if you have them all individually (I think if granite is listed last, nothing would get classified as granite..). There's a few options for dealing with it - from only including the more specific terms through to doing something like TAS with two sets of labels or having a similar 'specificity' toggle (with each polygon potentially having multiple labels). These solutions simply swap out labels, not polygons - if for some reason you wanted to use different sets of polygons depending on the situation then it'd be a little more complicated - likely involving adding tags to polygons and modifying how the diagram is constructed based on these tags; this would ensure the visualization side of things was flexible and accurate. From the classification side of things, I don't see significant downsides to just having the more specific labels - you could just postprocess the classification to aggregate these two, with e.g.:

df.loc[df.QAP.isin(["syenogranite", "monzogranite"]), "QAP"] = "granite"