Initial draft for gallery of Seaborn Figure level to CanD axes level plots

jankaWIS commented 3 years ago

Hi @mwshinn,

as we discussed in this issue, here is the first draft of Seaborn's figure-level plots done in CanD (and I left the matplotlib there as well). I have a few comments/issues. I will try to work on some other gallery plots if time permits.

Comments:

1) The CanD plots look different -- the edges are bold and the colours are darker or deeper. Any idea why? 2) How to do sharey or sharex in CanD? Relevant for some graphs (eg. 10) 3) is this format ok? I have looked at your docs and so far you have been using rst.

jankaWIS commented 3 years ago

I managed to also add the replot part. That brings me to

CanD is also changing the legend font and the scatterplot point sizes, why?
Would you know a better way how to go around FacetGrid? Namely look at sol for problem 3 and 4 in the replot section. For matplotlib it's ok but for CanD one has to specify the points and it's becoming less nice and less simple.

mwshinn commented 3 years ago

This is looking awesome!

Responding to your questions:

The reason they look different in CanD vs pure matplotlib is likely due to the fact that the backend code for displaying in jupyter operates slightly differently. If you save both to .png or .pdf files outside jupyter, they should look identical.
To share the x and y axes, see ax.sharex() and ax.sharey()
It is possible to export from jupyter to .rst, and I think there is a way to directly include it in the documentation build process. I can take care of those.
The label for the legend looks like a bug in CanD so I will fix that.
Take a look at c.add_grid()

A few more things:

For the motif:

c.add_axis("axname", ...) ax = c.ax("axname")

You can instead do:

ax = c.add_axis("axname", ...)
I don't think it is useful in this context to have the pure matplotlib version (without CanD or seaborn), so those parts can be removed.
As of matplotlib 2.0, the default color is the same in seaborn and matplotlib, so you don't need to explicitly specify the seaborn colors. You can access it by using 'C0' in place of the tuple of floats in the current PR. See: https://matplotlib.org/stable/users/dflt_style_changes.html
Please remove .gitignore and doc/gallery/.ipynb_checkpoints/seaborn_fig2ax-checkpoint.ipynb from the PR. You can do this with:

git rm .gitignore git rm doc/gallery/.ipynb_checkpoints/seaborn_fig2ax-checkpoint.ipynb

When you make a new commit, you can do "git add [name of file]" and then "git -m '[description]'" to make sure the wrong files don't get added. Alternatively, when you make a change for a file that is already in the repo, you can do "git add -u" to add all changed files but no new ones.

Thanks for your work on this!

jankaWIS commented 3 years ago

Thank :), I'll work on the changes you proposed and update the file once it's done. I have now a short question -- why do you want me to remove the gitignore? Is there some reason why not to have it? I'm still relatively new to git(hub), mostly new to any sort of collaboration/contribution so I do not know the "good practise" if this is part of it.

And one more question to 2) sharey, sharex. If I do a for loop, usually I do not have/save the axes separately hence I cannot do a simple ax1.sharex(ax2). I can go around with the following:

c.axes["Dinner"].sharey(c.axes["Lunch"])

and alternatively then in a for loop I can write some if else to take care of good sharing (if there are more cols/rows). Do you prefer this or do you have a better idea?

mwshinn commented 3 years ago

Sounds great!

For question 1, .gitignore is to avoid git adding unneeded files when you automatically add lots of files, or when you list the status. For a project of this size, it isn't really needed, since there should never be mass additions to the repo or lots of temporary files. I may add one in the future, but this is a bigger decision related to core maintenance of the project. Additionally, these decisions shouldn't be coupled into unrelated PRs. You are welcome to use your own .gitignore file locally, but just don't add it to your commit.

For question 2, it is better to do:

c.ax("Dinner").sharey(c.ax("Lunch"))

jankaWIS commented 3 years ago

Will do/done. Thanks.

I have been now playing with the sharing. It has a problem with sharing an axis more than one so chaining the previous won't work. I have made the following workaround:

    if i>0:
        # sharey, take the before last axis and sharey with the last plotted one, cannot share 2 axis
        c.ax(a[i]).sharey(c.ax(a[i-1])) 
        # need to rescale all axis
        c.ax(a[i]).autoscale()

        # remove ticks from shared y axes
        plt.setp(ax.get_yticklabels(), visible=False)
        # remove ylabel
        ax.set_ylabel('')

and I think it does the job but at the same time, I'm not 100 % sure that it won't break eventually, that the axis will have correct labels/ticks and scale. Could you please check?

mwshinn commented 3 years ago

I have never done axis sharing myself, but regarding sharing it more than once, did you try putting the base axis (say ax0) in both positions, i.e., both:

ax0.sharey(ax1) ax0.sharey(ax2) ...

and

ax1.sharey(ax0) ax2.sharey(ax0) ...

I think one of these should work but not the other. If one of these work I think that is the more robust solution.

You can always check any of the solutions by plotting an extra point on one of them and seeing if it updates the others too.

jankaWIS commented 3 years ago

You are right, this c.ax(a[i]).sharey(c.ax(a[0])) works and the other way c.ax(a[0]).sharey(c.ax(a[i])) gives the ValueError: y-axis is already shared. How did you know? Why one should and the other should not work?

mwshinn commented 3 years ago

It sounded like, since you can't share an axis that is already shared, the function inserted some kind of link in one axis to keep it synced with another one. It makes sense to be able to set multiple things to the same source, but not multiple sources to the same thing.

jankaWIS commented 3 years ago

Hi @mwshinn, I have added one plot for Facetgrid, see plot 3, using the add_grid from CanD and it is a bit ugly. Both the code and the fact that resizing didn't work well or nicely. Or I cannot figure it out. Would you have a trick how to do that nicely?

I also though about Seaborn's pairplot and I think one would need a function for this because the customisation is really hard.

mwshinn commented 3 years ago

Looks good! A couple things you can do to improve the code in section 3:

Rather than adding a "|" between elements, you can make a tuple in your list comprehension, e.g. [(a, b) for ...] instead of [a+"|"+b for ...]. Or even better, "list(product(b,a))" should do the trick.
Then, you can do "for i,(co,ro) in enumrate(conditions)" instead of the current for loop
To hide tick labels, "ax.set_xticklabels([])"
For selecting data from the pandas dataframe, tips.query() might be cleaner.

What do you mean that resizing doesn't work nicely? It looks good to me. If it is easier, you can auto-size axes in add_grid with "spacing=" instead of "size=".

I actually don't think pairplot would be as hard as you say. If you loop through i and j indices, you can plot a histogram if they are equal and a scatterplot if they are not.

jankaWIS commented 3 years ago

That was my original thought but then it starts breaking the other parts (or making them complicated), eg: c.add_grid(conditions, 2, Point(0.5, 0.5, "in"), Point((2*len(a)), 3.8, "in"), spacing=Vector(.5, .5, "in"), unitname="grid") or ax = c.axes[name] or sharing the axes c.ax(conditions[i]).sharey(c.ax(conditions[0])) becomes much uglier since the tuple does not work and then one has to somehow regenerate the conditions on fly.

Regarding tips.query(), I'm not sure it's cleaner since a) one has to deal with variables in the string/query, and b) I think people are in general used to loc rather than query. But if you prefer, I can rewrite it.

Oh, I missed the spacing key, yes, that's what I had in mind :). Thanks!

Well, that's just half of the job. The strength comes with the fact that one can map anything to anywhere, not only having hist and kde on the diagonal but doing things like:

g = sns.pairplot(penguins, diag_kind="kde")
g.map_lower(sns.kdeplot, levels=4, color=".2")

or mapping correlation values or any other function to it (this is PairGrid but that is basically the same).

mwshinn commented 3 years ago

Why doesn't the tuple work in the first case? Do you say it gets complicated because you need a string for the axis name? if you do list(product(b,a)), you can do c.add_grid([str(c) for c in conditions], ...). This makes things a lot simpler. As it stands now, you are creating your own homemade tuple! Or, if you don't want to name your axes like that, do c.add_grid([f"{a}|{b}" for a,b in conditions]). Or even simpler, my preference would be to just name your axes 0, 1, 2, ... based on the position of the tuple in the list.

For query, currently you have

tips.loc[(tips[row]==ro) & (tips[col]==co)]

vs

tips.query("@row == @ro and @col == @co")

I think the latter is much more readable.

Aah, yes, you're right. Perhaps implementing one basic case of it would be good enough, since people could then make the modifications as needed?

jankaWIS commented 2 years ago

Sorry, it took so long, too much work...

I fixed most of the things which you recommended, only the query I have not yet updated. There are now 3 figures of how to do what we discussed. I kept the two for illustration but also for one more issue. I actually have not yet figured out (no time :/) why it's happening but it you do what you suggest:

To hide tick labels, "ax.set_xticklabels([])"

then it removes all the ticks, not just the ones one updates in the if, unlike my original solution. You can see it in the 3 figures for plot 3.

I also finished the lmplots meaning we should now cover many of the examples from seaborn. Then we have to deal with the catplots.

Suggestions and comments are welcomed :). Also, the bug with the updated title-size is still there which I guess is related to another issue -- size of points in legend is not affected by seaborn's s parameter.

mwshinn commented 2 years ago

Thanks a lot for your continued work on this! Looking great!

Regarding the three, I don't think it's necessary to include the "Another way" part underneath "Preferred way". They only differ in how they are enumerating the conditions, which is more of a python-related difference than a CanD-related difference.

Aah, yes, you're right about the axis labels. It looks like this is because of the shared axes and your way is the recommended way of doing this. You can remove the "original way" too since my suggestion didn't work.

What do you mean by the following? "This is just playing with the same graph, one can do it with standard methods" I'm not sure I follow what the point of the last plot is.

Yeah, the title size thing isn't fixed yet. It isn't a quick fix so I need to find the time to sort it out.

jankaWIS commented 2 years ago

Thanks :).

Yes, I will remove them before the final version comes out, it was there mainly for you to see and maybe choose if you prefer another way in the examples than what I chose. There are still some other older matplotlib graphs too which I'm also gonna remove later.

What I mean is that it's basically the same as the 8th figure, the differences are in:

g = (g.set_axis_labels("Total bill (US Dollars)", "Tip")
      .set(xlim=(0, 60), ylim=(0, 12),
           xticks=[10, 30, 50], yticks=[2, 6, 10])
      .fig.subplots_adjust(wspace=.02))

which is just setting limits and labels or spacing and one can do it easily with the previous code, there wasn't anything new so I didn't create one. I can do it thou, not a big problem.

No worries, just I didn't want to forget it :).

I also by mistake created those Regplots, there's nothing special in them, they're not figure level plots but I realised only when I came to the end that there is nothing special about them and didn't want to delete them all. But I wanted to ask what would still be good to create. I'm planning to create the catplots but is there anything else you'd want? The list is here and I know we were discussing above the grid plots (pairplot, facetgrid, jointplot), have you made any decisions regarding them?

mwshinn commented 2 years ago

Oh, I see! Okay, yeah, we probably don't need that in the final version since it is more about seaborn than CanD.

mwshinn commented 2 years ago

Oops, clicked the wrong button while responding

mwshinn commented 2 years ago

I think, like you mention, the catplots would be great. I think a joint plot would be really nice in showing off what CanD is capable of, especially since people can easily adapt the different pieces of it, unlike in seaborn. I would prioritize that over a pairplot, since a pairplot seems to me more like an exploratory data analysis tool than something that would actually go in a paper. (Especially with how difficult it would be to support all of the functions.)

In practice, though, I would be grateful for any of these that you end up implementing!

jankaWIS commented 2 years ago

Catplot done.

Can I have a technical question? How do you orient yourself on the canvas? I find it really hard to figure out the coordinates and I basically do it by trial and error. But I guess there must be a much simpler way to do that.

jankaWIS commented 2 years ago

And I added jointplot too. Not fully done because there are some things I wanted to discuss and some I didn't have time for. I created a helper function for plotting the marginal distributions and left it quite open. There are many ways how that could be developed but I wanted to know your opinion first about this one. Should it have more parameters? Should it mimic seaborn closely or just provide a way how to solve this problem?

Also, for the hexbin (number 6), I was not yet able to find out how the colour is done and if it's a seaborn function. Because it looks like it is but I only found hexbin with jointplot which doesn't help us much. (eg. here

mwshinn commented 2 years ago

The catplots look really good!

I agree it is difficult to orient yourself on the canvas. I usually use the "make_grid" function, which is designed to make this a bit easier. Create a grid over the whole figure with your desired grid spacing, and then remove the grid when you are done laying everything out. However, I would love it if there were a smoother way of orienting on the canvas. I considered making a gui at one point, but realized this would quickly become much more complicated than CanD itself. :P If you have any suggestions for better ways to do this I would be open to new ideas.

I like the helper function you made, I'm certain that people will find this really useful! What kinds of parameters were you thinking about adding to it? If by mimicking seaborn more closely you mean by moving the legend, adjusting tick labels, etc., I would advise going for simplicity and mimicking the most important features, like you have done. If people want to move the legend, they can see your other examples where you do that.

One thing that would be really useful is a textual description of the jointplot helper function, e.g., what does it do, how does it do it, etc.

Looks like this is how seaborn does the colors: https://github.com/mwaskom/seaborn/blob/9425588d3498755abd93960df4ab05ec1a8de3ef/seaborn/axisgrid.py#L2212

jankaWIS commented 2 years ago

Thank you :). Also thanks a lot for the code, I couldn't find it (also a thing which I'm still learning how to do effectively), I updated the colour.

I added docs to the function but it already had some comments in it about what the lines do. How would you imagine it? And yes, it's exactly what you write, that's what I meant -- to mimic seaborn as closely as possible. Because I'd leave it where it's now with exactly what you say -- that the most important is the concept and the way how to create such a thing. Tweaking the details can be done by the user and for each context differently and there is not much added value to doing it here. So I'll leave it this way.

I couldn't find the make_grid function you talked about anywhere, which one is it and how do you use it? Regarding the GUI and canvas positioning -- wow, that would be great :D. But I guess that creating another adobe illustrator isn't exactly the goal here. I think that there could be a few things which would be nice to have. One could be having this grid and frame. It could be given as a param when creating the canvas (like a debug mode) which would draw you the frame of the canvas and label the coordinates (make a square grid) with some light grey lines. I think this would be good regardless and would already help a lot. If you want I can open another issue about this. The other could be more templates and examples, so that one can copy :). If you wanted to design a small GUI, I think that a relatively easy thing would be to allow the user to put some placeholders (eg rectangles) and give them names on a grid and then as an output, you would get a code for CanD specifying the grid -> it would give you the coordinates and locations of the objects, it would give them the names which you chose etc. Again, it's very simple in practice but helps a lot and would make it much more user-friendly.

mwshinn commented 2 years ago

The docstring you added is good! In addition, I think it would be useful to have a paragraph in the notebook (outside the code) describing why we need a function to do this, just to orient the reader who might scroll down to here and not understand why they are looking at a function instead of plotting code. Something like "This is non-trivial to do, can be useful for a lot of things, easy to encapsulate in a function and reuse, all the following plots use it,...".

Oops, sorry, I meant "debug_grid". This function does almost exactly what you describe in your first "nice to have" point. :P I fully agree with you on the "more examples people can copy" idea. :)

The GUI you described sounds simple, but there are some factors which make it more complicated than it seems:

1) Many of the features are computed based on input, e.g. for images, the bounding box depends on the image width and height. So these features would have to be computed by the GUI. 2) Many measurements depend on what is plotted, e.g., coordinate systems based on images, and ones based on data points (the name of an axis with "axis_" prepended to it), do not have a fixed position unless the data are already known and plotted. 3) If you just position objects, you don't know what coordinate system is being used to position them. You should be able to position things relative to each other or relative to the figure so that they adjust properly when you move elements around or resize the figure. E.g., put the text 1 cm to the right of the axis. This is hard to specify in a GUI. 4) If the GUI only supports the most basic feature (position in inches with no relative positions), this will implicitly become "idiomatic CanD", and using only one of CanD's features kind of defeats the purpose of using CanD in the first place. 5) Some plot elements lie outside the bounds associated with those elements, e.g., when an axis is positioned, the axis labels lie outside the given bounds. So users will still have a problem making sure elements don't collide.

There might be some more, but these are the main ones which have stopped me from making a GUI.

jankaWIS commented 2 years ago

Found it, thanks, yes, that's almost as I thought of it :). I would still have a comment on the debug_grid function. It does the job and I saw you can also move it around, which comes in handy. The problem I had with it was that it's still hard to orient oneself on the grid. Maybe if every fifth/tenth line would be a bit thicker/darker? Or adding labels would also help -- like numbers from 0-n based on the lines/spacing.

Hmm, I see, I didn't think of those problems. Some problems could be relatively easy to fix (eg. if you want a position relative to another graph, just add a checkbox or let the user type the name of the axis/plot it should relate to) but for most, I don't have a good answer.

mwshinn commented 2 years ago

It is already possible to do every 5th/10th/n-th grid by calling the command twice, once with a spacing of the small unit and once (with thicker lines passed as an argument) for the larger unit. Numbers are a good idea, maybe I will try some things and see if I can get it to look good.

mwshinn commented 2 years ago

Hi @jankaWIS !

I think this looks just about ready to go to add to the official documentation. I did a test run of what it will look like in the documentation and the results are attached as a png below. I think the following are the remaining things left to be done before it is ready to merge:

Clean up the organisation, e.g., is it a good idea to have the debug grid there?
Make the titles a bit more informative - instead of numbers as titles, perhaps have something which will help people locate the plot they are interested in
Finish the things you've marked TODO in the file
When you are done, rename the file to "seaborn.ipynb" and create the rst version using nbconvert, putting the images in a directory called "jupyter" in the "_static/images" directory. (If you don't know how to do this, I am happy to do it for you.)
It would be great if I could give you some public credit for all the work you've put into making this - could you please add a "created by" type statement to the page in whatever place/phrasing you are most comfortable with? Maybe right below the title?

No rush of course, just wanted to leave this here for whenever you get around to looking at it again.

Thanks!

https://user-images.githubusercontent.com/951986/163253854-cd49d83f-f86c-4d97-bb8f-bd9f1e869be3.png

jankaWIS commented 2 years ago

Hi @mwshinn,

yes, I totally agree.

Clean up the organisation, e.g., is it a good idea to have the debug grid there?

That's a good question. I would say no but also I think it's very useless and I was missing it when I was reading the docs (a few months back), so if it's somewhere well-visible, then no. But I think it should be somewhere.

Make the titles a bit more informative - instead of numbers as titles, perhaps have something which will help people locate the plot they are interested in

I don't really know how to do that. Seaborn doesn't have them, it has some descriptions but to be honest, whenever I needed something, I just scrolled until I saw the figure I needed. I never searched the text. The numbers reflect how it goes in seaborn docs and, I agree, aren't very informative. On the other hand, they just make a separation, chapter-like look. I am happy for suggestions.

Finish the things you've marked TODO in the file

This is tricky. Many of them are CanD TODO :D. Namely, the changes of font and graphics in CanD is point 4 in https://github.com/mwshinn/CanD/pull/13#issuecomment-871710633. And I think this is the only todo there. There is some potential todo but that is more costly to solve than to leave it to the user -- it is generalising the code to account for possible divergences (more/less data points).

The last two points I will do. Maybe I'll need some help with the before last point, I will try and let you know. I will go over it once again and see if I cleaned up everything. And btw., there is a watermark with my name at the very end.

Thanks

mwshinn / CanD

Initial draft for gallery of Seaborn Figure level to CanD axes level plots #13