malariagen / ag1000g-phase3-data-paper


Site filter performance test using released data #21

Closed · leehart closed this 3 years ago

leehart commented 3 years ago

Task https://quire.io/w/malariagen-ag3.0-paper/87

review-notebook-app[bot] commented 3 years ago

Check out this pull request on ReviewNB.

leehart commented 3 years ago

It currently takes 6.5 hours to run (no assistance from KubeCluster or Numba).

Means with SD for all crosses and chroms (training sites excluded): [image]

The same, with the training sites and the Ag2 crosses removed, only seems to make a slight difference to the error bars: [image]

The notebook also currently includes an example plot of the metrics {TPR, FPR, J} for crosses {73-2, 78-3} and chroms {2L, X}: [image]
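
(For reference: assuming J here is Youden's J statistic, these metrics reduce to standard confusion-matrix quantities. A minimal sketch with illustrative names, not the notebook's actual code:)

```python
import numpy as np

def site_filter_metrics(filter_pass, truth_good):
    """Compute TPR, FPR and Youden's J for a site filter.

    filter_pass: boolean array, True where the filter passes a site.
    truth_good:  boolean array, True where the cross-based ground
                 truth says the site is good.
    """
    tp = np.sum(filter_pass & truth_good)
    fn = np.sum(~filter_pass & truth_good)
    fp = np.sum(filter_pass & ~truth_good)
    tn = np.sum(~filter_pass & ~truth_good)
    tpr = tp / (tp + fn)  # sensitivity: good sites kept
    fpr = fp / (fp + tn)  # fall-out: bad sites let through
    return tpr, fpr, tpr - fpr  # J = TPR - FPR (Youden's J)
```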

leehart commented 3 years ago

Thanks for the review Nick.

Some of what you're suggesting sounds like coding-style preference to me, with different values traded off. I might personally find this style of procedural code easier to maintain, easier to follow, and easier to reason about, and vice versa. But there is obviously some objective room for improvement, particularly in speed and efficiency, which should be achievable without obscuring the data narrative too much; I'm conscious of that and I'm still looking into it. I took your original code and rewrote it in a way that I could easily follow, mindful that others would need to follow it too, sacrificing those performance elements until I understood them better.

For instance, I would worry that using da.where() to flip the training sites to -1 in the ground-truth arrays would subvert the original semantics of -1, which indicates that a site's good/bad status is undetermined with respect to Mendelian error or male heterozygosity. Although that might be an efficient performance hack, and it might simplify or shorten some code downstream, it might also make the data provenance harder to track and the code narrative harder to follow, and could actually create more complexity and confusion in those areas. Of course, those perceptions are subjective, and a simple code comment to explain would probably suffice!
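
(For context, the suggestion under discussion is roughly the following; a minimal sketch with illustrative array names, not the actual notebook arrays:)

```python
import numpy as np
import dask.array as da

# Illustrative ground-truth array: 1 = good, 0 = bad, -1 = undetermined.
truth = da.from_array(np.array([1, 0, 1, 1, 0], dtype=np.int8), chunks=2)
is_training = da.from_array(np.array([False, True, False, True, False]), chunks=2)

# The da.where hack: overwrite training sites with -1 so they drop out
# wherever undetermined sites are already excluded downstream.
truth_no_training = da.where(is_training, -1, truth)
```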

Having said that, of course I'm trying to see it all from your point of view, and take everything on board, and hopefully I can synthesize something that appeals to both our tastes, both aesthetically and technically, and then we should land on common ground, in a better place. :-)

hardingnj commented 3 years ago

Cheers Lee. Thinking about it, we could even include the training-set mask in the function that generates the ground truth; I think it would fit there.

It is definitely a style thing; I don't think there are any issues with the content of the code. I do feel, though, that it's always better to do all the intermediate steps and processing together so the logical flow is clearer. I find it hard to follow all the dict() objects floating around.
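
Roughly what I have in mind is something like this (a hypothetical sketch with illustrative names, not the notebook's actual function):

```python
import dask.array as da

def make_ground_truth(is_good, is_determined, is_training):
    """Hypothetical sketch: encode 1 = good, 0 = bad, -1 = undetermined,
    and fold the training-site mask in at the same step, so callers
    cannot forget to apply it later."""
    truth = da.where(is_determined, is_good.astype('i1'), -1)
    return da.where(is_training, -1, truth)
```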

leehart commented 3 years ago

I find it hard to follow all the dict() objects floating around.

Too Many Dicts On The Dance Floor - Flight Of The Conchords

leehart commented 3 years ago

@hardingnj I've added a version of the notebook (and its output) that is simply a copy of your vector-ops one with minimal changes: it sources the publicly released data (via the included Python modules); it only compares Ag3 against phase 2 and 100% pass; and it excludes the training data using your da.where suggestion. On a user pod, it runs about 8 minutes faster than my dict-centric-with-intermediates version. I also removed the cluster dependency, just to simplify; let me know if that was important to leave in.

[image]

I expect that's what you wanted, but I still find that approach too cryptic and convoluted. It's funny, because you have a similar impression of my version, which I tried to make more step-by-step and easier to follow! I'm very reluctant to over-engineer this, especially if it costs readability. This is just a little script that fetches some data (it's not as if it's another multinational e-commerce CMS), but I still value quality and craftsmanship, obviously! This seems like an awkward forum for debating software engineering principles.

Again, this raises the question: what is this repo for? If we only need to produce a table and perhaps a plot for the paper, then both of these notebooks can do that, within the same time frame. Both versions also provide examples of how to access the publicly released data.

If we also need to provide examples of well-engineered and highly readable notebooks, to show how the CSVs and SVGs were compiled and to demonstrate best practices, then this is likely to be more subjective, even though we both have the same objective. I would personally lean towards readability and away from over-engineering, but clearly different people have different takes on the same object. If I'm asked to paint the picture, we'll get it one way; if you're asked to paint the picture, we'll get it another way, although we're both trying to be "synergistic". I don't know how important it is to nail this thing, but I expect consensus would help steer us through the middle... but then there's design by committee! (-:

hardingnj commented 3 years ago

Thanks Lee. I appreciate the effort you have put into this.

I also removed the cluster dependency, just to simplify. Let me know if that was important to leave in

I think that's fine. If it's only 45 minutes, that's a reasonable simplification, I think.

On a user pod, it runs about 8 mins faster than my dict-centric-with-intermediates version.

Well done on getting that time down in your version; it was 6 hours I think you said above, almost a 6x speed up!

it excludes the training data using your da.where suggestion

Cool, thanks. That's certainly very welcome, and not a straightforward addition.

I guess the main thing is that this repo is: a) correct, and b) clear.

I still feel that someone coming in cold would be able to follow this version more easily, but I appreciate this is subjective to an extent. I'm happy that we both arrived at the same result, though. I don't think there is necessarily any harm in softening our positions a bit. I recognise that I could be more explicit in my coding, and it's good to challenge that.

leehart commented 3 years ago

Thanks Nick.

I reckon 6.5 hours to 45 minutes is about 8.7 times faster! Numba FTW
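
(For the record, the arithmetic, as a quick check:)

```python
>>> (6.5 * 60) / 45  # 390 minutes divided by the new 45-minute runtime
8.666666666666666
```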

I found your da.where suggestion to be a straightforward addition to your code, but with the aforementioned concern about corrupting semantics. Maybe I did something wrong, although the results seem OK. All the amendments to that notebook seemed quick and straightforward, actually, because I wanted to keep changes to a minimum, to preserve provenance.

Yes, I want to find ways to learn and appreciate the methods and benefits of doing all the intermediate steps together; that seems to be your preference! And I need to see the problems with my own inclinations, using beaucoup de dicts, etc.

I'm still learning Python. I'm not sure the ag3.py and ag2.py modules are properly engineered. The import dance seems a bit clunky for some reason. I tried to keep ag2.py consistent with the existing ag3.py at least.
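
For what it's worth, one common pattern for taming the import dance from a notebook looks like this; a sketch only, assuming ag3.py and ag2.py sit next to the notebooks (which may not match the actual layout):

```python
import sys
from pathlib import Path

# Make the directory containing ag3.py / ag2.py importable from the
# notebook; adjust the path if the notebook lives in a subdirectory.
module_dir = Path.cwd()
if str(module_dir) not in sys.path:
    sys.path.insert(0, str(module_dir))

import ag3  # noqa: E402  (imported after the sys.path tweak)
import ag2  # noqa: E402
```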