Closed CarlinLiao closed 7 months ago
I'm able to review this now that the DNS issues were resolved for oncopathtk.org, so this script can run.
I did systematic refactoring to break down the process and passed data structures into units, and to give the units approximately meaningful names. What remains is such refactoring inside what is now called SubplotGenerator
.
I also added caching so that on re-run one doesn't have to wait a few minutes for a new plot.
@CarlinLiao please check that this still works for you as expected. Also, you can finish the refactoring job (if you take a look it should be obvious what is to be done) if you have time. It should take about 1-2 hours. If not, we can merge as-is.
Although the refactoring looks like a rewrite, that is not really true, I did not implement any new behavior (except that possibly it may be considered new behavior that the plot specs are stored in separate JSON files).
I recognize this error but I don't remember what I usually did to solve it. If I remember correctly, it was something silly like psycopg2 or Postgres being installed. (It is, and my bash and make versions meet requirements.) Running on the server in an SPT conda env.
$ python analysis_replication/gnn_figure/graph_plugin_plots.py -h
Traceback (most recent call last):
File "/home/liaoc2/SPT/analysis_replication/gnn_figure/graph_plugin_plots.py", line 16, in <module>
from pandas import DataFrame
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/__init__.py", line 49, in <module>
from pandas.core.api import (
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/core/api.py", line 47, in <module>
from pandas.core.groupby import (
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
from pandas.core.groupby.generic import (
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/core/groupby/generic.py", line 68, in <module>
from pandas.core.frame import DataFrame
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/core/frame.py", line 149, in <module>
from pandas.core.generic import (
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/core/generic.py", line 193, in <module>
from pandas.core.window import (
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/core/window/__init__.py", line 1, in <module>
from pandas.core.window.ewm import (
File "/home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/core/window/ewm.py", line 11, in <module>
import pandas._libs.window.aggregations as window_aggregations
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /home/liaoc2/miniconda3/envs/spt/lib/python3.11/site-packages/pandas/_libs/window/aggregations.cpython-311-x86_64-linux-gnu.so)
Hm this error seems to be triggered by the very first import from pandas: from pandas import DataFrame
. Can you confirm that the same thing happens in your environment by doing just this import? E.g. in the Python interactive CLI:
from pandas import DataFrame
If so I guess you'll need to do some environment-specific thing to make pandas available. The pandas occurrence might be a misdirection, however, as there are probably a huge number of functions requiring glibc (I think it is the implementation of the C standard library). So if that is masked or not available somehow in the conda environment, that's not good.
We haven't specified package versions required because these plotting scripts were meant to be dev/testing use only, but perhaps we do need to pin the version because of issues like this.
I think I have been using:
pandas==2.2.1
Simply updating my pandas and gcc versions in conda weren't enough to fix the issue, so I deleted and remade my SPT condo environment to resolve it, verified that it works with run_all and graph_plugin_plots, and committed its yaml.
The SVG plots appear to have been constructed properly. Here they are for reference:
I also adjusted the tqdm calls to be simpler and complete at 100% instead of being one off. We may consider adding the files created by analysis_replication to .gitignore as well as urothelial.txt.tmp, which is currently version controlled despite its tmp extension.
Noting what we discussed in person:
Probably after refactoring the SubplotGenerator for style conformity/modularity, we should move the main classes into the SPT codebase in order that they may be reused in the live application. It seems that this will also require moving or copying a minimal amount of the "accessors" stuff (and cleaning it up accordingly).
I thought moving graph_plugin_plots.py into SPT would be simple, but I now see two issues:
How should we proceed? The simplest option would be to change analysis_replication
so that it requires the installation of SPT to run, or at least include the relevant files from SPT the system path.
Let's just postpone promoting this plotting code to SPT codebase. The idea is that eventually we will want to be able to generate these plots server-side in the live application, from something like those PlotSpecification
objects (equivalently, those JSONs). But we aren't doing that right away.
The finished refactor looks good to me.
Can you just confirm that the caching thing works for you as well? I deleted my pickles and re-ran graph_plugin_plots.py
, and it worked again for me.
Caching works on my end, so we should be good.
That said, on my WSL machine the plots are coming out a little odd. I checked and I get the same narrow second urothelial plot even if I revert my refactor, so maybe it's just a difference in hardware?
Yeah I wouldn't worry about it, as I remember it matplotlib has a ton of different backend rendering engines that it might use depending on the environment. For the future reproducible workflow we will pin the environment to resolve this.
Cool. Are we ready to merge?
Yep, tests pass for me.
To close #306.