Closed deriteidavid closed 3 years ago
First of all I think we should have a trivially intuitive way for the users to access the attractors and the succession diagram once it is generated. In the "run.py" code there is a nice code where the attractors are turned into a pandas dataframe.
import pandas as pd
df_attractors=pd.DataFrame(columns=[node for node in diag.unreduced_primes])
df_attractors=df_attractors.merge(pd.DataFrame(diag.attractor_fixed_nodes_list),how="outer")
df_attractors.index=["Attractor_"+str(i) for i,value in enumerate(diag.attractor_fixed_nodes_list)]
df_attractors=df_attractors.fillna("X").astype(str).replace({"0.0": "0", "1.0": "1"}).sort_index(axis=1)
df_attractors.to_csv("Attractors.csv")
We could turn this into a function withing the succession class, such as attractor_summary_dataframe()
that returns the dataframe.
What do you think?
While I agree that we need to improve the way attractors and so on are organized, I am not sold on the pandas approach -- isn't there a way to get the job done without an extra dependency?
What about a json approach? This is what PyBoolNet uses, so it would make sense that we try to follow a similar format. We could take a closer look at how they output attractors in general and try to mirror that to the extent possible.
The json (dictionary) approach is a good idea, however (in my sample of people using python) pandas is very common. We can have both as options. I'll propose an function for both and we can decide later what to keep.
I don't dispute that pandas is commonly used. But pandas can already import from json, so I don't see why both approaches would be needed. It would just be one more thing to maintain & version check going forward. What are the benefits to using a pandas DataFrame for this output? Wouldn't a more lightweight structure be sufficient?
If we want to save the data, we can export to json or the user can pickle the SuccessionDiagram object. If we want to use the data, that should be possible from the SuccessionDiagram object directly (and if it's not obvious how to use that class, we need to fix that first).
I do see your point of using JSON to be consistent with PyBoolNet and the extra dependency maintenance problem.
But, personally, I think having the output in the simplest human (and excel) readable format out-weights this. I think it is in the best interest of most users, who would just want to put in boolean rules and get as output a file with attractors + stable motifs + reprogramming interventions in a readable format. This is what the java StableMotifs program did, and the this was the reasoning behind it. I would also say this is even the case for me for when I want to do a first pass on a model.
I agree that we need some better tools for human-readable and csv outputs and that it should be easy to get an attractors DataFrame. Where I disagree is that I don't think we need pandas as a dependency to do this. I envision something like:
sm.write_attractor_csv(diag,"test.csv")
myFrame=df.read_csv("test.csv")
or,
myFrame=df.read_json( sm.attractor_json(diag) )
We could put stuff like this in a tutorial or example so people know they can easily use pandas if they want to, but pandas wouldn't be a dependency.
That being said, you might be able to persuade me if you have good answers to these questions:
As an aside, before writing the export functions for any format I think we should revise the way attractors are counted and stored internally (see #31). The code from run.py can miscount in certain situations.
I think I understand now exactly what you mean. I agree that something like what you suggest for the data frame output is a good way to avoid having to include pandas.
I am not committed to having to use pandas, and I do not think we necessarily need to use it to accomplish what you suggest. But I am not familiar enough with the other libraries we have to know if they can or how to do it. This is why one of the first things I did when I started running benchmarks was write the code Dávid mentioned above. For me, a dataframe format is a natural-enough way to store the attractors that has the advantage of being direclty human readable. And if we are using dataframes, I think pandas will make our (and other's) lives easier, particularly because it plays nicely with Jupyter notebooks. But I can definitely see why a JSON/dictionary approach might be preferable because of its compactness and because of the way stable motifs, fixed nodes, etc. are currently being stored.
I do get the idea that we want to keep this library neat and with as few dependencies as possible. However, it's a compromise to make things also somewhat "user friendly". We could have the exact same argument about using NetworkX to export the succession diagram, as it brings in an extra dependency, and storing network structure in json or a simple edge-list is also a clean way. I will attempt to answer Jordan's questions:
- What is it that we need pandas to do that we can't achieve with built-in csv/json tools and our other dependencies?
We shouldn't write anything into a csv and read it back again just to make the information human readable.
- Why should we export our data directly to pandas when pandas already has a bunch of its own import and conversion tools (i.e., why is the solution above, or something similar, not enough)?
As Jorge pointed out, most people will use Jupyter notebooks, and having a pandas df available without the hassle of making conversions is just convenient. When it comes to exporting stuff, that's where I think pandas has a significant advantage; especially if the model they analyze has funky variable names. I believe making exports for our own purposes it's also advantageous to use pandas.
Once again, if the main priority is keeping things neat and "pythonic" I too can be convinced.
I would like to raise another UX issue (we can create a separate thread for this) regarding the succession diagrams. Right now there is a number of functions in the Succession class that do some network related transformation, but for me, as a "user" it's unclear which does what. I managed to get it from run.py that I need two operations to get a "classic" succession diagram in a network form:
diag.process_networkx_succession_diagram(include_attractors_in_diagram=True)
G=diag.G_motif_based_labeled
The graph in the second line is empty unless we call the first function. I suggest the following modifications: I don't see a reason to store G_motif_based_labeled
in the diag object unless it's used by a bunch of other functions. In that case, the graph should be generated when the diagram gets built (build_succession_diagram). Also, in this case, we should return a copy to the user, not the pointer so they can play around with it safely.
Otherwise, the process_networkx_succession_diagram(include_attractors_in_diagram=True)
should return the graph object when called. I don't see any reason why it shouldn't be labeled by default. This applies to both kinds of succession diagrams.
Whatever the details, my main point is that it should be really easy to access the succession diagrams. In the vein of the pandas argument, I think a non-NetworkX edge-list or json format should be available too.
If you agree I'm happy to help with the modifications.
I have several points to make, so I'll try to stay organized.
My preference is to keep the scope as narrow as possible and the dependencies as few as possible, but thoroughly document how to interface with libraries via examples and tutorials. For me, every dependency is presumed superfluous until proven necessary.
I'm okay with networkx dependencies for two reasons: i) PyBoolNet already requires it, and ii) we rely on several networkx algorithms internally. In for a penny, in for a pound.
I am not as familiar with the succession diagram plotting code as Jorge is, but it seems like those issues are different enough that a separate thread would be a good idea.
I agree that the pathway (object) -> (text) -> (disk) -> (pandas) is silly if you're working in a notebook, but it might make sense in other circumstances e.g., collaborators working at different points in a pipeline. The (object) -> (text) -> (pandas) doesn't seem so bad to me, especially since it's a one-liner. Because of how the pandas importer works, either option will work if the other does (I think).
Python's print function and string class are reasonably powerful. I would be surprised if we couldn't generate some very nice human-readable output with it. Something like the top answer, or the python 3 version further down, here: https://stackoverflow.com/questions/9535954/printing-lists-as-tabular-data
I think that any dependency that is not essential to the core algorithm should be optional. So if we do end up including pandas, we should set it up so that it runs without pandas until you try to export to pandas, at which point it fails gracefully. Something like this:
try:
import pandas.DataFrame as pandas_df
except ImportError:
_has_pandas = False
else:
_has_pandas = True
def output(diag, type): if type == "json":
elif type == "csv":
elif type == "pandas": if not _has_pandas: raise ImportError( "Must install pandas to export to pandas data frame." ) else: return pandas_df.read_csv(output(diag,"csv"))
I would prefer pandas construction to use a set-in-stone standard format like json or csv so there's less chance of things breaking in 50 years.
By the way, my first job involved translating and/or replacing deprecated 50-year-old Fortran code that was supposed to "last forever". The people I was working for still relied on that code, and it had already been translated from punchcards once! So I've been burned by excessive dependencies before :)
From comments in other threads, it seems like a solution roughly similar to the one I suggested above (in 5.) and having pandas as an optional dependency is an acceptable compromise. Is everyone OK with that?
We don't have to decide on all the details until after the new attractor repertoire classes are implemented (#37).
This seems very reasonable to me.
On Fri, May 8, 2020 at 1:11 PM Jordan Rozum notifications@github.com wrote:
From comments in other threads, it seems like a solution roughly similar to the one I suggested above (in 5.) and having pandas as an optional dependency is an acceptable compromise. Is everyone OK with that?
We don't have to decide on all the details until after the new attractor repertoire classes are implemented (#37 https://github.com/jcrozum/StableMotifs/issues/37).
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jcrozum/StableMotifs/issues/34#issuecomment-625920283, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADLEUXJVXAQCUBU2K2FZGATRQQ4MHANCNFSM4MRZIG7A .
-- Reka Albert Distinguished Professor of Physics and Biology Pennsylvania State University 104 Davey Laboratory, PMB 261 University Park, PA 16802 Phone: (814) 865-1141 Web: https://www.ralbert.me https://www.ralbert.me
Hi all. I added self.attractor_dict= dict()
to Succession class and a function that fills it generate_attr_dict(self,nodes,attractor_fixed_nodes_list,oscillation_mark='X')
. This gets called during the construction of the diagram and generates a dict with integer keys going from 0 to nr_of_attractors-1. The values are dictionaries representing the states with all nodes of the model present in each attractor but the ones that oscillate are marked with "X" by default. This dict can easily be turned into a data frame (to export and such), and the function can be used separately as well. I'm using this attr_dict to add attractor leaves to the networkx succession diagrams. This will simplify significantly that part of the code and I think makes things a bit more accessible.
These changes are pushed into the isse38 branch (not yet merged). I also think this can be adapted if the attractor class is done.
Please let me know if you have any thoughts or suggestions.
This thread is out of date, as we have added in the AttractorRepertoire class. I will rephrase the issue with our updated organization:
In the Export.py module, we want a function that takes an AttractorRepertoire object as input and returns a pandas data frame. That contains (at a minimum) the contents of the AttractorRepertoire.summary() output.
I don't know who added this or when, but we have sm.Export.attractor_dataframe that does exactly this, so I'll close the issue.
This should be a thread where we discuss the user experience issues and possible improvements.