Closed bannanc closed 8 years ago
Another thing this sparked was that our 0th iteration should probably be the starting set of parameters before any changes are made, and then iterations 1 through the total number each have a change attempted.
@bannanc : Can you clarify this point:
We're slowly trying to move away from smarty being just atomtypes, as the chemical perception will eventually need to handle atoms, bonds, angles, and torsions. That means reference type will be a list of atom types and atoms should be general matches. For now the scoring will likely be the fraction of reference types matches since we're still comparing to current atom types.
I was there for the discussion yesterday so I THINK I get what you're saying, but not having looked at the code yet it's not totally obvious. You may need to explain "reference type" for me, and when you say that should be "a list of atom types" this superficially seems like it's going against what you said about moving away from it being just atomtypes. I'm guessing you mean that it will be a list of generalized matches or something along those lines?
Maybe break it out into a couple bullet points about where we are now, and a separate set about where we would need to go? Thanks.
I'm currently working on adding a "trajectory" file that will track changes in which "atomtypes" are saved with information for each of them: iteration number, index, SMARTS, reference types, parent, matches, molecules, 'something for scoring'
That's actually a ton of data if you want to store all the molecules each iteration. @cbayly13 may know of some compressed 2D molecule-storage formats we can use for that, but it also might be overkill since you can regenerate the typed molecules from the atom type definitions.
I'd advise steering clear of text format, since you'd have to write a formatter and a parser. If we could just pickle the objects you want each iteration, that might be easier. Maybe cPickle, pytables, or numpy would let you easily do that if you're planning to do the data analysis in Python too?
I'd advise steering clear of text format, since you'd have to write a formatter and a parser. If we could just pickle the objects you want each iteration, that might be easier. Maybe cPickle, pytables, or numpy would let you easily do that if you're planning to do the data analysis in Python too?
👍 for this. I know @cbayly13 seems to have a preference for .csv, but since I never do anything with spreadsheets if I can avoid it, I don't like it. Maybe we store it in a pythonic way, @bannanc, then give him an option to dump to csv if he really insists (I believe numpy has a nice "dump everything to text" routine whose name I forget at the moment that would probably do this for him).
We're slowly trying to move away from smarty being just atomtypes, as the chemical perception will eventually need to handle atoms, bonds, angles, and torsions. That means reference type will be a list of atom types and atoms should be general matches. For now the scoring will likely be the fraction of reference types matches since we're still comparing to current atom types.
I had originally intended to write a completely different piece of code to do that, since this code has no facility for sampling over parameters, and building this in would be difficult since this capability is not modular with the current design.
Another thing this sparked was that our 0th iteration should probably be the starting set of parameters before any changes are made, and then iterations 1 through the total number each have a change attempted.
You can add a write_current_sampler_state() function to write a "snapshot" of the sampler state (and associated data) to your trajectory, and make sure to call this before starting sampling but after computing the initial score.
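A minimal sketch of that ordering, under stated assumptions: `ToySampler`, its `compute_score` and `propose_change` methods, and the snapshot fields are hypothetical stand-ins and not smarty's actual API. Only the name `write_current_sampler_state` comes from the suggestion above. The point is that the snapshot for iteration 0 is written after the initial score is computed but before any change is attempted.

```python
import random


class ToySampler:
    """Hypothetical stand-in for the real sampler; not smarty's API."""

    def __init__(self, smarts_list, seed=0):
        self.smarts_list = list(smarts_list)
        self.rng = random.Random(seed)
        self.score = None
        self.trajectory = []  # one snapshot per iteration, kept in memory

    def compute_score(self):
        # Placeholder score; the real code would compare against reference types.
        return 1.0 / len(self.smarts_list)

    def propose_change(self):
        # Placeholder move: append a dummy child type.
        self.smarts_list.append("[#%d]" % self.rng.randint(1, 20))

    def write_current_sampler_state(self, iteration):
        # Copy the list so later moves cannot alter earlier snapshots.
        self.trajectory.append({
            "iteration": iteration,
            "smarts": list(self.smarts_list),
            "score": self.score,
        })

    def run(self, n_iterations):
        self.score = self.compute_score()
        self.write_current_sampler_state(0)  # initial state, before any change
        for iteration in range(1, n_iterations + 1):
            self.propose_change()
            self.score = self.compute_score()
            self.write_current_sampler_state(iteration)
        return self.trajectory


trajectory = ToySampler(["[*]", "[#6]"]).run(n_iterations=3)
```

With this layout, iteration 0 always holds the starting parameters, and iterations 1 through `n_iterations` each record the state after one attempted change.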
@cbayly13 seems to have a preference for .csv,
How would you cram a list of atom types---where the number might vary---into a fixed-number-of-columns CSV format?
Regarding expanding to sampling over bond, angle, and torsion types: I think we need to start a new thread on this to discuss how we should propose child SMIRKS types for bonds, angles, and torsions from parent types. We need to come up with an initial way to do this before we can move ahead implementing that.
@jchodera Ok, in that case, you can ignore my reference type comment.
I think most of what @cbayly13 and I were hoping for is just an organized output file, so the output isn't just written to the command line and then lost. It's possible that adding an issue here is overcomplicating it.
Guys, this thread is really mushrooming! The traj file we were talking about was simply meant to give easier access to the data stored in the tables in stdout, not to save the atom-typed states of all the molecules. From the tables we (presumably) have the info we need to regenerate a desired atom-typed set of molecules.
A csv file of the tables could be pulled into pandas, then sliced to home in on a specific state of parameters (=smarts) and analyzed in various ways. That's what I thought we were thinking about. The csv files would be fairly small but contain a lot of useful data, e.g. how many times did a specific smarts string appear?
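One way this could work despite the varying number of types per iteration is a "long" layout with one row per (iteration, index) pair, so the column count stays fixed. The sketch below is illustrative only: the `trajectory` records and column names are assumptions, not smarty's output, and it uses the standard-library `csv` module so it runs without extra dependencies. The same file could be loaded with `pandas.read_csv` for slicing.

```python
import csv
import io
from collections import Counter

# Hypothetical trajectory: the number of SMARTS varies between iterations.
trajectory = [
    {"iteration": 0, "smarts": ["[*]", "[#6]"]},
    {"iteration": 1, "smarts": ["[*]", "[#6]", "[#6X4]"]},
    {"iteration": 2, "smarts": ["[*]", "[#6]"]},
]

# Long format: one row per (iteration, index), so the column count is fixed.
buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["iteration", "index", "smarts"])
for snapshot in trajectory:
    for index, smarts in enumerate(snapshot["smarts"]):
        writer.writerow([snapshot["iteration"], index, smarts])

# Read it back and count in how many iterations each SMARTS appeared.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
appearances = Counter(row["smarts"] for row in rows)

# Slice out a single iteration (values come back from CSV as strings).
iteration_1 = [row["smarts"] for row in rows if row["iteration"] == "1"]
```

Extra per-type fields such as parent or match counts would become additional columns on the same rows.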
Let's open new issues for orthogonal topics, but discuss the "trajectory" writer here.
What do you want to do with the trajectory after it's written? Look at it by eye? Plot it? Analyze some statistics? That will help us figure out the best format.
I'm not sure how we would represent a variable-length list of SMARTS strings and their human readable descriptions in either a CSV file or a pandas table. We could have a different table for each iteration, but then it is hard to do the statistical analysis you are discussing.
I think some kind of pickling that would let you write a small block of code to analyze statistics and build pandas tables as output during the analysis stage might be best, but if you have some more concrete ideas about what you want to look at, another format might be better.
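As a rough illustration of the pickling idea, the sketch below appends one pickled record per iteration to a binary stream and reads them back for analysis. The record fields are hypothetical. Note that `cPickle` is a Python 2 module; in Python 3 the standard `pickle` module covers the same use.

```python
import io
import pickle

# Hypothetical per-iteration records; the field names are illustrative only.
snapshots = [
    {"iteration": 0, "smarts": ["[*]"], "score": 0.25},
    {"iteration": 1, "smarts": ["[*]", "[#6]"], "score": 0.50},
]

# Append one pickle per iteration to the same binary stream.
stream = io.BytesIO()  # stands in for a file opened in binary append mode
for snapshot in snapshots:
    pickle.dump(snapshot, stream, protocol=pickle.HIGHEST_PROTOCOL)


def read_snapshots(binary_stream):
    """Yield records one at a time until the stream is exhausted."""
    while True:
        try:
            yield pickle.load(binary_stream)
        except EOFError:
            return


# Analysis stage: reload the records and derive whatever statistics are needed.
stream.seek(0)
loaded = list(read_snapshots(stream))
n_types_per_iteration = [len(record["smarts"]) for record in loaded]
```

Because each record is a plain dictionary, variable-length lists of SMARTS need no special handling, and tables can be built from `loaded` at analysis time.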
@jchodera - OK, so we just talked about this internally and really, Caitlin and Christopher already worked out what they needed for the current problem internally and have it working. Basically they just needed something that they could store to a suitable file rather than having to scroll up through things that were printed to the screen, and they now have it sorted out how to get exactly what they want.
There are larger issues that are starting to come up in this thread about what we want to design for the future, and I think we can make those into separate issues when it's time to discuss them. But, they are not needed yet.
So, I'll mark this as closed but flag it for me to come back to it in a few days so we can address design issues that came up here as separate issues. Thanks.
I don't see a PR yet, so let's keep this open until the issue is really addressed with a merged PR.
Also, there is nothing to be handled "internally" here. This is a collaboration, so let's keep it collaborative. Good communication will help make that possible, and that involves good etiquette with issues to discuss features and pull requests to discuss their implementation. I know it seems like it slows things down for now, but it will make everything easier to start this off on the right foot by trying to be good about the fork-pull-merge model and one-topic-per-issue rule.
OK, that's fine, @jchodera .
Another issue to worry about a bit, though, is how do we make sure you focus your time/effort on the things which are a bottleneck for us -- the things that only you can do well -- and not on things that we can easily just handle for now on our own. We absolutely don't want to cut you out of anything, and we absolutely need you for significant design decisions - but as long as we're waiting on certain things from you, it doesn't make a whole lot of sense to us to have you involved with every detail of every minor thing we might want to look at. If you have to spend all your time dealing with a trajectory format we temporarily want just to store what Smarty is doing, and things of comparable importance, instead of the API design issues or getting SMIRFF XML working then that's going to be a huge loss. Maybe we can discuss how to avoid these potential pitfalls in our Tuesday call.
This specific issue was REALLY just about how to save information Smarty is already printing to screen so we can (a) look at it again later, or (b) later do statistics on it if we decide for some reason we need to do that. Christopher is currently just dumping it to a text file, which means it is ONLY useful for a human to scroll, so we realized we should do something to make it possible that we can parse it automatically. Then we also added some thoughts about additional info we might want saved, such as what parent types these came from, etc.
If you really insist, we can of course discuss this in great detail and I can get Caitlin and Christopher to explain exactly what they want to do here. But as I said, I'm just worried that if we spend too much time on it, that will keep you from getting us the things we can't do on our own. We need some way to avoid that.
@jchodera - I absolutely do take your point about one-topic-per-issue, though, and we'll try to do a better job of that in the future, as well as making sure our issues more clearly explain the context and what we are trying to do. I think this issue was helpful to @bannanc in clarifying what NOT to do. :)
We're already sold on the fork-pull-merge model.
Yes, I'm definitely still on the learning curve with GitHub issues. I think the immediate part of this was straightforward, and I introduced potential issues that we think might come up but that shouldn't have been on this issue.
No need to wait for me on little things. But I can't participate if you don't include me. Even just monitoring the traffic and chiming in occasionally is valuable---I will know what is going on then, and can intervene when needed.
You can assign issues to me that you need me to tackle.
@jchodera - sounds good. Should we use some sort of tag or label too to indicate whether we think it's something that's only a minor "we can proceed without much input" thing versus a "design decision" or "significant functionality" issue? i.e. maybe we should just make "major" and "minor" labels and maybe "design decision"? Or should it just be about whether we tag you?
Just self-assign issues you plan to tackle.
Labels sound great too.
I'm currently working on adding a "trajectory" file that will track changes in which "atomtypes" are saved with information for each of them.
Below are the things we're planning to include now. If you have suggestions, please let us know here so we have a record of it associated with the code.
iteration number, index, SMARTS, reference types, parent, matches, molecules, 'something for scoring'
We're slowly trying to move away from smarty being just atomtypes, as the chemical perception will eventually need to handle atoms, bonds, angles, and torsions. That means reference type will be a list of atom types and atoms should be general matches. For now the scoring will likely be the fraction of reference types matches since we're still comparing to current atom types.