Discuss format for/contents of output "trajectory" file

bannanc commented 8 years ago

I'm currently working on adding a "trajectory" file that will track changes in which "atomtypes" are saved with information for each of them.

Below are the things we're planning to include now, if you have suggestions please let us know here so we have a record of it associated with the code.

iteration number, index, SMARTS, reference types, parent, matches, molecules, 'something for scoring'

We're slowly trying to move away from smarty being just atomtypes, as the chemical perception will eventually need to handle atoms, bonds, angles, and torsions. That means reference type will be a list of atom types and atoms should be general matches. For now the scoring will likely be the fraction of reference types matches since we're still comparing to current atom types.

bannanc commented 8 years ago

Another thing, this sparked, was that our 0th iteration should probably be the starting set of parameters before any changes are made and then iterations 1 to total number each have a changes attempted.

davidlmobley commented 8 years ago

@bannanc : Can you clarify this point:

We're slowly trying to move away from smarty being just atomtypes, as the chemical perception will eventually need to handle atoms, bonds, angles, and torsions. That means reference type will be a list of atom types and atoms should be general matches. For now the scoring will likely be the fraction of reference types matches since we're still comparing to current atom types.

I was there for the discussion yesterday so I THINK I get what you're saying, but not having looked at the code yet it's not totally obvious. You may need to explain "reference type" for me, and when you say that should be "a list of atom types" this superficially seems like it's going against what you said about moving away from it being just atomtypes. I'm guessing you mean that it will be a list of generalized matches or something along those lines?

Maybe break it out into a couple bullet points about where we are now, and a separate set about where we would need to go? Thanks.

jchodera commented 8 years ago

I'm currently working on adding a "trajectory" file that will track changes in which "atomtypes" are saved with information for each of them. iteration number, index, SMARTS, reference types, parent, matches, molecules, 'something for scoring'

That's actually a ton of data if you want to store all the molecules each iteration. @cbayly13 may know of some compressed 2D molecule-storage formats we can use for that, but it also might be overkill since you can regenerate the typed molecules from the atom type definitions.

I'd advise steering clear of text format, since you'd have to write a formatter and a parser. If we could just pickle the objects you want each iteration, that might be easier. Maybe cPickle, pytables, or numpy would let you easily do that if you're planning to do the data analysis in Python too?

davidlmobley commented 8 years ago

I'd advise steering clear of text format, since you'd have to write a formatter and a parser. If we could just pickle the objects you want each iteration, that might be easier. Maybe cPickle, pytables, or numpy would let you easily do that if you're planning to do the data analysis in Python too?

👍 for this. I know @cbayly13 seems to have a preference for .csv, but since I never do anything with spreadsheets if I can avoid it, I don't like it. Maybe we store it in a pythonic way, @bannanc , then give him an option to dump to csv if he really insists (I believe numpy has a nice "dump everything to text" routine whose name I forget at the moment that would probably do this for him).

jchodera commented 8 years ago

We're slowly trying to move away from smarty being just atomtypes, as the chemical perception will eventually need to handle atoms, bonds, angles, and torsions. That means reference type will be a list of atom types and atoms should be general matches. For now the scoring will likely be the fraction of reference types matches since we're still comparing to current atom types.

I had originally intended to write a completely different piece of code to do that, since this code has no facility for sampling over parameters, and building this in would be difficult since this capability is not modular with the current design.

jchodera commented 8 years ago

Another thing, this sparked, was that our 0th iteration should probably be the starting set of parameters before any changes are made and then iterations 1 to total number each have a changes attempted.

You can add a write_current_sampler_state() function to write a "snapshot" of the sampler state (and associated data) to your trajectory, and make sure to call this before starting sampling but after computing the initial score.

jchodera commented 8 years ago

@cbayly13 seems to have a preference for .csv,

How would you cram a list of atom types---where the number might vary---into a fixed-number-of-columns CSV format?

jchodera commented 8 years ago

Regarding expanding to sampling over bond, angle, and torsion types: I think we need to start a new thread on this to discuss how we should propose child SMIRKS types for bonds, angles, and torsions from parent types. We need to come up with an initial way to do this before we can move ahead implementing that.

bannanc commented 8 years ago

@jchodera Ok, in that case, you can ignore my reference type comment.

I think most of what @cbayly13 and I were hoping for is just an organized output file so it wasn't just writing to the command line and then disappearing. It's possible that adding an issue here is over complicating it

cbayly13 commented 8 years ago

Guys this thread is really mushrooming! The traj file we were talking about was simply getting easier access to the data stored in the tables in stdout. Not saving the stom-typed states of all the molecules. From the tables we (presumably) have the info we need to regenerate a desired atom-typed set of molecules.

cbayly13 commented 8 years ago

csv file of the tables could be pulled into pandas, then sliced to home in on a specific state of parameters (=smarts) and then operated on to analyze various ways. That's what I thought we were thinking about. The csv files would be fairly small but contain a lot of useful data, e.g. how many times did a specific smarts string appear?

jchodera commented 8 years ago

Let's open new issues for orthogonal topics, but discuss the "trajectory" writer here.

What do you want to do with the trajectory after it's written? Look at it by eye? Plot it? Analyze some statistics? That will help us figure out the best format.

jchodera commented 8 years ago

I'm not sure how we would represent a variable-length list of SMARTS strings and their human readable descriptions in either a CSV file or a pandas table. We could have a different table for each iteration, but then it is hard to do the statistical analysis you are discussing.

I think some kind of pickling that would let you write a small block of code to analyze statistics and build pandas tables as output during the analysis stage might be best, but if you have some more concrete ideas about what you want to look at, another format might be better.

davidlmobley commented 8 years ago

@jchodera - OK, so we just talked about this internally and really, Caitlin and Christopher already worked out what they needed for the current problem internally and have it working. Basically they just needed something that they could store to a suitable file rather than having to scroll up through things that were printed to the screen, and they now have it sorted out how to get exactly what they want.

There are larger issues that are starting to come up in this thread about what we want to design for the future, and I think we can make those into separate issues when it's time to discuss them. But, they are not needed yet.

So, I'll mark this as closed but flag it for me to come back to it in a few days so we can address design issues that came up here as separate issues. Thanks.

jchodera commented 8 years ago

I don't see a PR yet, so let's keep this open until the issue is really addressed with a merged PR.

Also, there is nothing to be handled "internally" here. This is a collaboration, so let's keep it collaborative. Good communication will help make that possible, and that involves good etiquette with issues to discuss features and pull requests to discuss their implementation. I know it seems like it slows things down for now, but it will make everything easier to start this off on the right foot by trying to be good about the fork-pull-merge model and one-topic-per-issue rule.

davidlmobley commented 8 years ago

OK, that's fine, @jchodera .

Another issue to worry about a bit, though, is how do we make sure you focus your time/effort on the things which are a bottleneck for us -- the things that only you can do well -- and not on things that we can easily just handle for now on our own. We absolutely don't want to cut you out of anything, and we absolutely need you for significant design decisions - but as long as we're waiting on certain things from you, it doesn't make a whole lot of sense to us to have you involved with every detail of every minor thing we might want to look at. If you have to spend all your time dealing with a trajectory format we temporarily want just to store what Smarty is doing, and things of comparable importance, instead of the API design issues or getting SMIRFF XML working then that's going to be a huge loss. Maybe we can discuss how to avoid these potential pitfalls in our Tuesday call.

This specific issue was REALLY just about how to save information Smarty is already printing to screen so we can (a) look at it again later, or (b) later do statistics on it if we decide for some reason we need to do that. Christopher is currently just dumping it to a text file, which means it is ONLY useful for a human to scroll, so we realized we should do something to make it possible that we can parse it automatically. Then we also added some thoughts about additional info we might want saved, such as what parent types these came from, etc.

If you really insist, we can of course discuss this in great detail and I can get Caitlin and Christopher to explain exactly what they want to do here. But as I said, I'm just worried that if we spend too much time on it, that will keep you from getting us the things we can't do on our own. We need some way to avoid that.

davidlmobley commented 8 years ago

@jchodera - I absolutely do take your point though about one-topic-per-issue though and we'll try to do a better job of that in the future, as well as making sure our issues more clearly explain the context and what we are trying to do. I think this issue was helpful to @bannanc in clarifying what NOT to do. :)

We're already sold on the fork-pull-merge model.

bannanc commented 8 years ago

Yes, I'm definitely still on the learning curve with gitHub issues, I think the immediate part of this was straight forward and I introduced potential issues that we've think might come up, but shouldn't have been on this issue.

jchodera commented 8 years ago

No need to wait for me on little things. But I can't participate if you don't include me. Even just monitoring the traffic and chiming in occasionally is valuable---I will know what is going on then, and can intervene when needed.

You can assign issues to me that you need me to tackle.

davidlmobley commented 8 years ago

@jchodera - sounds good. Should we use some sort of tag or label too to indicate whether we think it's something that's only a minor "we can proceed without much input" thing versus a "design decision" or "significant functionality" issue? i.e. maybe we should just make "major" and "minor" labels and maybe "design decision"? Or should it just be about whether we tag you?

jchodera commented 8 years ago

Just self-assign issues you plan to tackle.

jchodera commented 8 years ago

Labels sound great too.

openforcefield / smarty

Discuss format for/contents of output "trajectory" file #3