Open prjemian opened 8 years ago
the initial email from Brian:
...
Attached is a short bit of code to provide a dictionary that tracks versions of
Python packages for provenance tracking as well as a few things that we
track in GSAS-II. I am not sure if there is an easy way to also introspect the
.dll/.pyd files that an application uses, but it would be worthwhile to see
... any additional ideas.
I am not sure this needs to be a stand-alone .py file, but I am also not sure
where it might make sense to stick this in our skeleton application.
Pete's reply:
How commonly is this particular code used in various Python packages?
What is *its* provenance (where did you get it)?
Compare this with:
https://www.euroscipy.org/2015/schedule/presentation/16/
If I understand how this is used, the module is added to a project
and is then called upon when writing a data file, to add additional
info to that file describing the suite of packages in use at the time.
Pete also wrote about the typical git workflow for collaborators to contribute:
In git, you would:
Brian's response to "from where did this code come"?
For GSAS-II we use something much cruder — with hardcoded package names
for the things we track, but I cannot tell you how valuable it is to store this sort of
info in a project file for when someone sends us something to debug.
I just wrote what I sent.
Also:
ReciPy looks very interesting, but is also a much heavier-weight package and
my quick reading is that it only looks for certain packages that it cares about.
I am not sure how parallel they are, but there are probably some cool things
to [consider] from that.
Pete commented on Brian's code example:
Your code is a good example to refactor from a method into a class.
If I correctly understood its intent.
Brian wrote to Doga:
Doga,
A while back you raised the issue of provenance tracking in the context
of next practices for package creation. I’d like to ask you to revisit this to
better evaluate what is needed. A prototype package for APS projects has
been created (http://pyprototype.readthedocs.org/en/latest/ and
https://github.com/prjemian/PyPrototype). As best as I can tell, this takes care
of the problem of interacting with Github to establish provenance of code from
the current repository, but as you know tracking external package versions is
also quite important. Towards that, Pete has located the following and I wrote the attached routine.
* https://github.com/recipy/recipy
* https://www.euroscipy.org/2015/schedule/presentation/16/
My feeling is that ReciPy is both too limited (tracked packages are hard-coded)
and too heavy-weight for our use and I don’t see why one would use that and
versioneer, but my code should also likely do more than it does now. I’d like to
ask you to review package/computing environment provenance and see what
else might be needed in an APS-tailored package. It would be great to have you
contribute code to the prototype, but even coming up with a list of useful features
to steal from ReciPy would be of value. On a related note, I have struck out on
finding introspection mechanisms for profiling the versions of the most important
.so/.dylib/.dll libraries relevant to a Python app’s results, but if you have any ideas
on that this would be useful. Even knowing the most relevant library names by
platform could be of value.
Brian
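A partial answer to the introspection question can be sketched with the standard library alone. This is only an illustration under stated assumptions: it assumes Python 3, the function name `loaded_extension_modules` is hypothetical, and it reports the file paths of compiled extension modules (.so/.pyd) that have been imported. It does not report the versions of those files, nor the shared libraries they link against.

```python
import importlib.machinery
import sys


def loaded_extension_modules():
    """Map module name -> file path for each imported compiled extension.

    Hypothetical helper, not part of the code attached to this thread.
    """
    # Platform-specific suffixes, for example '.so' or '.pyd'.
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES)
    found = {}
    # Copy the items first: sys.modules can change while we iterate.
    for name, module in list(sys.modules.items()):
        path = getattr(module, "__file__", None)
        if isinstance(path, str) and path.endswith(suffixes):
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in sorted(loaded_extension_modules().items()):
        print(name, path)
```

As with the attached routine, this only sees what has already been imported, so it would need to be called after all significant imports.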
provenanceTracker.py
'''Code to record provenance information for a Python app

This assumes that all significant imports have been done before
routine provenanceTracker.provenanceTracker() is called.
'''
import sys
import platform

__version__ = '0.0.1'


def provenanceTracker():
    '''Provides a dict listing versions of imported packages

    :returns: dict where key is name of package and value is the
       __version__ string or for a few known outliers some other variable
       that indicates the version.
    '''
    PackageVersions = {}
    PackageVersions['Python'] = sys.version.split()[0]
    PackageVersions['Platform'] = sys.platform+'|'+platform.architecture()[0]+'|'+platform.machine()
    # iterate over a copy of the items; this runs under Python 2 and 3
    for name,pkg in list(sys.modules.items()):
        try:
            PackageVersions[name] = pkg.__version__
            continue
        except AttributeError:
            pass
        # deal with a few known idiosyncratic packages
        if name == 'Image':
            PackageVersions[name] = pkg.VERSION
        elif name == 'PIL':
            PackageVersions[name] = pkg.PILLOW_VERSION
    return PackageVersions

# test this by calling it directly
if __name__ == '__main__':
    import provenanceTracker
    import matplotlib as mpl
    import sys
    import PIL
    import numpy
    import Image
    provenance = provenanceTracker.provenanceTracker()
    # print one "name version" pair per line, sorted by name
    for p in sorted(provenance): print('%s %s' % (p, provenance[p]))
Brian wrote back:
On Mar 28, 2016, at 11:58 AM, Pete Jemian <jemian@anl.gov> wrote:
>
> Your code is a good example to refactor from a method into a class.
If you get a chance, I’d like to have you explain to me why you would do this,
so I learn more. I have always believed in writing the most simple code that
gets the job done, so my feeling would be to stick with a simple function
unless a class adds more features or simplifies use or maintenance.
Pete responds to Brian:
It all depends on how it is intended to be used.
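To make the function-versus-class question concrete, here is one possible reading of the class suggestion. It is a sketch only and not Pete's actual design: the class name `ProvenanceTracker`, the `ALTERNATE_ATTRIBUTES` table, and the `refresh` argument are all hypothetical, and versions are converted to strings, which the attached function does not do. The only thing the class adds over the function is that the snapshot is taken once and reused, which matters only if the information is requested more than once.

```python
import platform
import sys


class ProvenanceTracker(object):
    """Collects package versions once and reuses the snapshot (hypothetical)."""

    # Modules known to keep their version in some other attribute.
    ALTERNATE_ATTRIBUTES = {"Image": "VERSION", "PIL": "PILLOW_VERSION"}

    def __init__(self):
        self._snapshot = None

    def collect(self):
        """Scan sys.modules and return a new dict of version strings."""
        versions = {
            "Python": sys.version.split()[0],
            "Platform": "|".join(
                [sys.platform, platform.architecture()[0], platform.machine()]
            ),
        }
        for name, module in list(sys.modules.items()):
            version = getattr(module, "__version__", None)
            if version is None and name in self.ALTERNATE_ATTRIBUTES:
                version = getattr(module, self.ALTERNATE_ATTRIBUTES[name], None)
            if version is not None:
                versions[name] = str(version)
        return versions

    def versions(self, refresh=False):
        """Return a copy of the cached snapshot, rescanning on request."""
        if self._snapshot is None or refresh:
            self._snapshot = self.collect()
        return dict(self._snapshot)
```

If the information is needed only once, just before writing a file, the plain function does the same job with less code.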
Doga Gursoy (welcome to the discussion) wrote:
For prototyping this is also an easy way: https://cookiecutter.readthedocs.org
Specifically: https://github.com/audreyr/cookiecutter-pypackage
Now that the discussion is up to date, I'll continue...
What is the intended purpose of the addition of provenance code in this prototype?
Will this method be called more than once each time the python package is used?
CookieCutter looks like a very useful tool. Different, I believe, than Brian's idea of provenance tracking.
Thinking of that old lesson: do the fishing or teach the fishing, the CookieCutter project and this PyPrototype project are on opposite ends of the lesson.
The PyPrototype project demonstrates the layout of a prototypical Python project. It is on the end of teaching how to fish. There will be some find/replace work to change each new copy of the prototype into a useful new project. Maybe that's too much work.
CookieCutter is very much on the end where the fishing is actually done. One uses it to create a new skeleton Python project with all the right names and such (or some other metaphorical cookie shape) according to a customized template.
Consider this: the PyPrototype project shows the pattern of the end result.
We could create a CookieCutter template to recreate the steps that provide the customized project.
What is the intended purpose of the addition of provenance code in this prototype?
I see this as allowing people to recreate the code environment that gave a particular result. I am assuming that, thanks to versioneer, one knows which version of the current code one is running (perhaps that should be integrated into provenanceTracker.py). But if a result changes because of, for example, a change in numpy, how does one track or recreate that?
I envision calling the one function in provenanceTracker.py before saving output. By including the returned dictionary one would document as much as possible of the software stack.
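The usage described above might look like the following sketch. It makes several assumptions that are not from the thread: Python 3, JSON as the output format, and the names `collect_versions` and `save_result`. `collect_versions` is a simplified stand-in for provenanceTracker.provenanceTracker() so that the example runs on its own.

```python
import json
import platform
import sys


def collect_versions():
    """Simplified stand-in for provenanceTracker.provenanceTracker()."""
    versions = {
        "Python": sys.version.split()[0],
        "Platform": "|".join(
            [sys.platform, platform.architecture()[0], platform.machine()]
        ),
    }
    for name, module in list(sys.modules.items()):
        version = getattr(module, "__version__", None)
        if version is not None:
            # str() keeps the record serializable if a version is not a string
            versions[name] = str(version)
    return versions


def save_result(path, result):
    """Write the result together with the software stack that produced it."""
    record = {"result": result, "provenance": collect_versions()}
    with open(path, "w") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

A call such as `save_result("result.json", {"chi_squared": 1.23})` (file name and values are made up) would store the package versions next to the numbers they produced, so a later change in numpy shows up as a difference between two saved files.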
Can we adopt PEP8 standards for the new projects? https://www.python.org/dev/peps/pep-0008/
Two things I have noticed:
Some pep8 standards but not all of them.
For example, errors on "E221 multiple spaces before operator" are just goofy. Sometimes we humans want to line up the equal signs in a block of assignments (such as in __init__.py).
Trailing whitespace on a line (W291) is benign. "E402 module level import not at top of file" flagged the __init__.py again for its handling of versioneer.
Mostly, pep8 is advisory but should be taken with a healthy skepticism. Another code checker, pylint, has a differing opinion and provides a better diagnostic (IMO) of code problems. It gives a score that can be used to measure improvement against a recommendation. Some projects require a minimum score for acceptance. The __init__.py file has a bad score: -0.67/10. I can improve that probably to +5/10 (or so I hope).
I hate W291
Is there a way to configure in automatic checking of some PEP8 standards?
[pep8]
ignore = W291
conda install pylint
much more valuable feedback and coaching from this tool than pep8
https://codeclimate.com is also useful and does this and some other error and readability checks for you.
@nicholas-aps does any project you've been involved in use provenance tracking for data processing? Any ideas or suggestions?
One project (of which I am aware) actively tracks provenance:
OK, I take that back. The Irena IgorPro macros maintained by Jan Ilavsky for SAS data maintain a journal record as part of the IgorPro project file. This is behind the scenes provenance logging.
Here, the provenance is recorded as data values in "wavenotes" (metadata string connected to every IgorPro array data object) and as a data processing activity logging notebook that Jan created within an IgorPro notebook structure.
Irena: http://usaxs.xray.aps.anl.gov/staff/ilavsky/irena.html
Otherwise, it has been discussed in two data standards projects:
* NeXus: http://download.nexusformat.org/doc/html/search.html?q=provenance
* canSAS: http://cansas-org.github.io/canSAS2012/search.html?q=provenance
The most progress made in these two projects was to assert the desire and importance of documenting provenance and to establish a location within a NeXus file to record it. That location would be within a NXprocess group (an event of data processing, reconstruction, or analysis) as a NXnote group.
:NXprocess: http://download.nexusformat.org/doc/html/classes/base_classes/NXprocess.html
:NXnote: http://download.nexusformat.org/doc/html/classes/base_classes/NXnote.html
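As a rough illustration of that location, the sketch below arranges a provenance dictionary as a NXnote inside a NXprocess, using a plain nested dict rather than an actual HDF5 file. Assumptions are stated loudly: the field names (`program`, `version`, `date` in NXprocess; `type`, `description`, `data` in NXnote) follow my reading of the two base classes linked above, while the function name, the group name `provenance`, the `@NX_class` key (standing in for the NX_class attribute), and the choice of JSON for the note data are all hypothetical. Writing the structure to a NeXus file is not shown.

```python
import datetime
import json


def nxprocess_with_provenance(program, version, provenance):
    """Describe a NXprocess group holding the provenance as a NXnote.

    Hypothetical helper: returns a nested dict, not an HDF5 object.
    """
    note = {
        "@NX_class": "NXnote",
        "type": "application/json",  # MIME type of the data field
        "description": "versions of Python packages in use",
        "data": json.dumps(provenance, sort_keys=True),
    }
    return {
        "@NX_class": "NXprocess",
        "program": program,
        "version": version,
        "date": datetime.datetime.now().isoformat(),
        "provenance": note,  # hypothetical group name
    }
```

Serializing the dictionary into the note's `data` field keeps the record readable by any NeXus tool, at the cost of needing a JSON parser to get individual package versions back out.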
Pete
On 3/31/2016 11:33 AM, Doga Gursoy wrote:
> @nicholas-aps does any project you've been involved in use provenance tracking for data processing? Any ideas or suggestions?
Brian Toby started an email discussion on Provenance tracking in Python. He included code that would be an enhancement for this prototype project.
To contribute that code, make a pull request.
Here follows the discussion...