Provenance tracking - Githubissues

prjemian commented 8 years ago

Brian Toby started an email discussion on Provenance tracking in Python. He included code that would be an enhancement for this prototype project.

To contribute that code, make a pull request.

Here follows the discussion...

prjemian commented 8 years ago

the initial email from Brian:

...

Attached is a short bit of code to provide a dictionary that tracks versions of 
Python packages for provenance tracking as well as a few things that we 
track in GSAS-II.  I am not sure if there is an easy way to also introspect the 
.dll/.pyd files that an application uses, but it would be worthwhile to see 
... any additional ideas. 

I am not sure this needs to be a stand-alone .py file, but I am also not sure 
where it might make sense to stick this in our skeleton application.

prjemian commented 8 years ago

Pete's reply:

How commonly is this particular code used in various Python packages?
What is *its* provenance (where did you get it)?

Compare this with:
https://www.euroscipy.org/2015/schedule/presentation/16/

If I understand how this is used, the module is added to a project 
and is then called upon when writing a data file, to add additional 
info to that file describing the suite of packages in use at the time.

prjemian commented 8 years ago

Pete also wrote about the typical git workflow for collaborators to contribute:

In git, you would:

- fork the prjemian/PyPrototype repository to briantoby/PyPrototype
  - easiest to do this with the GitHub web interface
  - there's a "fork" control at the top of the page
- clone briantoby/PyPrototype from GitHub to your Mac
- add the proposed file
  - copy that file into /src/PyPrototype/
  - commit this change on your hard disk
  - push your local repo back to GitHub: briantoby/PyPrototype
- make a "pull request" in prjemian/PyPrototype
- I'll look at your "PR"
- we'll talk via GitHub's issue management service
- end result: either I merge your PR or you close the PR

prjemian commented 8 years ago

Brian's response to "from where did this code come"?

For GSAS-II we use something much cruder — with hardcoded package names 
for the things we track, but I cannot tell you how valuable it is to store this sort of 
info in a project file for when someone sends us something to debug. 

I just wrote what I sent.

Also:

ReciPy looks very interesting, but is also a much heavier-weight package and 
my quick reading is that it only looks for certain packages that it cares about. 
I am not sure how parallel they are, but there are probably some cool things 
to [consider] from that.

prjemian commented 8 years ago

Pete commented on Brian's code example:

Your code is a good example to refactor from a method into a class.  
If I correctly understood its intent.

prjemian commented 8 years ago

Brian wrote to Doga:

Doga, 

  A while back you raised the issue of provenance tracking in the context 
of next practices for package creation. I’d like to ask you to revisit this to 
better evaluate what is needed. A prototype package for APS projects has 
been created (http://pyprototype.readthedocs.org/en/latest/ and 
https://github.com/prjemian/PyPrototype). As best as I can tell, this takes care 
of the problem of interacting with Github to establish provenance of code from 
the current repository, but as you know tracking external package versions is 
also quite important. Towards that, Pete has located the following and I wrote the attached routine. 
* https://github.com/recipy/recipy
* https://www.euroscipy.org/2015/schedule/presentation/16/

   My feeling is that ReciPy is both too limited (tracked packages are hard-coded) 
and too heavy-weight for our use and I don’t see why one would use that and 
versioneer, but my code should also likely do more than it does now. I’d like to 
ask you to review package/computing environment provenance and see what 
else might be needed in an APS-tailored package. It would be great to have you 
contribute code to the prototype, but even coming up with a list of useful features 
to steal from ReciPy would be of value. On a related note, I have struck out on 
finding introspection mechanisms for profiling the versions of the most important 
.so/.dylib/.dll libraries relevant to a Python app’s results, but if you have any ideas 
on that this would be useful. Even knowing the most relevant library names by 
platform could be of value. 

Brian
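
On the question of introspecting compiled libraries, here is a minimal Python 3 sketch (not part of the attached routine). It can at least list which compiled extension modules (.so/.pyd files) the interpreter has imported, using the standard library's importlib.machinery.EXTENSION_SUFFIXES. The function name loaded_extension_modules is hypothetical. It reports file paths only: it does not find versions, and it does not see the shared libraries those extensions link against.

```python
import importlib.machinery
import sys


def loaded_extension_modules():
    """Map module name -> file path for each imported compiled extension."""
    # Platform-specific suffixes such as '.so' or '.pyd'.
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES)
    found = {}
    # Copy sys.modules so the dict cannot change size while we iterate.
    for name, module in list(sys.modules.items()):
        path = getattr(module, "__file__", None)
        if isinstance(path, str) and path.endswith(suffixes):
            found[name] = path
    return found


# Report whatever compiled extensions happen to be loaded right now.
for name, path in sorted(loaded_extension_modules().items()):
    print(name, path)
```

The paths could be stored next to the package versions, so that someone debugging a result at least knows which binary files were in play.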

prjemian commented 8 years ago

provenanceTracker.py

'''Code to record provenance information for a Python app
This assumes that all significant imports have been done before
routine provenanceTracker.provenanceTracker() is called.
'''

import sys
import platform

__version__ = '0.0.1'
def provenanceTracker():
    '''Provides a dict listing versions of imported packages

    :returns: dict where key is name of package and value is the
      __version__ string or for a few known outliers some other variable
      that indicates the version.
    '''
    PackageVersions = {}
    PackageVersions['Python'] = sys.version.split()[0]
    PackageVersions['Platform'] = sys.platform+'|'+platform.architecture()[0]+'|'+platform.machine()
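    # Iterate over a copy: reading attributes can trigger imports that
    # change sys.modules during the loop. items() works on Python 2 and 3
    # (iteritems() exists only on Python 2).
    for name, pkg in list(sys.modules.items()):
        try:
            PackageVersions[name] = pkg.__version__
            continue
        except AttributeError:
            pass
        # deal with a few known idiosyncratic packages
        if name == 'Image':
            PackageVersions[name] = pkg.VERSION
        elif name == 'PIL':
            PackageVersions[name] = pkg.PILLOW_VERSION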
    return PackageVersions

# test this by calling it directly
if __name__ == '__main__':
    import provenanceTracker
    import matplotlib as mpl
    import sys
    import PIL
    import numpy
    import Image
    provenance = provenanceTracker.provenanceTracker()
    for p in sorted(provenance):
        # formatted print works on both Python 2 and Python 3
        print('%s %s' % (p, provenance[p]))
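
Not every package defines `__version__`. On Python 3.8 or later, a different technique from the one in the routine above is to ask the installed-package metadata through the standard library's importlib.metadata. The sketch below is only an illustration under that assumption, and the function name installed_versions is hypothetical. It looks up each imported top-level module name as if it were a distribution name, so packages whose two names differ (PIL and Pillow, for example) are silently skipped.

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def installed_versions():
    """Return {top-level module name: version} from installed-package metadata."""
    versions = {"Python": sys.version.split()[0]}
    # Reduce 'numpy.linalg' and friends to their top-level name 'numpy'.
    top_level = {name.split(".")[0] for name in list(sys.modules)}
    for name in sorted(top_level):
        if not name:
            continue
        try:
            version = metadata.version(name)
        except (metadata.PackageNotFoundError, ValueError):
            # standard-library module, or distribution named differently
            continue
        if version is not None:
            versions[name] = version
    return versions


for name, version in sorted(installed_versions().items()):
    print(name, version)
```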

prjemian commented 8 years ago

Brian wrote back:

On Mar 28, 2016, at 11:58 AM, Pete Jemian <jemian@anl.gov> wrote:
>
> Your code is a good example to refactor from a method into a class. 

If you get a chance, I’d like to have you explain to me why you would do this, 
so I learn more. I have always believed in writing the most simple code that 
gets the job done, so my feeling would be to stick with a simple function 
unless a class adds more features, simplifies use or maintenance.

prjemian commented 8 years ago

Pete responds to Brian:

It all depends on how it is intended to be used.

prjemian commented 8 years ago

Doga Gursoy (welcome to the discussion) wrote:

For prototyping this is also an easy way: https://cookiecutter.readthedocs.org

Specifically: https://github.com/audreyr/cookiecutter-pypackage

prjemian commented 8 years ago

Now that the discussion is up to date, I'll continue...

What is the intended purpose of the addition of provenance code in this prototype?

- for people who package Python code?
- for people who use Python code for data analysis and wish to document the components of the code at the time of the analysis?
- to document what is in the package?

Will this method be called more than once each time the python package is used?

prjemian commented 8 years ago

CookieCutter looks like a very useful tool. It is different, I believe, from Brian's idea of provenance tracking.

Thinking of that old lesson, do the fishing or teach the fishing: the CookieCutter project and this PyPrototype project are at opposite ends of the lesson.

The PyPrototype project demonstrates the layout of a prototypical Python project. It is on the end of teaching how to fish. There will be some find/replace work to change each new copy of the prototype into a useful new project. Maybe that's too much work.

CookieCutter is very much on the end where the fishing is actually done. One uses it to create a new skeleton Python project with all the right names and such (or some other metaphorical cookie shape) according to a customized template.

Consider this: the PyPrototype project shows the pattern of the end result.
We could create a CookieCutter template to recreate the steps that provide the customized project.

briantoby commented 8 years ago

> What is the intended purpose of the addition of provenance code in this prototype?

I see this as allowing people to recreate the code environment that gave a particular result. I am assuming that, thanks to versioneer, one knows which version of the current code one is running (perhaps that should be integrated into provenanceTracker.py), but if a result is changed by, for example, a change in numpy, how does one track or recreate that?

I envision calling the one function in provenanceTracker.py before saving output. By including the returned dictionary one would document as much as possible of the software stack.
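
A minimal sketch of that usage pattern, assuming the output is written as JSON: collect_versions is a hypothetical stand-in for provenanceTracker.provenanceTracker(), and the function name save_with_provenance and the "results"/"provenance" keys are likewise illustrative choices, not anything taken from GSAS-II or PyPrototype.

```python
import json
import sys


def collect_versions():
    """Hypothetical stand-in for provenanceTracker.provenanceTracker()."""
    versions = {"Python": sys.version.split()[0]}
    # Copy sys.modules so the dict cannot change size while we iterate.
    for name, module in list(sys.modules.items()):
        version = getattr(module, "__version__", None)
        if isinstance(version, str):
            versions[name] = version
    return versions


def save_with_provenance(path, results):
    """Write the results and a software-stack snapshot into one JSON file."""
    # Take the snapshot at save time, after all significant imports are done.
    record = {"results": results, "provenance": collect_versions()}
    with open(path, "w") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

Calling save_with_provenance("fit.json", {"chi_squared": 1.5}) would then store the version dictionary in the same file as the numbers it describes, so the two cannot become separated.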

dgursoy commented 8 years ago

Can we adopt PEP8 standards for the new projects? https://www.python.org/dev/peps/pep-0008/

Two things I have noticed:

prjemian commented 8 years ago

Some pep8 standards but not all of them.

For example, errors on "E221 multiple spaces before operator" are just goofy. Sometimes we humans want to line up the equal signs in a block of assignments (such as in __init__.py).

Trailing whitespace on a line (W291) is benign. "E402 module level import not at top of file" flagged __init__.py again for its handling of versioneer.

Mostly, pep8 is advisory and should be taken with healthy skepticism. Another tool, pylint, has a differing opinion and provides a better diagnostic (IMO) of code problems. It gives a score that can be used to measure improvement against a recommendation. Some projects require a minimum score for acceptance. The __init__.py file has a bad score: -0.67/10. I can probably improve that to +5/10 (or so I hope).

briantoby commented 8 years ago

I hate W291

Is there a way to configure automatic checking of some PEP8 standards?

dgursoy commented 8 years ago

http://pep8.readthedocs.org

[pep8]
ignore = W291

prjemian commented 8 years ago

conda install pylint

much more valuable feedback and coaching from this tool than pep8

dgursoy commented 8 years ago

https://codeclimate.com is also useful and does this and some other error and readability checks for you.

dgursoy commented 8 years ago

@nicholas-aps does any project you've been involved in use provenance tracking for data processing? Any ideas or suggestions?

prjemian commented 8 years ago

One project (of which I am aware) actively tracks provenance:

OK, I take that back. The Irena IgorPro macros maintained by Jan Ilavsky for SAS data maintain a journal record as part of the IgorPro project file. This is behind the scenes provenance logging.

Here, the provenance is recorded as data values in "wavenotes" (metadata string connected to every IgorPro array data object) and as a data processing activity logging notebook that Jan created within an IgorPro notebook structure.

Irena: http://usaxs.xray.aps.anl.gov/staff/ilavsky/irena.html

Otherwise, it has been discussed in two data standards projects:

- NeXus: http://download.nexusformat.org/doc/html/search.html?q=provenance
- canSAS: http://cansas-org.github.io/canSAS2012/search.html?q=provenance

The most progress made by these two was to assert the desire for, and importance of, documenting provenance and to establish a location within a NeXus file to record it. That location would be within an NXprocess group (an event of data processing, reconstruction, or analysis) as an NXnote group.

:NXprocess: http://download.nexusformat.org/doc/html/classes/base_classes/NXprocess.html

:NXnote: http://download.nexusformat.org/doc/html/classes/base_classes/NXnote.html
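
To make that location concrete, here is a rough sketch of the layout as nested Python dicts, before any HDF5 library gets involved. The field names program, version, date, type, description and data are taken from the NXprocess and NXnote base class definitions as I understand them. The "@NX_class" key is only my notation for the group's NX_class attribute, and the group name "provenance", the JSON encoding and the function name nxprocess_record are hypothetical choices.

```python
import datetime
import json


def nxprocess_record(program, version, package_versions):
    """Nested-dict sketch of an NXprocess group that holds one NXnote."""
    now = datetime.datetime.now().isoformat()
    return {
        "@NX_class": "NXprocess",
        "program": program,
        "version": version,
        "date": now,
        # Hypothetical group name; the NXnote carries the version dictionary.
        "provenance": {
            "@NX_class": "NXnote",
            "type": "application/json",
            "description": "versions of Python packages in use",
            "date": now,
            "data": json.dumps(package_versions, sort_keys=True),
        },
    }


record = nxprocess_record("demo", "0.0.1", {"Python": "3.11.4"})
print(json.dumps(record, indent=2))
```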

Pete

On 3/31/2016 11:33 AM, Doga Gursoy wrote:

> @nicholas-aps does any project you've been involved in use provenance tracking for data processing? Any ideas or suggestions?


prjemian / PyPrototype

Provenance tracking #4