Request to write dependency information

I am the maintainer of latexmk. The next version can be configured to support pythontex properly, without the previous limitations. (This was prompted by a user request.)

It would help latexmk (and other similar tools that automate the compilation of LaTeX files) if there were an easy way to determine the dependencies of a pythontex run. For example, if the user's code executes

pytex.add_dependencies("data.txt")

it would be useful for latexmk itself to know the file data.txt is a source file (i.e., a dependency) for the pythontex run, and thus a change in the file would cause latexmk to rerun pythontex.

The information is evidently written by pythontex to the pythontex_data.pkl file, but that is a binary file, and probably not appropriate for parsing by other software. So I would like to suggest that, for each dependency, pythontex writes a line in a standardized format to the .pytxmcr file, say

%PythonTeX dependency: 'data.txt'

It's a TeX comment, so it doesn't affect what happens when the .pytxmcr file is read in by the document during a LaTeX compilation.
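To illustrate how a build tool might consume such lines, here is a minimal Python sketch. It assumes the comment format proposed above, which is only a suggestion at this point and not an existing pythontex feature; the function name `read_dependencies` and the sample lines are made up for illustration.

```python
import re

# Assumed format, taken from the proposal above (hypothetical, not yet
# something pythontex writes): %PythonTeX dependency: 'FILENAME'
DEPENDENCY_RE = re.compile(r"^%PythonTeX dependency: '(.*)'\s*$")


def read_dependencies(lines):
    """Return the filenames named in dependency comment lines, in order."""
    deps = []
    for line in lines:
        match = DEPENDENCY_RE.match(line)
        if match:
            deps.append(match.group(1))
    return deps


# Made-up sample content standing in for a .pytxmcr file.
sample = [
    "%PythonTeX dependency: 'data.txt'",
    r"\relax",
    "%PythonTeX dependency: 'results/table 1.csv'",
]
print(read_dependencies(sample))
```

Because the pattern is greedy up to the last quote on the line, a filename that itself contains an apostrophe is still captured whole.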

I'm working on the next release, and will add this to the list of new features. I agree that adding the data in a comment in the .pytxmcr file is a logical approach.

A few additional considerations.

- Dependency paths have a leading ~ expanded to the user directory with Python's os.path.expanduser(). This works under all operating systems, including Windows.
- Currently, paths are saved with the original user-provided path component separator (/ or \). On subsequent runs, paths are always normalized to the current operating system after being loaded, using Python's os.path. I expect it would be best to save all paths to the .pytxmcr file with normal slashes, and then leave it up to build tools to normalize to the current operating system in the event that something needs to be done under Windows that actually requires a backslash.
- Dependencies are tracked by modification time by default, but this can be switched to SHA1. Would you want that additional information for latexmk? If so, it might make sense to put the dependencies in a simple format like JSON. Maybe something like this:
```
%BEGIN PythonTeX dependencies
%{
%    "data.txt": {"mtime": 1539611639.2679114},
%    ...
%}
%END PythonTeX dependencies
```
Or, for the SHA1 case:
```
%BEGIN PythonTeX dependencies
%{
%    "data.txt": {"hash.sha1": "e51b38284b3e9b932e16685ef9344effac3fce3c"},
%    ...
%}
%END PythonTeX dependencies
```
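For comparison, a reader for the JSON variant could look roughly like the following Python sketch. It assumes the `%BEGIN` / `%END` marker lines shown above and that every line in between is valid JSON once its leading `%` is removed (so the `...` placeholder in the example would not appear in a real file). The function name is hypothetical.

```python
import json

# Marker lines as shown in the proposed format above (an assumption, not
# an existing pythontex output format).
BEGIN = "%BEGIN PythonTeX dependencies"
END = "%END PythonTeX dependencies"


def read_json_dependencies(lines):
    """Collect the commented JSON between the markers and decode it."""
    inside = False
    payload = []
    for line in lines:
        text = line.rstrip("\n")
        if text == BEGIN:
            inside = True
        elif text == END:
            break
        elif inside:
            # Drop the leading TeX comment character before decoding.
            payload.append(text[1:] if text.startswith("%") else text)
    return json.loads("\n".join(payload)) if payload else {}


sample = [
    "%BEGIN PythonTeX dependencies",
    "%{",
    '%    "data.txt": {"mtime": 1539611639.2679114}',
    "%}",
    "%END PythonTeX dependencies",
]
print(read_json_dependencies(sample))
```

The extra bookkeeping (markers, comment stripping, a JSON decoder) is the price of carrying per-file metadata such as the modification time or hash.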

> I'm working on the next release, and will add this to the list of new features. I agree that adding the data in a comment in the .pytxmcr file is a logical approach.

Excellent.

> Dependency paths have a leading ~ expanded to the user directory

That's good. What matters for using a filename is that the filename should be usable as it stands in standard calls for opening files etc.

One other thing that does matter is the handling of relative filenames. Are they relative to the directory from which pythontex was started, or relative to pythontex's own working directory? The issue particularly arises if I run the LaTeX engine with an option to set the auxiliary or output directory. Suppose I do

pdflatex -output-directory=output test.tex

and then run

pythontex output/test

Then pythontex looks for dependency files relative to the output directory. (As is mentioned in other thread(s), this does not seem to be optimal behavior.) If a dependency file is named as "data.txt" in the .pytxmcr file, it is not obvious to a program reading the file whether data.txt is in the top-level directory or in the output directory.

One possibility is to always provide the absolute pathname, like MiKTeX normally does in its .fls and .log files. It would also be possible to specify pythontex's working directory in a separate comment line in the .pytxmcr file. It would be preferable, I think, for relative pathnames to be relative to the initial directory on entry to pythontex. Then the interpretation is clear to a reader of the dependency information without knowing the internal details of what pythontex does about its internal working directory.
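The ambiguity can be made concrete with a small Python sketch. The directory names are invented for illustration, and `posixpath` is used so that the result is the same on every platform; a real tool would more likely use `os.path`.

```python
import posixpath


def resolve_dependency(path, base_dir):
    """Interpret a dependency path relative to base_dir unless it is absolute."""
    if posixpath.isabs(path):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join(base_dir, path))


# Hypothetical directories: where pythontex was started, and the
# -output-directory used in the example above.
start_dir = "/home/user/project"
output_dir = posixpath.join(start_dir, "output")

# The same relative name points at two different files, depending on
# which base directory the reader assumes.
print(resolve_dependency("data.txt", start_dir))
print(resolve_dependency("data.txt", output_dir))
```

A reader of the dependency list has to know which of the two base directories applies, which is why a fixed convention (or an explicit working-directory line) matters.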

I have a preference to use relative pathnames instead of absolute pathnames, when that is possible. (There's a long story here ...!)

> Currently, paths are saved with the original user-provided path component separator (/ or \). On subsequent runs, paths are always normalized to the current operating system after being loaded, using Python's os.path. I expect it would be best to save all paths to the .pytxmcr file with normal slashes, and then leave it up to build tools to normalize to the current operating system in the event that something needs to be done under Windows that actually requires a backslash.

It's nice to have a consistent convention, and I have found normal slashes '/' are best. Those do work on MSWindows in all but certain genuinely exotic situations, provided filenames are quoted on command lines. As you say, build tools should expect to do their own normalization. (Latexmk normalizes filenames to '/'.)
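As a rough illustration of the convention being discussed, the following Python sketch expands a leading ~ and rewrites backslashes as forward slashes before a path would be saved. The function name is hypothetical, and the blanket replacement assumes the backslashes are Windows-style separators; on POSIX systems a backslash is a legal filename character, so a real implementation would need to decide how to treat that case.

```python
import os


def to_forward_slashes(path):
    """Expand a leading ~ and use '/' as the only path separator."""
    expanded = os.path.expanduser(path)
    # Assumption: every backslash is a separator, not part of a filename.
    return expanded.replace("\\", "/")


print(to_forward_slashes("figures\\plot data.csv"))
print(to_forward_slashes("data/raw.txt"))
```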

> Dependencies are tracked by modification time by default, but this can be switched to SHA1. Would you want that additional information for latexmk? If so, it might make sense to put the dependencies in a simple format like JSON. Maybe something like this:
> ```
> %BEGIN PythonTeX dependencies
> %{
> %    "data.txt": {"mtime": 1539611639.2679114},
> %    ...
> %}
> %END PythonTeX dependencies
> ```
> Or, for the SHA1 case:
> ```
> %BEGIN PythonTeX dependencies
> %{
> %    "data.txt": {"hash.sha1": "e51b38284b3e9b932e16685ef9344effac3fce3c"},
> %    ...
> %}
> %END PythonTeX dependencies
> ```

Latexmk doesn't need the hashes, since it has to compute its own for all other dependencies. It would need extra code to use something supplied by pythontex. But other tools might find the information useful.

The format is not critical as long as it is stereotyped, easy to parse, documented, and held stable (as much as possible) in the future.

A JSON format would make sense if enhancements are likely to be needed in the future.

On the other hand, a format along the lines I suggested

%PythonTeX dependency: 'data.txt'

is more robust against having special characters in the filename.

A further thought about the format of the dependency information in the .pytxmcr file: I think this may be a case for the KISS principle (i.e., Keep It Simple). That is, just write a simple sequence of comment lines, for example in the format I suggested. Only switch to the fancier JSON format (or something else) if experience shows that something more general is needed.

Note that the LaTeX engines write dependency information to an .fls file as a sequence of lines for INPUT files, OUTPUT files, etc., each just containing a filename. That has sufficed so far.
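For reference, reading that kind of line-oriented file takes very little code. The Python sketch below assumes the usual .fls layout of a keyword (such as PWD, INPUT or OUTPUT) followed by a space and a filename; the sample lines and the function name are invented for illustration.

```python
def read_fls(lines):
    """Split .fls-style lines into de-duplicated input and output file lists."""
    inputs, outputs = [], []
    for line in lines:
        # Everything after the first space is the filename, spaces included.
        kind, _, name = line.rstrip("\n").partition(" ")
        if kind == "INPUT" and name not in inputs:
            inputs.append(name)
        elif kind == "OUTPUT" and name not in outputs:
            outputs.append(name)
    return inputs, outputs


# Made-up sample content in the style of an .fls file.
sample = [
    "PWD /home/user/project",
    "INPUT test.tex",
    "OUTPUT test.log",
    "INPUT data file.txt",
    "INPUT test.tex",
]
print(read_fls(sample))
```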

gpoore / pythontex

Request to write dependency information #135