collect bibliography from source files

johnyf commented 7 years ago

In tulip, full bibliographic references are placed in the file doc/bib.txt from which the user guide bibliography is generated automatically as described in the tulip developer's guide. References from within individual source files mention keys from the centralized bibliography doc/bib.txt.

Over time there have been private discussions about this approach. I prefer to place plain-text full references in each source file (example). The purpose is to reduce the coupling between different modules/subpackages/developers introduced by "importing" a shared bibliography.

Placing the reference entries in the source file makes it independent of the bibliography file. The source file is also readable as plain text, without running any documentation generation tool. In that sense it is self-contained and independent of the package that the module belongs to. Relocating a module from one package to another then does not require remembering to move the references too (something I could forget).

I have debated the potential duplication with myself and decided that in this particular case I don't mind it. In retrospect, I observe that no duplication has arisen anywhere (or so I think). There is a reason: we should not implement the same algorithm twice, but instead import and reuse it whenever possible. In more detail, algorithms that are sufficiently complex or specialized to justify citing a reference are usually implemented carefully and in one place. (Alternative implementations in Python, C, etc. may introduce duplication, but the references may appear in only one of them, the most readable.)

The question that arises naturally from these remarks is how to obtain a centralized bibliography from a decentralized one. I would like to propose automatically collecting references from individual source files. The format of entries in each file need not be strict. The most robust way is probably to mention the DOI or a URL wherever applicable, and raise warnings during bibliography collection if a bibliographic entry cannot be confirmed from online archives. Checking entries introduces the requirement for access to the internet, but it can be optional. Also, warnings can be generated when similar entries are detected (title match or lexical distance).
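The warning for similar entries could be sketched as below, using the ratio from the standard-library difflib as the measure of lexical distance. The function names and the 0.85 threshold are assumptions made for illustration, not part of any existing tool, and the threshold would need tuning on real entries.

```python
import difflib
import itertools
import warnings


def normalize(entry):
    # Lowercase and collapse whitespace, so that line breaks and
    # capitalization do not affect the comparison.
    return ' '.join(entry.lower().split())


def find_similar(entries, threshold=0.85):
    # Return index pairs (i, j), i < j, of entries whose similarity
    # ratio is at least `threshold` (hypothetical default).
    pairs = []
    for (i, a), (j, b) in itertools.combinations(enumerate(entries), 2):
        matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b))
        if matcher.ratio() >= threshold:
            pairs.append((i, j))
    return pairs


def warn_similar(entries, threshold=0.85):
    # Emit one warning per pair of suspiciously similar entries.
    for i, j in find_similar(entries, threshold):
        warnings.warn(f'entries {i} and {j} look similar')
```

Comparing all pairs is quadratic in the number of entries, which should be acceptable for the size of bibliography that a single package accumulates.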

A search suggests that the package duecredit is intended for collecting references from source files. I haven't evaluated it yet.

carterbox commented 7 years ago

I agree that keeping the citations with the implementation makes more sense. However, if the citations are provided in BibTeX format instead of styled plain text such as MLA or Chicago, they are easier to reuse or extract for print formatting.

johnyf commented 7 years ago

I understand the motivation for BibTeX syntax and agree with it. However, my main motivations are:

Readability in source form. Plain text is somewhat more readable than BibTeX. When reading source files (from the disk, or browsing GitHub), plain text in a docstring is easier for a human to parse.
Few to no constraints when writing. I do not want additional constraints such as checking that all fields are present, errors from missing fields, or enforcing some style. There should be enough information for the reader to find the resource, while being permissive with respect to syntax. I don't support the opposite extreme (an entirely ad hoc style) either, but rather a reasonable listing of authors, title, conference/journal/institution name, date, and preferably vol/issue/pages.

I think permissiveness is the main objective.

The main purpose of using BibTeX is to ease parsing. Parsing plain text is more difficult. A middle solution is:

Include the DOI, so that scanning docstrings for DOI URLs suffices for most entries. This is a simple lexical search.
If the DOI is not known, then use either plain text or BibTeX. Separate entries by blank lines, so that splitting into separate entries is easy. Try to parse each entry as BibTeX. If that fails, then keep the plain text as is. Further processing can try to use the first line as authors and the second as title. If that works, it can report possible duplicates (unlikely to occur).

Regular expressions should suffice for this approach, together with a convention of placing references at the bottom of a docstring, above any epytext.
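A minimal sketch of this middle solution follows, assuming that docstrings are extracted with the standard-library ast module and that DOIs follow the common "10.<registrant>/<suffix>" shape. The regular expression and all function names are illustrative assumptions, and the BibTeX parsing step is omitted.

```python
import ast
import re

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", and a suffix
# without whitespace. Real DOIs can be more varied than this pattern.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def iter_docstrings(source):
    # Yield the docstrings of the module, its classes, and its functions.
    tree = ast.parse(source)
    kinds = (ast.Module, ast.ClassDef,
             ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, kinds):
            doc = ast.get_docstring(node)
            if doc:
                yield doc


def find_dois(docstring):
    # Lexical search for DOIs, dropping trailing sentence punctuation.
    return [m.rstrip('.,;)') for m in DOI_RE.findall(docstring)]


def split_entries(reference_block):
    # Split a block of references into entries separated by blank lines.
    parts = re.split(r'\n\s*\n', reference_block.strip())
    return [p.strip() for p in parts if p.strip()]


def guess_authors_title(entry):
    # Heuristic from the discussion: first line as authors, second as title.
    # Return None when the entry has fewer than two lines.
    lines = [line.strip() for line in entry.splitlines() if line.strip()]
    if len(lines) < 2:
        return None
    return lines[0], lines[1]
```

Entries for which `find_dois` returns nothing would fall through to `split_entries` and `guess_authors_title`, and entries for which the heuristic returns `None` would be kept as plain text.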

Regarding duecredit, it requires altering code by importing duecredit and using its decorator or functions. I prefer an external approach that only analyzes source code. The problem that duecredit attempts to address is collecting the references relevant to the call graph that arises from a specific user's use of the code. It is an interesting case of introspection, but a different problem.

johnyf commented 6 years ago

Addressed in ed5abb6ca47731931f26d3742a7dadafa4f067d6.

Detecting references sprinkled in natural language form throughout docstrings is a task for a dedicated package in another area of research. It is simpler and more explicit (PEP 20) to collect all references in a BibTeX file, similar to the approach in the package tulip. Using \cite in docstrings is reasonable and familiar (tool support for such docstring markup is a future task). The choice of BibTeX key can to some degree reveal which reference is mentioned.

Regarding the concern expressed above about the (rare) migration of modules from one package to another, the references that need to follow suit can be detected by comparing which keys occur in \cite within docstrings but are missing from the BibTeX file of the new host package.
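This comparison could be sketched as follows, assuming that citations appear as \cite{key} or \cite{key1, key2} and that BibTeX entries start with "@type{key,". The regular expressions and function names are assumptions for illustration, not an existing tool.

```python
import re

# Assumed markup: \cite{key} or \cite{key1, key2} inside docstrings.
CITE_RE = re.compile(r'\\cite\{([^}]*)\}')
# Assumed entry header: "@type{key," at the start of each BibTeX entry.
BIB_KEY_RE = re.compile(r'@\w+\s*\{\s*([^,\s]+)\s*,')


def cited_keys(source_text):
    # Collect every key that occurs inside a \cite{...} in the text.
    keys = set()
    for group in CITE_RE.findall(source_text):
        keys.update(k.strip() for k in group.split(',') if k.strip())
    return keys


def bibtex_keys(bibtex_text):
    # Collect the keys of all entries defined in a BibTeX file.
    return set(BIB_KEY_RE.findall(bibtex_text))


def missing_keys(source_text, bibtex_text):
    # Keys cited in the source but absent from the BibTeX file.
    return cited_keys(source_text) - bibtex_keys(bibtex_text)
```

Running `missing_keys` on the text of a migrated module and the BibTeX file of its new host package would list the entries that still need to be copied over.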

tulip-control / polytope

collect bibliography from source files #43