Plotting manuscript progression for methods manuscript

agitter commented 3 years ago

For the ACM-BCB submission we could plot the following manuscript statistics over time:

Number of authors who added themselves to metadata
Number of references
Word count

All of these are available in the files variables.json and references.json in the output branch of the repo. Some quick Python experimentation shows how to access these values and the corresponding date:

>>> import json
>>> with open('variables.json', encoding='utf-8') as f:
...     variables = json.load(f)
...
>>> variables['pandoc']['date-meta']
'2021-04-28'
>>> variables['manubot']['date']
'April 28, 2021'
>>> len(variables['manubot']['authors'])
49
>>> variables['manubot']['manuscript_stats']['word_count']
131943
>>> with open('references.json', encoding='utf-8') as f:
...     references = json.load(f)
...
>>> len(references)
1428
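
The same lookups can be wrapped in a small function so they can be repeated for each commit. This is only a minimal sketch: the function name `load_stats` and the keys of the returned dictionary are made up for illustration, while the JSON keys are the ones shown in the session above.

```python
import json


def load_stats(variables_path="variables.json", references_path="references.json"):
    """Return the manuscript statistics stored in the two JSON files."""
    with open(variables_path, encoding="utf-8") as f:
        variables = json.load(f)
    with open(references_path, encoding="utf-8") as f:
        references = json.load(f)
    return {
        # ISO-formatted date, easier to sort than the 'April 28, 2021' form
        "date": variables["pandoc"]["date-meta"],
        "authors": len(variables["manubot"]["authors"]),
        "word_count": variables["manubot"]["manuscript_stats"]["word_count"],
        "references": len(references),
    }
```

Calling `load_stats()` in a directory that contains both files would return one dictionary per manuscript snapshot, which maps directly onto one row of a table.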

I don't know the most efficient way to get these for every commit in the output branch. However, with the output branch checked out,

git log --pretty=format:"%h" > output-commits.txt

writes the abbreviated hash of every commit to a text file that we could iterate over.

Pseudocode for an algorithm could look like:

dump all commits on the output branch
foreach output branch commit
    check out the commit in a subprocess
    load the JSON files as shown above and store the commit SHA-1, manuscript date, author count, word count, and reference count in a row in a dataframe
plot the data from the dataframe
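
The pseudocode above could be sketched in Python roughly as follows. This is a sketch under assumptions, not a tested pipeline: it assumes a clean clone with a local branch named `output`, that both JSON files sit at the repository root, and that commits missing the files or keys can simply be skipped. The helper names (`git`, `stats_in_working_tree`, `collect_rows`) are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def git(*args, repo="."):
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def stats_in_working_tree(repo="."):
    """Read the statistics from the JSON files currently checked out in `repo`."""
    root = Path(repo)
    variables = json.loads((root / "variables.json").read_text(encoding="utf-8"))
    references = json.loads((root / "references.json").read_text(encoding="utf-8"))
    return {
        "date": variables["pandoc"]["date-meta"],
        "authors": len(variables["manubot"]["authors"]),
        "word_count": variables["manubot"]["manuscript_stats"]["word_count"],
        "references": len(references),
    }


def collect_rows(branch="output", repo="."):
    """Check out each commit on `branch` in turn and collect one row per commit."""
    rows = []
    commits = git("log", "--pretty=format:%h", branch, repo=repo).split()
    try:
        for sha in commits:
            git("checkout", "--quiet", sha, repo=repo)
            try:
                row = stats_in_working_tree(repo)
            except (FileNotFoundError, KeyError):
                # Assumption: some commits may lack the files or keys; skip them.
                continue
            rows.append({"commit": sha, **row})
    finally:
        # Return to the branch so the clone is not left on a detached commit.
        git("checkout", "--quiet", branch, repo=repo)
    return rows
```

The rows come back newest first, in the order `git log` reports them. A list of dictionaries like this can be passed to `pandas.DataFrame(rows)` to get the dataframe for plotting.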

Doing this with a Python script would be messy because the git commands have to be issued through subprocess calls, but it's possible. I don't know the GitPython package well enough to do it that way instead. For example,

subprocess.run(["git", "checkout", "3839cc2e"], check=True)

will check out a specific commit from the output branch.
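
One possible way to avoid the checkout entirely is `git show <commit>:<path>`, which prints a file as it existed at a given commit without touching the working tree. The sketch below assumes the JSON files sit at the repository root; the function name `read_json_at_commit` is hypothetical.

```python
import json
import subprocess


def read_json_at_commit(commit, path, repo="."):
    """Parse `path` as it existed at `commit`, leaving the working tree alone."""
    result = subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        cwd=repo,
        check=True,  # raise CalledProcessError if the file is absent at that commit
        capture_output=True,
    )
    return json.loads(result.stdout.decode("utf-8"))
```

For example, `read_json_at_commit("3839cc2e", "variables.json")` would return the parsed variables for that commit, and the same call with `"references.json"` would return the reference list.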

rando2 commented 3 years ago

Thank you so much @agitter! I'm going to try to finish getting the text cleaned up (ish), then send it to authors and start working on parsing #17, and then hopefully take this on in the morning!

rando2 commented 3 years ago

Initial prototype is working 👍 Thanks for the point in the right direction, @agitter, those json suggestions make it WAY easier than what I was imagining (which involved a lot of regex).

This is obviously EXTREMELY ROUGH and needs to be visually cleaned up in basically every way, but it is a graph of the data!

agitter commented 3 years ago

That's amazing!

We should be able to flip the order of the dates and show fewer x-axis ticks (e.g. monthly) without too much trouble.
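
A rough sketch of both changes with matplotlib, assuming the data is a list of dictionaries with hypothetical keys `date` (as `YYYY-MM-DD`) and `word_count`. Sorting by parsed date puts the oldest commit on the left, and `MonthLocator` places one tick per month.

```python
from datetime import date


def sort_rows_by_date(rows):
    """Return the rows ordered oldest first, parsing 'YYYY-MM-DD' date strings."""
    return sorted(rows, key=lambda row: date.fromisoformat(row["date"]))


def plot_word_count(rows, out_path="word_count.png"):
    """Plot word count over time with one x-axis tick per month."""
    # matplotlib is imported here so the sorting helper works without it.
    import matplotlib

    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    rows = sort_rows_by_date(rows)
    dates = [date.fromisoformat(row["date"]) for row in rows]
    counts = [row["word_count"] for row in rows]

    fig, ax = plt.subplots()
    ax.plot(dates, counts)
    ax.xaxis.set_major_locator(mdates.MonthLocator())  # monthly ticks
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
    fig.autofmt_xdate()  # rotate the tick labels so they do not overlap
    ax.set_xlabel("Date")
    ax.set_ylabel("Word count")
    fig.savefig(out_path)
    plt.close(fig)
```

The same approach should work for the author and reference counts by swapping the key that is plotted.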

Can we account for the big spikes in word count? My first guess is that the initial big spike was adding the reviews as an appendix. Then the other sharp increase and decrease could be when you duplicated text to convert from a single paper to multiple papers, but I don't know whether the timing matches.
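
One way to check the timing would be to list the commits where the word count changes sharply and then inspect those commits. The sketch below assumes rows sorted oldest first with hypothetical keys `commit` and `word_count`; the 20% threshold is an arbitrary choice for illustration.

```python
def find_jumps(rows, threshold=0.2):
    """Return (previous commit, current commit, relative change) for each pair of
    consecutive rows whose word count changed by more than `threshold`."""
    jumps = []
    for prev, curr in zip(rows, rows[1:]):
        if prev["word_count"] == 0:
            continue  # relative change is undefined for an empty manuscript
        change = (curr["word_count"] - prev["word_count"]) / prev["word_count"]
        if abs(change) > threshold:
            jumps.append((prev["commit"], curr["commit"], change))
    return jumps
```

Each pair of hashes returned could then be passed to `git diff --stat` on the source branch's corresponding commits to see which files account for the change.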

Nevertheless, having the data plotted is very cool and helps make the point that a git-managed manuscript enables lots of inspection and analysis that is impossible with a typical writing process. Maybe not "impossible" with LaTeX, but it would be painful to analyze every commit without having these stats ready to go in json files.

rando2 commented 3 years ago

> Can we account for the big spikes in word count?

Yes! The first one is when we added the appendix and the second one is likely when we accidentally duplicated the appendix 😆 Unfortunately this makes it super clear that it sat there duplicated for a long time before anyone noticed! I think most of the text I duplicated for the manuscript splitting process is still duplicated (since I use blame pretty heavily while adding the attributions of text that I moved between documents!)

greenelab / covid19-review

Plotting manuscript progression for methods manuscript #952