greenelab / covid19-review

A collaborative review of the emerging COVID-19 literature. Join the chat here:
https://gitter.im/covid19-review/community

Plotting manuscript progression for methods manuscript #952

Open agitter opened 3 years ago

agitter commented 3 years ago

For the ACM-BCB submission we could plot the following manuscript statistics over time:

All of these are available in the files variables.json and references.json in the output branch of the repo. Some quick Python experimentation shows how to access these values and the corresponding date:

>>> import json
>>> variables = json.load(open('variables.json'))
>>> variables['pandoc']['date-meta']
'2021-04-28'
>>> variables['manubot']['date']
'April 28, 2021'
>>> len(variables['manubot']['authors'])
49
>>> variables['manubot']['manuscript_stats']['word_count']
131943
>>> references = json.load(open('references.json', encoding='utf-8'))
>>> len(references)
1428

I don't know the most efficient way to get these for every commit in the output branch. However,

git log --pretty=format:"%h" > output-commits.txt

dumps a list of all commits to a text file that we could iterate over.

Pseudocode for an algorithm could look like:

Doing this with a Python script would be messy due to the subprocess calls needed to issue git commands, but it's possible. I don't know the GitPython package well enough to do it that way. For example,

import subprocess
subprocess.run(["git", "checkout", "3839cc2e"], check=True)

will check out a specific commit from the output branch.
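
As a rough sketch of how the per-commit loop might look, the snippet below reads each file with `git show <commit>:<path>` instead of checking out every commit. It assumes variables.json and references.json sit at the root of the output branch and that it is run from inside a clone of the repo. The function names (`list_commits`, `read_json_at`, `summarize`, `collect_stats`) are made up for illustration and are not part of Manubot.

```python
import json
import subprocess


def list_commits(branch="output"):
    """Return abbreviated hashes for every commit on `branch`, newest first."""
    result = subprocess.run(
        ["git", "log", "--pretty=format:%h", branch],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def read_json_at(commit, path):
    """Load a JSON file as it existed at `commit`, without a checkout."""
    result = subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        capture_output=True, check=True,
    )
    return json.loads(result.stdout.decode("utf-8"))


def summarize(variables, references):
    """Pull out the date and the three counts discussed above."""
    manubot = variables["manubot"]
    return {
        "date": variables["pandoc"]["date-meta"],
        "authors": len(manubot["authors"]),
        "word_count": manubot["manuscript_stats"]["word_count"],
        "references": len(references),
    }


def collect_stats(branch="output"):
    """Build one row of statistics per commit that has both files."""
    rows = []
    for commit in list_commits(branch):
        try:
            # Assumed locations: both files at the root of the branch.
            variables = read_json_at(commit, "variables.json")
            references = read_json_at(commit, "references.json")
            rows.append({"commit": commit, **summarize(variables, references)})
        except (subprocess.CalledProcessError, KeyError, ValueError):
            # Older commits may lack a file or a key; skip them.
            continue
    return rows
```

Calling `collect_stats()` would return a list of dicts, one per commit, that could be written to a table or handed to a plotting script. Skipping commits that fail is a guess at reasonable behavior; early commits might predate some of these fields.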

rando2 commented 3 years ago

Thank you so much @agitter! I'm going to try to finish getting the text cleaned up (ish), then send it to authors and start working on parsing #17, and then hopefully take this on in the morning!

rando2 commented 3 years ago

Initial prototype is working 👍 Thanks for the point in the right direction, @agitter, those json suggestions make it WAY easier than what I was imagining (which involved a lot of regex).

This is obviously EXTREMELY ROUGH and needs to be visually cleaned up in basically every way, but it is a graph of the data!

[Image: Screen Shot 2021-04-30 at 8 16 13 AM]

agitter commented 3 years ago

That's amazing!

We should be able to flip the order of the dates and show fewer x-axis ticks (e.g. monthly) without too much trouble.
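
One possible way to do both, sketched with only the standard library: sort the records oldest-first and keep a single tick for the first record of each month. This assumes each record is a dict with a `date` string in the `YYYY-MM-DD` form that `date-meta` uses; the `monthly_ticks` name and the record layout are hypothetical.

```python
from datetime import date


def monthly_ticks(rows):
    """Sort rows oldest-first and pick one tick per calendar month.

    Returns (ordered_rows, tick_positions, tick_labels), where each
    position is an index into ordered_rows.
    """
    ordered = sorted(rows, key=lambda r: date.fromisoformat(r["date"]))
    positions, labels = [], []
    seen = set()
    for i, row in enumerate(ordered):
        month = row["date"][:7]  # "YYYY-MM"
        if month not in seen:
            seen.add(month)
            positions.append(i)
            labels.append(month)
    return ordered, positions, labels
```

The positions and labels could then be passed to whatever plotting library draws the figure as its x-axis tick locations and tick text.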

Can we account for the big spikes in word count? My first guess is that the initial big spike was adding the reviews as an appendix. Then the other sharp increase and decrease could be when you duplicated text to convert from a single paper to multiple papers, but I don't know whether the timing matches.
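
To check whether the timing matches, one option is to rank the changes in word count between consecutive records and look up what was committed around those dates. A minimal sketch, assuming each record is a dict with a `YYYY-MM-DD` `date` string and an integer `word_count` (the `largest_jumps` name and record layout are made up for illustration):

```python
def largest_jumps(rows, top=3):
    """Return the `top` biggest word-count changes between consecutive records.

    Each result is (delta, previous_date, current_date), ordered by the
    absolute size of the change, so sharp decreases are reported too.
    """
    # ISO-formatted date strings sort chronologically as plain text.
    ordered = sorted(rows, key=lambda r: r["date"])
    changes = []
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr["word_count"] - prev["word_count"]
        changes.append((delta, prev["date"], curr["date"]))
    return sorted(changes, key=lambda c: abs(c[0]), reverse=True)[:top]
```

The dates it returns would only narrow down where to look; confirming the cause still means reading the commits from that period.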

Nevertheless, having the data plotted is very cool and helps make the point that a git-managed manuscript enables lots of inspection and analysis that is impossible with a typical writing process. Maybe not "impossible" with LaTeX, but it would be painful to analyze every commit without having these stats ready to go in json files.

rando2 commented 3 years ago

> Can we account for the big spikes in word count?

Yes! The first one is when we added the appendix and the second one is likely when we accidentally duplicated the appendix 😆 Unfortunately this makes it super clear that it sat there duplicated for a long time before anyone noticed! I think most of the text I duplicated for the manuscript splitting process is still duplicated (since I use blame pretty heavily while adding the attributions of text that I moved between documents!)