neuropoly / bibeasy

Set of tools to manage academic bibliography
Apache License 2.0

Reformatting from gsheet to neuro.polymtl.ca #3

Closed by namgo 1 year ago

namgo commented 1 year ago

Hi all,

Just wanted to give a quick update on my work today.

After making sure I could reproduce the GitHub Actions configuration locally in a container (nothing super complicated), I did some research into what I believe is the best path forward to go from the provided gsheet to Markdown, so that it can be parsed by the publishing action on GitHub.

Pandoc has improved considerably over the past few years [1], and converting a BibTeX file to Markdown now appears trivial [2]. Bibeasy shouldn't need to be extended much for the GitHub Actions setup. I might store the gsheet URL as a GitHub secret or environment variable; this isn't strictly necessary because the document is public and read-only, but it would be a little tidier.
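As a minimal sketch of the environment-variable idea, the snippet below reads the sheet URL from the environment and falls back to a default. The variable name `BIBEASY_GSHEET_URL` and the helper `gsheet_url` are hypothetical and are not part of bibeasy.

```python
import os

# Hypothetical variable name; in a GitHub Actions workflow it could be filled
# from a secret or from a plain `env:` entry.
ENV_VAR = "BIBEASY_GSHEET_URL"


def gsheet_url(default=None):
    """Return the sheet URL from the environment, falling back to `default`."""
    url = os.environ.get(ENV_VAR, "").strip()
    if url:
        return url
    if default is not None:
        return default
    raise RuntimeError(f"{ENV_VAR} is not set and no default URL was given")
```

Since the sheet is public, a default baked into the script would also work; the environment variable only keeps the URL out of the code.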

A pipeline of scripts/bibeasy_cli.py, then scripts/csv2bibtex.py, then pandoc might be sufficient. Outputting to an intermediary CSV is mostly necessary if I don't modify anything in bibeasy itself. In the publish.yml GitHub Actions workflow, I think it would be enough to add a CSV generation routine and a CSV-to-BibTeX routine, and then push the result to pandoc, but I'll confirm locally that this is ideal first.
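To illustrate what the CSV-to-BibTeX step amounts to, here is a rough standard-library sketch. It is not the code in scripts/csv2bibtex.py; the column names (`key`, `author`, `title`, `journal`, `year`) are assumptions about the export, and values are written without any BibTeX escaping.

```python
import csv
import io

# Hypothetical column names; the real sheet export may use different headers.
FIELDS = ("author", "title", "journal", "year")


def row_to_bibtex(row):
    """Format one CSV row (a dict) as a BibTeX @article entry."""
    lines = [f"@article{{{row['key']},"]
    for field in FIELDS:
        value = (row.get(field) or "").strip()
        if value:  # skip empty cells instead of emitting empty fields
            lines.append(f"  {field} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


def csv_to_bibtex(csv_text):
    """Convert CSV text with a header row into a BibTeX string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n\n".join(row_to_bibtex(row) for row in reader) + "\n"
```

The resulting string could be written to a .bib file and handed to pandoc or Sphinx.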

The Markdown generated by pandoc -f bibtex -t markdown -o _publications.md (pandoc's BibTeX reader is named bibtex; plain bib is not a recognized input format) may fit with https://github.com/neuropoly/neuro.polymtl.ca/pull/81. If it works, it can be sourced from the other publications-related files. If it doesn't fit properly, I have a few ideas: the JS could be modified, or a more Markdown-oriented approach could work. Pandoc might not be necessary here, though, because Sphinx can import .bib files through an extension such as sphinxcontrib-bibtex (how it's formatted and whether it fits is another question).

Nick mentioned that git history for this project includes credentials from before the gsheet was made public and I've verified that this is the case. If the publish action includes bibeasy, it would be easiest to import it from a public repo. I think a reasonable solution would be to make any necessary modifications to the current repo, and then clone the repo stripping revision history (git clone --depth 1), upload it to neuropoly/ and import it from there in actions.

Edit: to clarify, after testing locally I'd like to have someone review the changes I make. I'm reasonably confident that this is an efficient way forward (it might not be!), but I feel a second pair of eyes is a good idea.

[1] https://tex.stackexchange.com/questions/171793/bibtex-to-html-markdown-etc-using-pandoc
[2] https://pandoc.org/MANUAL.html#pandocs-markdown

jcohenadad commented 1 year ago

> If the publish action includes bibeasy, it would be easiest to import it from a public repo. I think a reasonable solution would be to make any necessary modifications to the current repo, and then clone the repo stripping revision history (git clone --depth 1), upload it to neuropoly/ and import it from there in actions.

👍

jcohenadad commented 1 year ago

About pandoc: the current version of bibeasy already produces Markdown files that can readily be used on our website. So my suggestion would be to prioritize the automation (i.e., running bibeasy within a GitHub Action or something similar) instead of implementing another approach to generate the Markdown files.

namgo commented 1 year ago

> the current version of bibeasy does already produce the MD files that can readily be used in our website

You're right! I feel bad for missing this! With this in mind, the only modification needed right now is taking the path of labels_publication.txt as an argument. Running python bibeasy_cli.py -l ../labels_publication.txt --type article conf-article --reverse --freshen-cache --combine -o gen.md really just works.
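For reference, a parser accepting the flags used in that command could look roughly like the sketch below. This is an illustration with argparse, not bibeasy's actual argument handling; the long option names `--labels` and `--output` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags in the command shown above."""
    parser = argparse.ArgumentParser(prog="bibeasy_cli.py")
    # Path to the labels file, passed in rather than hard-coded.
    parser.add_argument("-l", "--labels", required=True,
                        help="path to labels_publication.txt")
    # One or more publication types, e.g. `article conf-article`.
    parser.add_argument("--type", nargs="+", default=[],
                        help="publication types to include")
    parser.add_argument("--reverse", action="store_true")
    parser.add_argument("--freshen-cache", action="store_true")
    parser.add_argument("--combine", action="store_true")
    parser.add_argument("-o", "--output", default=None,
                        help="output Markdown file")
    return parser
```

Passing the labels path explicitly keeps the script independent of the directory it is launched from, which matters once it runs inside a container or a workflow.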

GitHub Actions isn't something I've used regularly in the past, so I'll continue with Dockerfiles locally and then request a review from someone on the team, so I can migrate that to Actions safely.

namgo commented 1 year ago

[Screenshot: 2023-05-30-140527_1142x1144_scrot]

Here's what it looks like right now (edit: this is local still), with links added to the toctree manually. I think the only modification I'll make now is adding a top-level header to the markdown generation with the document's title. Doing that should get rid of the {2000..2023} links that were automatically added.
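The header change could be as small as the sketch below, which assumes the toctree takes its link text from the first heading of each generated file. The helper name `with_title` is hypothetical and not part of bibeasy.

```python
def with_title(markdown_text, title):
    """Prepend a level-1 heading unless the text already starts with one."""
    stripped = markdown_text.lstrip()
    if stripped.startswith("# "):
        # A top-level title is already present; leave the text untouched.
        return markdown_text
    return f"# {title}\n\n{stripped}"
```

For example, `with_title("## 2023\n\n- item\n", "Publications")` returns the same text with `# Publications` and a blank line in front of it.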

Is having gen- as a prefix to each generated file a problem? I could modify how the names are created if so, but the prefix doesn't appear in the website text, only in the URL.

A small issue I ran into: pandas removed DataFrame.append() in its 2.0 release, so the call is no longer available. I guess my initial test (on Alpine) still had an older version. I think the simplest fix is to just pin to the old API (which is still maintained), so pandas<2 in pip (pandas<=1.5 would also exclude the 1.5.x patch releases). Unfortunately, this might create long build times on non-mainstream distros, where pip may have to build pandas from source, so I also ended up doing apt install python3-pandas in my local Dockerfile, which should translate without issue to GitHub Actions.
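An alternative to pinning would be to replace the `.append()` calls with `pd.concat`, which works on both pandas 1.x and 2.x. The sketch below uses made-up columns and a hypothetical helper `append_row`; it is not taken from bibeasy's code.

```python
import pandas as pd


def append_row(df, row):
    """Return a new frame with `row` (a dict) added as the last row."""
    # pd.concat replaces the removed DataFrame.append(row, ignore_index=True).
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)


df = pd.DataFrame({"title": ["First paper"], "year": [2022]})
df = append_row(df, {"title": "Second paper", "year": 2023})
```

When many rows are added in a loop, collecting them in a list and calling `pd.concat` once at the end avoids copying the frame on every iteration.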

namgo commented 1 year ago

[Screenshot: 2023-05-30-151832_1300x1055_scrot]

I've committed and pushed my changes to remote branch "issue22" - I should have been doing that earlier, my mistake.

It doesn't have the sorting javascript but considering the data='descriptor' fields are already in the generator it shouldn't be hard. It's automated locally in Podman:

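# The FROM line was not included in this comment; the apt-get calls imply a Debian- or Ubuntu-based image.
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3 python3-pip python3-pandas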

COPY bibeasy/ /files/bibeasy/
COPY neuro.polymtl.ca/ /files/neuro.polymtl.ca/
RUN cd /files/neuro.polymtl.ca/ && pip install '.[sphinx]' 
RUN cd /files/bibeasy/ && pip install .

WORKDIR /files/neuro.polymtl.ca/publications/
RUN python3 /files/bibeasy/bibeasy/scripts/bibeasy_cli.py -l /files/bibeasy/bibeasy/labels_publication.txt --type article conf-article --reverse --freshen-cache --combine -o gen.md
WORKDIR /files/neuro.polymtl.ca/
RUN make html
WORKDIR /files/neuro.polymtl.ca/_build/html

Then I can just run python3 -m http.server (with port forwarding from the container) to check that it's working.

The modification to toctrees still applies, and requires that the filenames be consistent - which should be the case.

jcohenadad commented 1 year ago

@namgo this is very cool, but given that a GitHub Action will ultimately be used, wouldn't it be more straightforward to work directly with GitHub Actions? Here is the 'publish.yml' file: https://github.com/neuropoly/neuro.polymtl.ca/blob/master/.github/workflows/publish.yml. I presume it would take git cloning the bibeasy repo and running the command to generate the MD file.

The cherry on the cake would be to run this automatically (let's say, every week), so we don't have to rely on someone committing to update the publication list. But this is just an idea.

Actually we should probably move this discussion here

namgo commented 1 year ago

re github actions: Absolutely it is more effective to get github actions going, but I think my hesitation in doing so is that these are two separate projects at the moment and I wanted to make sure generation was happening properly. Now that it is, you're absolutely right that migrating to github actions is ideal.

namgo commented 1 year ago

Closing this as the discussion was moved.