Parse PDFs as XML

karacolada commented 1 year ago

Possibly using Grobid? This would retain some structural information about the PDF, allowing us to check probable locations for the GitHub link (or, first of all, do a sample analysis of where relevant links are usually located).
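As a rough illustration of what the structural information would buy us, here is a minimal sketch that groups GitHub links by the section they appear in. It assumes the XML is TEI (namespace `http://www.tei-c.org/ns/1.0`) with each top-level section stored as a `div` under `body` and its title in a `head` element; the function name `github_links_by_section` and the link regex are hypothetical and only for illustration, and nested sections are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the parsed PDF is TEI XML using the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simple illustrative pattern for repository links (owner/repo only).
GITHUB_RE = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def github_links_by_section(tei_xml: str) -> list[tuple[str, str]]:
    """Return (section title, GitHub URL) pairs found in the document body."""
    root = ET.fromstring(tei_xml)
    results = []
    # Only direct children of <body> are treated as sections in this sketch.
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = head.text.strip() if head is not None and head.text else "(untitled)"
        # Concatenate all text inside the section, including inline elements.
        text = "".join(div.itertext())
        for match in GITHUB_RE.finditer(text):
            # Drop a trailing full stop picked up from the end of a sentence.
            results.append((title, match.group(0).rstrip(".")))
    return results
```

Counting the section titles over a sample of PDFs would give the "where are links usually located" analysis mentioned above.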

karacolada commented 1 year ago

I installed and built Grobid in our EIDF VM, but can't make it work. Both a local install+build of the service and a Docker service terminate without any error messages after about half a minute when being queried. There is nothing in the logs the docs point to, and the only message I get is a "remote disconnect" in the client script. Everything works fine when using the demo server, but I don't think we should try and use that for our analysis.

I've combed through the issues in Grobid's repository but have found nothing helpful so far.

I'm wondering how important this is to us - do we want to spend time trying to implement Grobid? It might be very time-consuming to make it work and then also to run it on each PDF (it wants GPUs but our VM doesn't have them). To be fair, I have not yet looked at a sample of our PDF data to manually check how well our full-text search for git links performs and how many false positives we get. Should we spend time on that before dealing with Grobid again?

karacolada commented 1 year ago

We can use the page the link was found on as a proxy for now, though we should definitely validate that method, since locating links by page is less common than locating them by section.
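A minimal sketch of the page-as-proxy idea, assuming the text of each PDF page has already been extracted into a list of strings (one per page). The function name `github_links_by_page`, the `relative_position` field and the link regex are hypothetical, not what the repository currently does.

```python
import re

# Simple illustrative pattern for repository links (owner/repo only).
GITHUB_RE = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def github_links_by_page(pages: list[str]) -> list[dict]:
    """Record each GitHub link with the 1-based page it was found on."""
    total = len(pages)
    hits = []
    for number, text in enumerate(pages, start=1):
        for match in GITHUB_RE.finditer(text):
            hits.append({
                # Drop a trailing full stop picked up from the end of a sentence.
                "url": match.group(0).rstrip("."),
                "page": number,
                # Position within the document, so PDFs of different
                # lengths can be compared (1.0 means the last page).
                "relative_position": number / total,
            })
    return hits
```

For validation, the page numbers from a manually checked sample could be compared against the sections the links actually sit in.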

softwaresaved / rse-repo-analysis

Parse PDFs as XML #4