scienceverse / papercheck

Check Scientific Papers for Best Practices
https://scienceverse.github.io/papercheck/

convert folder of pdf files into Grobid xml #5

Closed Lakens closed 1 month ago

Lakens commented 4 months ago

Generate xml files

1. Install Docker and run it.

2. Run Grobid using the following command in the command line:

   docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
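Before sending any PDFs, it can help to confirm that the container has finished starting. The following is a minimal sketch, assuming Grobid's `isalive` health-check endpoint is reachable on the same port as in the command above. The helper name `grobid_is_alive` is hypothetical and not part of the original script.

```r
library(httr)

# Hypothetical helper: TRUE if the Grobid server answers its health check,
# FALSE if the request fails or the server cannot be reached.
grobid_is_alive <- function(base_url = "http://localhost:8070") {
  tryCatch({
    response <- GET(paste0(base_url, "/api/isalive"), timeout(5))
    status_code(response) == 200
  }, error = function(e) FALSE)
}

if (!grobid_is_alive()) {
  message("Grobid is not reachable on port 8070; check that the Docker container is running.")
}
```

Grobid can take a little while to load its models after `docker run`, so a FALSE result right after starting the container may only mean that it is not ready yet.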

# Load the packages
library(httr)
library(purrr)

process_pdf <- function(pdf_path) {
  # Set the URL for the API endpoint
  url <- "http://localhost:8070/api/processFulltextDocument"

  # Create a named list with the file path
  file_path <- list(input = upload_file(pdf_path))

  # Make the POST request
  response <- POST(url, body = file_path, encode = "multipart", verbose())

  # Check if the request was successful
  if (http_status(response)$category == "Success") {
    # Save the response content to a file next to the PDF
    content <- content(response, as = "raw")
    output_file <- gsub("\\.pdf$", ".xml", pdf_path)
    writeBin(content, output_file)
    cat("xml saved to:", output_file, "\n")
  } else {
    cat("Error:", http_status(response)$reason, "\n")
  }
}

# Get list of PDF files in folder (backslashes must be doubled in R strings)
pdf_files <- list.files(path = "C:\\Users\\DLakens\\surfdrive - Lakens, Daniël@surfdrive.surf.nl\\R\\download_articles_code_and_data\\sage_open", pattern = "\\.pdf$", full.names = TRUE)

# Process each PDF file
walk(pdf_files, process_pdf)
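When the folder is large, re-running the script converts every PDF again, including those that already have an XML file. The sketch below filters the list first, assuming the XML is saved next to the PDF under the same base name, as `process_pdf` does. The helper names `xml_path_for` and `pending_pdfs` are hypothetical, and the demo uses empty placeholder files in a temporary folder, so it needs no Grobid server.

```r
# Hypothetical helper: the XML path that process_pdf() would write for a PDF.
xml_path_for <- function(pdf_path) {
  sub("\\.pdf$", ".xml", pdf_path)
}

# Hypothetical helper: keep only the PDFs that do not yet have an XML file.
pending_pdfs <- function(pdf_files) {
  pdf_files[!file.exists(xml_path_for(pdf_files))]
}

# Demo with placeholder files: a.pdf already has a.xml, b.pdf does not.
demo_dir <- tempfile("pdf_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.pdf", "b.pdf", "a.xml")))
demo_pdfs <- list.files(demo_dir, pattern = "\\.pdf$", full.names = TRUE)
basename(pending_pdfs(demo_pdfs))  # only "b.pdf" still needs converting
```

With these helpers, `walk(pending_pdfs(pdf_files), process_pdf)` would convert only the files that are still missing an XML file.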