huridocs / pdf-document-layout-analysis

A Docker-powered service for PDF document layout analysis. The service segments and classifies the different parts of PDF pages, identifying elements such as text, titles, pictures, and tables.
Apache License 2.0

Run the curl command in a for loop for multiple pdfs in a single folder #75

Open clewis96 opened 1 week ago

clewis96 commented 1 week ago

Hello,

I am currently using this repo (which is amazing by the way!!) to convert legal documents in PDF form to JSON and text, so that I have clean text for future sentiment and textual analysis work. However, I want to be able to run this on multiple PDFs in a single folder automatically. I wrote a bash script that incorporates your curl command to do that, but I am not very familiar with Docker, so I have not been able to get it to run properly.

Is there any way you could add a script that runs a for loop and converts all PDFs in a single folder to JSON files? Or could you help me get this script running within Docker? I think this feature could be useful beyond my use case, for anyone converting a big corpus of PDFs to text. Ideally, you could just point to the input folder where the PDFs are stored and the output folder where the JSON files should go. I have not been able to test my script because, as I mentioned, I have not been able to get it to run properly in Docker using the chmod +x command, but here is what I was thinking:

#!/bin/bash

# Directory containing the PDFs
PDF_DIR="/input_pdfs_test"

# Server URL
SERVER_URL="http://localhost:5060"

# Directory to store the output files
OUTPUT_DIR="/output"

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Make the glob expand to nothing (not the literal pattern) if no PDFs are found
shopt -s nullglob

# Loop through all PDF files in the directory
for pdf_file in "$PDF_DIR"/*.pdf; do
    # Extract the base name of the PDF file (e.g., "document.pdf" from "/pdfs/document.pdf")
    base_name=$(basename "$pdf_file")

    # Define the output file name (e.g., "document_output.json")
    output_file="$OUTPUT_DIR/${base_name%.pdf}_output.json"

    # Run the curl command for each PDF and store the output in the corresponding file
    curl -X POST -F "file=@${pdf_file}" "$SERVER_URL" > "$output_file"
done

Thank you so much for your help!

ali6parmak commented 1 week ago

Hi, you can use this Python script to do this:

import json
import subprocess
from pathlib import Path


def analyze_documents(pdfs_path: str, jsons_path: str):
    # Create the output folder if it does not exist yet
    Path(jsons_path).mkdir(parents=True, exist_ok=True)

    # Only process PDF files; skip anything else in the folder
    for pdf_file in sorted(Path(pdfs_path).glob("*.pdf")):
        # Send the PDF to the running service as a multipart form upload
        command = [
            "curl",
            "-X",
            "POST",
            "-F",
            f"file=@{pdf_file}",
            "localhost:5060",
        ]

        result = subprocess.run(command, capture_output=True, text=True)
        json_data = json.loads(result.stdout)

        # Save the response as <name>.json in the output folder
        output_file = Path(jsons_path) / f"{pdf_file.stem}.json"
        output_file.write_text(json.dumps(json_data, indent=4))


if __name__ == "__main__":
    pdfs_path = "/path/to/pdfs/folder"
    jsons_path = "/path/to/output/jsons/folder"
    analyze_documents(pdfs_path, jsons_path)

Don't forget to set pdfs_path and jsons_path to your own input and output folders. Hope this helps!
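
For a large corpus, it may also help to skip PDFs that already have a JSON output and to collect failures instead of stopping at the first bad response. The following is only a sketch along the lines of the curl-based script above. The function names pending_pdfs and analyze_pending are hypothetical and not part of the repository, and localhost:5060 is assumed to be the service address, as in the script above.

```python
import json
import subprocess
from pathlib import Path


def pending_pdfs(pdfs_path: str, jsons_path: str) -> list[tuple[Path, Path]]:
    """Pair each PDF with its output path, skipping PDFs that already have a JSON file."""
    out_dir = Path(jsons_path)
    pairs = []
    for pdf in sorted(Path(pdfs_path).iterdir()):
        # Ignore sub-folders and non-PDF files
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        target = out_dir / (pdf.stem + ".json")
        if not target.exists():
            pairs.append((pdf, target))
    return pairs


def analyze_pending(pdfs_path: str, jsons_path: str, url: str = "localhost:5060") -> list[Path]:
    """Send every pending PDF to the service; return the PDFs whose response was not valid JSON."""
    Path(jsons_path).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf, target in pending_pdfs(pdfs_path, jsons_path):
        # Same curl call as in the script above, with -sS to hide the progress meter
        command = ["curl", "-sS", "-X", "POST", "-F", f"file=@{pdf}", url]
        result = subprocess.run(command, capture_output=True, text=True)
        try:
            data = json.loads(result.stdout)
        except json.JSONDecodeError:
            # Empty or non-JSON output (e.g. the service is not running)
            failed.append(pdf)
            continue
        target.write_text(json.dumps(data, indent=4))
    return failed
```

Calling analyze_pending("/path/to/pdfs/folder", "/path/to/output/jsons/folder") a second time would then only retry the files that are still missing. Note that this sketch only checks that the response parses as JSON; it does not inspect what the service returned.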

gabriel-piles commented 1 week ago

@clewis96

Thank you for your input.

The script you shared works well. I had to create the output folder beforehand and use my own paths, but it successfully processed all the PDFs. If you can share the error text, we can try to help you.

We will incorporate this functionality into the service.

Have a great day!

clewis96 commented 1 week ago

Hi @gabriel-piles - thanks for trying this out! I added the bash script I gave you (named pdf_txt.sh) to the pdf-document-layout-analysis-main folder and then, in the Docker terminal, tried to build the script using chmod +x pdf_txt.sh. However, when I then try to run the script in the Docker terminal using .\pdf_txt.sh, it says: zsh: command not found: .pdf_txt.sh. That made me think I was not building this properly in Docker, because I didn't adjust or change the Dockerfile/Makefile.

gabriel-piles commented 1 week ago

Hi @clewis96,

Instead of running the script inside Docker, you can start the service using "make start" and then execute the script in a regular terminal (outside of Docker) with ./path/to/script.sh. You do not need to alter the Docker container build process for this to function.

The terminal might display "command not found" if you execute ".pdf_txt.sh" instead of "./pdf_txt.sh". If the script doesn't exist, the error message should be "no such file or directory".

I hope you find the fix.
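
Since the script runs outside Docker and only talks to the service over the network, a quick way to rule out connection problems before a batch run is to check that something is listening on the service port. This is a minimal sketch, assuming the service is exposed on localhost:5060 as in the scripts above. The function name service_is_up is hypothetical, and the check only confirms that the port accepts TCP connections, not that the service behind it is healthy.

```python
import socket


def service_is_up(host: str = "localhost", port: int = 5060, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If service_is_up() returns False, the service was probably not started (for example with "make start") or is listening on a different port.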

clewis96 commented 1 week ago

@gabriel-piles - I got it working! Thank you so much for your help. I am an engineer turned law student, so it has been a while since I have worked with code. Two comments: 1. I really do think this would be a great feature to add for anyone wanting to automate this on multiple PDFs in the future, and 2. I am wondering what motivating problem you all were trying to solve when you created this LLM and repo?

Thank you for your time and help!