Open clewis96 opened 1 week ago
Hi, you can use this Python script to do this:
import json
import subprocess
from os import listdir, makedirs
from os.path import join
from pathlib import Path


def analyze_documents(pdfs_path: str, jsons_path: str):
    makedirs(jsons_path, exist_ok=True)  # create the output folder if it does not exist
    for file in listdir(pdfs_path):
        if not file.endswith(".pdf"):  # skip anything that is not a PDF
            continue
        file_path = join(pdfs_path, file)
        command = [
            "curl",
            "-X",
            "POST",
            "-F",
            f"file=@{file_path}",
            "localhost:5060",
        ]
        result = subprocess.run(command, capture_output=True, text=True)
        json_data = json.loads(result.stdout)
        Path(join(jsons_path, file.replace(".pdf", ".json"))).write_text(json.dumps(json_data, indent=4))


if __name__ == "__main__":
    pdfs_path = "/path/to/pdfs/folder"
    jsons_path = "/path/to/output/jsons/folder"
    analyze_documents(pdfs_path, jsons_path)
Don't forget to set pdfs_path and jsons_path to your own input and output folders. Hope this helps!
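Before the batch run, it may help to sanity-check the service with a single request. This is just the one-file version of the request the script above sends per PDF; the document path is a placeholder:

```shell
# One-off version of the per-file request the script sends;
# the path is a placeholder, and the service is assumed to be
# listening on localhost:5060 as elsewhere in this thread.
curl -X POST -F "file=@/path/to/document.pdf" localhost:5060
```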
@clewis96
Thank you for your input.
The script you shared works well. I had to create the output folder beforehand and use my own paths, but it successfully processed all the PDFs. If you can share the error text, we can try to help you.
We will incorporate this functionality into the service.
Have a great day!
Hi @gabriel-piles - thanks for trying this out! I added the bash script I shared (named pdf_txt.sh) to the pdf-document-layout-analysis-main folder and then, in the Docker terminal, tried to make it executable with chmod +x pdf_txt.sh. However, when I then try to run the script in the Docker terminal using .\pdf_txt.sh, it says: zsh: command not found: .pdf_txt.sh - which made me think I just was not setting this up properly in Docker, because I didn't adjust or change the Dockerfile/Makefile.
hi @clewis96,
Instead of running the script inside Docker, you can start the service using "make start" and then execute the script in a regular terminal (outside of Docker) with ./path/to/script.sh. You do not need to alter the Docker container build process for this to function.
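For reference, a minimal sketch of such a loop script (the port, folder layout, and argument handling here are assumptions based on this thread, not the exact script in question) could look like:

```shell
#!/bin/bash
# Hypothetical pdf_txt.sh: send every PDF in an input folder to the service
# (assumed to listen on localhost:5060, as elsewhere in this thread) and
# write one JSON per PDF to an output folder.
# Usage: ./pdf_txt.sh /path/to/pdfs /path/to/jsons
pdfs_path="$1"
jsons_path="$2"
mkdir -p "$jsons_path"  # create the output folder if it does not exist
for pdf in "$pdfs_path"/*.pdf; do
    name="$(basename "$pdf" .pdf)"  # file name without folder or .pdf suffix
    curl -s -X POST -F "file=@$pdf" localhost:5060 > "$jsons_path/$name.json"
done
```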
The terminal displays "command not found" when you execute ".pdf_txt.sh" instead of "./pdf_txt.sh". If the script didn't exist, the error message would be "no such file or directory" instead.
I hope this fixes it for you.
@gabriel-piles - I got it working! Thank you so much for your help. I am an engineer turned law student, so it has been a while since I have worked with code. Two comments: 1. I really do think this would be a great feature to add for anyone wanting to automate this across multiple PDFs in the future, and 2. I am wondering what motivating problem you all were trying to solve when you created this LLM and repo?
Thank you for your time and help!
Hello,
I am currently using this repo (which is amazing by the way!!) to convert legal documents from PDF to JSON and text format, so that I have clean text for future sentiment and textual analysis work. However, I want to be able to run this on multiple PDFs in a single folder automatically. I wrote a bash script that incorporates your curl command to do that, but I am not very familiar with Docker, so I have not been able to get it to run properly.
Is there any way you could add a script that runs a for loop and converts all PDFs in a single folder to JSONs? Or help me get this script running within Docker? I think this feature could be useful beyond just my use case, for anyone converting a big corpus of PDFs to text. Ideally, you could just point to the input folder where the PDFs are stored and an output folder to store the JSON files. I have not been able to test my script since, as I mentioned, I have not been able to get it to run properly in Docker using the chmod +x command, but here is what I was thinking:
Thank you so much for your help!