kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.59k stars 459 forks

Issue with processFulltextDocument API: Frequent "503 Service Unavailable" Responses #1195

Open sdspieg opened 2 weeks ago

sdspieg commented 2 weeks ago

Operating System and architecture (arm64, amd64, x86, etc.)

x64/wsl

What is your Java version

OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu324.04.1)

I’m encountering an issue where my script for batch-processing PDFs using GROBID’s processFulltextDocument API frequently generates output files containing the following error message:

```html
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 503 Service Unavailable</title>
</head>
<body><h2>HTTP ERROR 503 Service Unavailable</h2>
<table>
<tr><th>URI:</th><td>/api/processFulltextDocument</td></tr>
<tr><th>STATUS:</th><td>503</td></tr>
<tr><th>MESSAGE:</th><td>Service Unavailable</td></tr>
<tr><th>SERVLET:</th><td>jersey</td></tr>
</table>
</body>
</html>
```

Script Setup: I'm running a script that uses requests with ThreadPoolExecutor to submit multiple PDFs to the API in parallel. Here is a sample of my code:

```python
import os
import requests
from tqdm.auto import tqdm
import concurrent.futures
import logging

# Define the target directory and subdirectories based on your provided structure
target_directory = '/mnt/g/My Drive/RuBase/Corpora/Russian-Ukrainian war'
raw_pdfs_path = os.path.join(target_directory, 'full_text')
processed_pdfs_path = os.path.join(target_directory, 'grobid_processed_pdfs')
logs_dir = os.path.join(target_directory, 'logs')

# Ensure the processed PDFs and logs directories exist
os.makedirs(processed_pdfs_path, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

# Set up logging to file
log_file_path = os.path.join(logs_dir, 'pdf_processing_log.log')
logging.basicConfig(filename=log_file_path, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Print directory setup information
print("Directory set for raw PDFs:", os.path.abspath(raw_pdfs_path))
print("Directory set for processed PDFs:", os.path.abspath(processed_pdfs_path))
print("Logs will be saved to:", os.path.abspath(log_file_path))

# Count the number of PDF files in the raw_pdfs directory (excluding subdirectories)
pdf_files = [
    f for f in os.listdir(raw_pdfs_path)
    if os.path.isfile(os.path.join(raw_pdfs_path, f)) and f.endswith('.pdf')
]
num_pdfs = len(pdf_files)
print(f"Number of PDF files to process: {num_pdfs}")

# URL for the GROBID service
url = 'http://localhost:8070/api/processFulltextDocument'

# Function to process a single PDF file
def process_pdf(pdf_file):
    pdf_file_path = os.path.join(raw_pdfs_path, pdf_file)
    json_file_path = os.path.join(processed_pdfs_path, pdf_file.replace('.pdf', '.json'))

    # Skip if the JSON file already exists
    if os.path.exists(json_file_path):
        message = f"Skipped {pdf_file} (already processed)"
        logging.info(message)
        return message

    try:
        with open(pdf_file_path, 'rb') as file:
            files = {'input': file}
            response = requests.post(url, files=files)

        # Write the response to the processed PDFs path
        with open(json_file_path, 'w', encoding='utf-8') as output_file:
            output_file.write(response.text)

        message = f"Processed {pdf_file} and saved output to {json_file_path}"
        logging.info(message)
        return message
    except Exception as e:
        message = f"Failed to process {pdf_file}: {str(e)}"
        logging.error(message)
        return message

# Process PDFs using concurrent futures
batch_size = 10  # Maximum number of concurrent requests; adjust as needed

with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
    futures = {executor.submit(process_pdf, pdf): pdf for pdf in pdf_files}
    for future in tqdm(concurrent.futures.as_completed(futures), total=num_pdfs, desc="Processing PDFs"):
        pdf = futures[future]
        try:
            result = future.result()
            tqdm.write(result)
        except Exception as e:
            error_message = f"Error processing {pdf}: {str(e)}"
            tqdm.write(error_message)
            logging.error(error_message)

print("All files have been processed.")
print(f"Check the log file for details: {os.path.abspath(log_file_path)}")
```

Frequency of 503 Errors: This "503 Service Unavailable" error occurs frequently, causing the script to create many output files that contain only the HTML error response instead of the expected output.

Error in GROBID Logs: The GROBID logs show recurring entries indicating high load, but the server's capacity and rate limits are unclear.

Questions:

Rate Limits: Are there any known rate limits or maximum concurrent request limits for the GROBID processFulltextDocument endpoint?

Server Tuning: Are there specific server or configuration adjustments (e.g., thread limits, queue sizes) recommended for handling large-scale batch requests?

Best Practices for Batch Processing: Any tips on structuring requests (e.g., delays or reduced concurrency) to minimize the risk of overloading GROBID?

Any guidance on configuring GROBID or adjusting my script to avoid this 503 error would be greatly appreciated. Thank you!

kermitt2 commented 2 weeks ago

Hello @sdspieg

Sending a 503 error is normal behavior for GROBID: it means the pool of threads is fully in use (the maximum number of parallel requests has been reached). As documented, the client has to wait a bit before sending new requests, until a thread becomes available again.
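In practice this means treating a 503 as a back-off signal rather than a failure. Below is a minimal client-side sketch (not from this thread; the helper names `backoff_delay` and `post_with_retry` are illustrative, not part of GROBID) that waits with exponential backoff and retries when the server reports an exhausted pool:

```python
import time
import requests

def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff: 2s, 4s, 8s, ... capped at `cap` seconds."""
    return min(base * (2 ** attempt), cap)

def post_with_retry(url, pdf_path, max_retries=5):
    """POST a PDF to GROBID, waiting and retrying whenever the
    server's thread pool is exhausted (HTTP 503)."""
    for attempt in range(max_retries):
        with open(pdf_path, 'rb') as f:
            response = requests.post(url, files={'input': f})
        if response.status_code != 503:
            return response  # success, or a non-retryable error
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"Still got 503 after {max_retries} retries: {pdf_path}")
```

Checking `response.status_code == 200` before writing the output file would also stop the HTML error page from being saved as if it were a successful result.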

For a reference implementation of how to use the service in parallel, please look at the GROBID clients, in particular the Python client; clients in other languages are also available.

The documentation also explains how to modify the size of the thread pool, to adapt it to the server running the service.
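For recent GROBID versions this is set in the service's `grobid.yaml` configuration file. The snippet below is a sketch from memory rather than an authoritative excerpt, so check the parameter names against the documentation for your GROBID version:

```yaml
grobid:
  # maximum number of requests processed in parallel; beyond this,
  # new requests receive 503 until a thread frees up
  concurrency: 10
  # maximum time (seconds) a request waits for a free thread
  # before the server answers with 503
  poolMaxWait: 1
```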