Load closest match pdf url

jonfroehlich commented 1 month ago

I'm updating some code in https://github.com/makeabilitylab/makeabilitylabwebsite/issues/1166 that will regenerate some filenames. I'm worried about how this might create inaccurate links back to our website. I've already found instances using Google Search Console. For example, two of these links won't work after our file renames are completed:

So, then I had the idea of writing a feature where--rather than loading a 404 page--if the specified PDF url is not found, we look for the closest PDF name in our database. I asked Copilot to help... and this seems like a good start:

Copilot

That sounds like a useful feature! You can achieve this by using a combination of Django views and a string similarity algorithm like Levenshtein distance to find the closest match. Here's a step-by-step guide:

1. Install the python-Levenshtein library:

pip install python-Levenshtein

2. Create a function to find the closest match:

import os
from Levenshtein import distance as levenshtein_distance

def find_closest_match(requested_filename, directory):
    files = os.listdir(directory)
    closest_match = None
    min_distance = float('inf')

    for file in files:
        if file.endswith('.pdf'):
            dist = levenshtein_distance(requested_filename, file)
            if dist < min_distance:
                min_distance = dist
                closest_match = file

    return closest_match if min_distance <= 5 else None  # Adjust threshold as needed

3. Create a Django view to handle the request:

from django.http import HttpResponse, Http404
from django.conf import settings
import os

def serve_pdf(request, filename):
    directory = os.path.join(settings.MEDIA_ROOT, 'pdfs')
    closest_match = find_closest_match(filename, directory)

    if closest_match:
        file_path = os.path.join(directory, closest_match)
        with open(file_path, 'rb') as pdf:
            response = HttpResponse(pdf.read(), content_type='application/pdf')
            response['Content-Disposition'] = f'inline; filename="{closest_match}"'
            return response
    else:
        raise Http404("PDF not found")

4. Update your urls.py:

from django.urls import path
from .views import serve_pdf

urlpatterns = [
    path('pdfs/<str:filename>/', serve_pdf, name='serve_pdf'),
]

This setup will attempt to find the closest matching PDF file based on the filename requested by the user and serve it if the similarity is within the defined threshold. Adjust the threshold in the find_closest_match function as needed to suit your requirements. Would you like more details on any specific part of this process?
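
To get a feel for how the distance threshold behaves before wiring it into a view, here is a minimal sketch. It makes two assumptions that are not part of the suggestion above: a hand-written `edit_distance` function stands in for `python-Levenshtein`, and an in-memory list of made-up filenames stands in for `os.listdir`. The helper name `closest_pdf` is also hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def closest_pdf(requested, filenames, max_distance=5):
    """Return the nearest .pdf name within max_distance edits, else None."""
    scored = [(edit_distance(requested, name), name)
              for name in filenames if name.endswith(".pdf")]
    if not scored:
        return None
    dist, best = min(scored)  # ties are broken alphabetically by name
    return best if dist <= max_distance else None


# Made-up filenames, for illustration only
names = ["Froehlich_Example_CHI2020.pdf", "Smith_Other_UIST2019.pdf", "notes.txt"]
print(closest_pdf("Froehlich_Example_CHI_2020.pdf", names))  # one edit away
print(closest_pdf("completely_unrelated.pdf", names))        # nothing within 5 edits
```

A fixed threshold of 5 is fairly generous for short filenames, where two unrelated names can sit only a few edits apart, so it may be worth scaling the limit with the length of the requested name.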

jonfroehlich commented 1 month ago

Gemini suggested using difflib:

I'd be glad to help you with that. Here's a comprehensive approach to implementing the requested functionality in Django:

1. Define a Model:

Create a Django model to store information about your PDFs, including their filenames and any relevant metadata:

from django.db import models

class PDF(models.Model):
    filename = models.CharField(max_length=255, unique=True)
    # Add other fields as needed (e.g., description, upload date)

    def __str__(self):
        return self.filename

2. Implement a Custom View:

Create a Django view that handles PDF requests and searches for the closest match:

from django.http import Http404, HttpResponseRedirect
from django.urls import reverse
from .models import PDF
import difflib

def pdf_view(request, filename):
    try:
        pdf_instance = PDF.objects.get(filename=filename)
        # Handle the case where the PDF exists and return it (see step 4)
    except PDF.DoesNotExist:
        # Search for the closest match. get_close_matches returns an empty
        # list when nothing meets the cutoff, so check before indexing.
        all_filenames = list(PDF.objects.values_list('filename', flat=True))
        matches = difflib.get_close_matches(filename, all_filenames, n=1, cutoff=0.8)

        if matches:
            # Redirect to the closest match
            return HttpResponseRedirect(reverse('pdf_view', args=[matches[0]]))

        # Handle the case where no suitable match is found
        raise Http404("PDF not found")

3. Customize the Search Logic:

Threshold: Adjust the cutoff parameter in difflib.get_close_matches to control the minimum similarity required for a match.
Matching Algorithm: Consider using other string matching algorithms like Levenshtein distance or TF-IDF if necessary.
Additional Criteria: If you have more metadata about your PDFs (e.g., tags, categories), you can incorporate them into the search logic to improve accuracy.
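
As a rough illustration of the threshold point, the sketch below runs `difflib` against two made-up filenames (assumptions for the example, not real files on the site). The helper name `closest_filename` is hypothetical. It shows the similarity ratio for a small rename and what happens when nothing clears the cutoff.

```python
import difflib

# Made-up filenames, for illustration only
filenames = ["Froehlich_Example_CHI2020.pdf", "Smith_Other_UIST2019.pdf"]


def closest_filename(requested, candidates, cutoff=0.8):
    """Return the best candidate with similarity ratio >= cutoff, or None."""
    matches = difflib.get_close_matches(requested, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


renamed = "Froehlich_Example_CHI_2020.pdf"

# ratio() is 2 * matched_chars / total_chars, so it lies between 0 and 1
ratio = difflib.SequenceMatcher(None, renamed, filenames[0]).ratio()
print(round(ratio, 3))

print(closest_filename(renamed, filenames))          # small rename still matches
print(closest_filename("unrelated.pdf", filenames))  # None: below the cutoff
```

Because the ratio is relative to string length, a cutoff like 0.8 tolerates more absolute edits in long filenames than in short ones, which is a different trade-off from a fixed edit-distance limit.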

4. Handle PDF Responses:

In the try block of the view, you can use appropriate Django mechanisms to serve the PDF content. For example:

# Assumes the PDF model also has a FileField named `file`,
# which the model in step 1 does not define
with open(pdf_instance.file.path, 'rb') as pdf:
    response = HttpResponse(pdf.read(), content_type='application/pdf')
    response['Content-Disposition'] = 'attachment; filename="%s"' % pdf_instance.filename
    return response

5. Configure URLs:

Map the view to a URL pattern in your urls.py file:

from django.urls import path
from .views import pdf_view

urlpatterns = [
    path('pdf/<str:filename>/', pdf_view, name='pdf_view'),
]

By following these steps and customizing the search logic to your specific requirements, you can effectively implement a mechanism to find and serve the closest matching PDF in Django when a requested file is not found.

jonfroehlich commented 1 month ago

This works on localhost but not on test or production. Not sure why. Still investigating.

jonfroehlich commented 1 month ago

Still working on localhost but not on test or production. One idea might be the web hosting service...

Web Server Configuration: If you are using a web server like Nginx or Apache, ensure it is not intercepting the request and serving the file directly. You might need to adjust your web server configuration to pass the request to Django.

jonfroehlich commented 1 month ago

ChatGPT suggests we might need to configure things on our Apache server. Not sure why we wouldn't have to do this locally as well (on localhost).

Great, thanks for confirming! Since you're using Apache, you'll need to ensure that your Apache configuration is set up to pass requests for media files to Django rather than serving them directly. Here are some steps to help you configure Apache:

Apache Configuration: Locate your Apache configuration file, which might be named something like 000-default.conf or your_site.conf, typically found in /etc/apache2/sites-available/.

ProxyPass Configuration: Ensure that you have a ProxyPass directive to pass requests to your Django application. Here’s an example configuration:

<VirtualHost *:80>
   ServerName yourdomain.com

   # Serve static files directly
   Alias /static /path/to/static
   <Directory /path/to/static>
       Require all granted
   </Directory>

   # Serve media files directly
   Alias /media /path/to/media
   <Directory /path/to/media>
       Require all granted
   </Directory>

   # Proxy requests to Django application
   ProxyPass / http://127.0.0.1:8000/
   ProxyPassReverse / http://127.0.0.1:8000/

   # Ensure the following modules are enabled
   # a2enmod proxy
   # a2enmod proxy_http
</VirtualHost>

Enable Required Modules: Ensure that the necessary Apache modules are enabled:
```
sudo a2enmod proxy
sudo a2enmod proxy_http
sudo systemctl restart apache2
```
Check Permissions: Make sure the media files have the correct permissions and are accessible by the Apache server.
Restart Apache: After making changes to the configuration, restart Apache to apply the changes:
```
sudo systemctl restart apache2
```

By following these steps, you should be able to configure Apache to correctly pass requests for media files to your Django application. If you encounter any issues or need further assistance, feel free to ask!

makeabilitylab / makeabilitylabwebsite

Load closest match pdf url #1173
