Closed: yulin0629 closed this 2 weeks ago
@yulin0629 I wrote this internally for my own purposes. In my testing, it handles all of my use cases. Not sure if this is helpful:
from urllib.parse import urljoin, urlparse, urlunparse, ParseResult

# Input URL example
input_url = "https://abc.com/news"

# Helper function to correctly construct full URLs
def get_full_url(input_url, partial_url):
    # Step 1: Parse the input URL to get the base URL and path
    parsed_input_url = urlparse(input_url)
    base_url = f"{parsed_input_url.scheme}://{parsed_input_url.netloc}"
    base_path = parsed_input_url.path  # This is /news for the example input

    # Case 1: If the partial URL starts with '?', keep the base path and add the query
    if partial_url.startswith('?'):
        parsed_partial = urlparse(partial_url)  # Parse the query string
        # Construct the new URL with the base path and query string
        return urlunparse(ParseResult(
            scheme=parsed_input_url.scheme,
            netloc=parsed_input_url.netloc,
            path=base_path,               # Keep the /news path
            params='',                    # No parameters in this case
            query=parsed_partial.query,   # Query from the partial URL
            fragment=''                   # No fragment
        ))

    # Case 2: For other relative or absolute paths, use urljoin
    return urljoin(base_url, partial_url)

# # Example: Find all <a> tags and get their href attributes
# # (assumes soup is a BeautifulSoup document)
# for link in soup.find_all('a', href=True):
#     partial_url = link['href']
#     full_url = get_full_url(input_url, partial_url)
#     print(f"Partial: {partial_url} => Full: {full_url}")

partial_urls = ["?page=2", "/news/abc.html", "https://abc.com/news/abc.html"]
for partial_url in partial_urls:
    full_url = get_full_url(input_url, partial_url)
    print(f"Partial: {partial_url} => Full: {full_url}")
Thank you for sharing your get_full_url implementation! I've done some thorough testing comparing it with Python's built-in urljoin, and I'd like to share my findings.
While your implementation works for many cases, I found some edge cases where it behaves differently from the standard urljoin, particularly with relative paths when the base URL ends with a slash. Here's what I discovered:
base_url = "https://example.com/news/"
# Case 1: Relative path
href = "relative/path.html?page=2"
urljoin: "https://example.com/news/relative/path.html?page=2"
get_full_url: "https://example.com/relative/path.html?page=2"
# Case 2: Current directory path
href = "./current/path.html?page=2"
urljoin: "https://example.com/news/current/path.html?page=2"
get_full_url: "https://example.com/current/path.html?page=2"
The main difference is that urljoin maintains the directory structure more accurately according to the URL specification (RFC 3986). While your implementation works for your specific use cases, I would recommend using Python's built-in urljoin for a few reasons: it follows the standard resolution rules, it correctly handles relative references such as fragments and parent-directory traversal, and it leaves less custom code to maintain.
@yulin0629 Your pull request is already merged and available in the current version. For the next version, consider using a strategy design pattern so end users can pass their own normalization function, which would provide more flexibility. I believe it's difficult to create a universal solution, but users can easily write their own strategies, especially when they need to handle site-specific details.
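As a rough sketch of that idea (the resolve_url and normalizer names here are purely illustrative, not part of this project's API):

from urllib.parse import urljoin
from typing import Callable, Optional

def resolve_url(
    base_url: str,
    href: str,
    normalizer: Optional[Callable[[str, str], str]] = None,
) -> str:
    # Default strategy: standard RFC 3986 resolution via urljoin.
    if normalizer is None:
        return urljoin(base_url, href)
    # Custom strategy: the caller decides how to resolve the URL.
    return normalizer(base_url, href)

# Example of a user-supplied strategy that drops fragments after joining.
def strip_fragment(base_url: str, href: str) -> str:
    return urljoin(base_url, href).split("#", 1)[0]

print(resolve_url("https://example.com/news/", "item.html#top"))                  # keeps the fragment
print(resolve_url("https://example.com/news/", "item.html#top", strip_fragment))  # drops it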
Replaces manual URL path resolution with Python's urljoin for more robust and standardized URL normalization. This avoids edge-case bugs and ensures compliance with the RFC 3986 URL specification.
The change simplifies maintenance while handling all relative path cases correctly, including fragments and parent directory traversal.
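For illustration, a few standard-library calls covering those cases; the expected values in the comments follow urljoin's RFC 3986 resolution rules and are not taken from this project's tests:

from urllib.parse import urljoin

base = "https://example.com/a/b/page.html"

# Fragment-only reference: keeps the current document path.
print(urljoin(base, "#section-2"))     # https://example.com/a/b/page.html#section-2

# Parent-directory traversal.
print(urljoin(base, "../other.html"))  # https://example.com/a/other.html

# Query-only reference: keeps the path, replaces the query.
print(urljoin(base, "?page=2"))        # https://example.com/a/b/page.html?page=2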
Fixes: #231