joeyism / linkedin_scraper

A library that scrapes Linkedin for user data
GNU General Public License v3.0
Fix for Appending Paths without Checking for Query Strings in URLs #193

Open XYZliang opened 1 year ago

XYZliang commented 1 year ago

While integrating with linkedin_scraper, I've come across a potential issue where paths are directly appended to URLs without checking for the presence of query strings. This leads to malformed URLs if the original URL contains a query string.

Current Behavior: When a path is appended to a URL that already has a query string, the result is a malformed URL. For example, appending details/experience to https://www.linkedin.com/in/douglas-b-b23472b/?trk=people-guest_people_search-card puts the path after the query string, so it ends up as part of the trk value, instead of producing the desired https://www.linkedin.com/in/douglas-b-b23472b/details/experience?trk=people-guest_people_search-card

Suggested Fix: Before appending the path, the package should check whether the URL contains a query string. If it does, the path should be inserted before the query string so that the query string stays at the end of the URL. Python's urllib.parse module (urlparse and urlunparse, or urlsplit and urlunsplit) can split the URL into its components and reassemble it.
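To illustrate the suggested fix, here is a minimal sketch using only the standard library. The helper name `append_path` is hypothetical and is not part of linkedin_scraper; it only shows how the path could be extended while leaving the query string in place.

```python
from urllib.parse import urlsplit, urlunsplit


def append_path(url: str, extra_path: str) -> str:
    """Append extra_path to the path component of url, keeping any query string.

    Hypothetical helper for illustration; not an existing linkedin_scraper function.
    """
    parts = urlsplit(url)
    # Normalize slashes so we never produce "//" or a missing "/" at the join.
    base = parts.path.rstrip("/")
    new_path = f"{base}/{extra_path.lstrip('/')}"
    # SplitResult is a namedtuple, so _replace swaps only the path component;
    # scheme, host, query, and fragment are carried over unchanged.
    return urlunsplit(parts._replace(path=new_path))


if __name__ == "__main__":
    profile = (
        "https://www.linkedin.com/in/douglas-b-b23472b/"
        "?trk=people-guest_people_search-card"
    )
    print(append_path(profile, "details/experience"))
```

With the URL from this issue, the sketch yields the desired form, with details/experience before the ?trk=... query string. URLs without a query string pass through the same code path, so no separate check is needed.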

Impact: This change will ensure that the URLs constructed by linkedin_scraper are always correctly formatted and valid, reducing potential issues for downstream users and systems.

I believe this fix would greatly enhance the robustness of URL handling in the package. Please let me know if more information or context is needed, and I'd be happy to help further!

mhoualla commented 11 months ago

@alicemy478 and I are interested in investigating this issue. After reviewing the latest commits, it appears that the problem is still present. We could work on a solution that checks for the presence of a query string in the URL before appending the path.