refuel-ai / autolabel

Label, clean and enrich text datasets with LLMs.
https://docs.refuel.ai/
MIT License
2.09k stars, 147 forks

Validate webpage scrape URLs #894

Closed Abhinav-Naikawadi closed 2 months ago

Abhinav-Naikawadi commented 2 months ago

Pull Review Summary

Description

Validate webpage scrape URLs with a regex before issuing the request. The motivation is to avoid making web scrape requests when the URL is obviously invalid. The regex may misclassify some URLs, but those cases should be very niche, and this change avoids unnecessary scrape requests in the many cases where the URL is clearly malformed.
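For illustration, here is a minimal sketch of this kind of check. It is not the pattern used in this PR: `URL_PATTERN`, `is_valid_url`, and `scrape_if_valid` are hypothetical names, and the `fetch` callable stands in for whatever actually performs the scrape. The pattern only accepts `http`/`https` URLs with a dotted hostname, so hosts such as `localhost` or bare IP addresses are rejected. These are examples of the niche cases a regex check can get wrong.

```python
import re

# Hypothetical pattern for illustration only; not the regex from this PR.
URL_PATTERN = re.compile(
    r"^https?://"                                             # scheme
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+"   # one or more host labels
    r"[A-Za-z]{2,63}"                                         # top-level domain
    r"(?::\d{1,5})?"                                          # optional port
    r"(?:[/?#]\S*)?$"                                         # optional path, query, fragment
)


def is_valid_url(url):
    """Return True if url looks like a well-formed http(s) URL."""
    if not isinstance(url, str):
        return False
    return URL_PATTERN.match(url.strip()) is not None


def scrape_if_valid(url, fetch):
    """Call fetch(url) only when the URL passes validation.

    fetch is an assumed callable that performs the actual scrape.
    Returns None without calling fetch when the URL is invalid.
    """
    if not is_valid_url(url):
        return None
    return fetch(url.strip())
```

Because the check runs before any network call, an input such as `"not a url"` or `"http://"` is rejected without a request being made.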

Type of change

Tests

Tested locally with common valid and invalid URLs and patterns.