Open prologic opened 3 years ago
Ping?
What's your expectations? You want colly to find URLs in plain text using some heuristics?
Yes. I want the same semantics that the colly API provides for HTML, but for plain text. Look for (perhaps using regexes?) things that might look like links to other documents.
In my case I want to scrape twtxt feeds, looking for other links of the form <nick url>. I realize this isn't exactly an HTML link per se, but could the API provide a custom search so I can define what counts as a "link" in a document?
What exactly is wrong with doing such regex heuristics in OnResponse yourself?
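For illustration, here is a minimal sketch of what such a heuristic could look like. It uses only the standard library; the pattern for `<nick url>` and the helper name `extractFeedLinks` are assumptions made for this example, not part of Colly or of any twtxt specification.

```go
package main

import (
	"fmt"
	"regexp"
)

// mentionRe matches "<nick url>" pairs in plain text.
// The exact pattern is an assumption; adjust it to the feeds you scrape.
var mentionRe = regexp.MustCompile(`<([^\s<>]+)\s+(https?://[^\s<>]+)>`)

// extractFeedLinks is a hypothetical helper that returns the unique URLs
// found in "<nick url>" mentions, in order of first appearance.
func extractFeedLinks(body []byte) []string {
	seen := make(map[string]bool)
	var urls []string
	for _, m := range mentionRe.FindAllSubmatch(body, -1) {
		u := string(m[2]) // second capture group is the URL
		if !seen[u] {
			seen[u] = true
			urls = append(urls, u)
		}
	}
	return urls
}

func main() {
	// Made-up sample feed content, for demonstration only.
	feed := []byte("2020-01-01T00:00:00Z\thello @<alice https://example.com/alice.txt>\n" +
		"2020-01-02T00:00:00Z\tcc @<bob http://example.org/twtxt.txt>\n")
	for _, u := range extractFeedLinks(feed) {
		fmt.Println(u)
	}
}
```

Inside a collector this would presumably be called from the response callback, along the lines of `c.OnResponse(func(r *colly.Response) { for _, u := range extractFeedLinks(r.Body) { r.Request.Visit(u) } })`, which keeps the site-specific pattern in user code.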
This is what I did.
Strictly speaking, Colly doesn't really define what an "HTML link" is either: it's up to the user to enqueue more URLs in their own custom c.OnHTML("a[href]", ...) handler.
I personally don't think there's merit in incorporating such site-specific heuristics into the Colly codebase, but let's see if there are other opinions about this.
I don't see an API for this besides Collector.OnResponse. Curious if it's possible to scrape plain/text files? My use case is scraping a bunch of hosted text files that may link to other text files.