nshdesai / deepdixit

Guess the prompt from a neural network generated image

Scraping themes #5

Closed nshdesai closed 2 years ago

nshdesai commented 2 years ago

The scraper notebooks look good, but I think it would be worth our time to explore the feasibility of building a slightly more general/robust scraper. Consider a script that takes as input a URL (or list of URLs) and outputs cleaned strings to be used as prompts. This will be hard to pull off for an arbitrary URL, but assuming the URL actually contains a list of such strings, the script could be told to look for a specific tag/pattern that would help extract them.

nshdesai commented 2 years ago

I think it might be a bit too soon to merge this. I was hoping @baronet2 (and possibly myself) could add the general scraping functionality mentioned above first. Once that is done, we can merge it into main.

gkysaad commented 2 years ago

oh mb, I assumed it was ready to merge since I got tagged to review. I'll revert it

baronet2 commented 2 years ago

@nshdesai I don't think it's feasible to get this general scraper function you have in mind to a reasonably useful state within a short amount of time. Each of these scrapers takes about 10 minutes start to finish, most of which is spent identifying and addressing the quirks of the particular page.

What you're imagining would probably end up looking like:

import requests
from bs4 import BeautifulSoup

def get_prompts(url, soup_to_prompts, clean_prompts):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    list_of_prompts = soup_to_prompts(soup)
    cleaned_prompts = clean_prompts(list_of_prompts)
    return cleaned_prompts

where you pass in the soup_to_prompts and clean_prompts functions. That doesn't really save much time or coding.
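
To make that concrete, here's a purely hypothetical caller for one specific page (the URL, tags, and cleanup rules below are made up for illustration). Notice that both callbacks are still page-specific code someone has to write by hand:

def example_soup_to_prompts(soup):
    # page-specific: pretend the prompts on this site live in <li> elements
    return [li.get_text(strip=True) for li in soup.find_all("li")]

def example_clean_prompts(prompts):
    # page-specific cleanup, e.g. drop empty strings and leading numbering
    return [p.lstrip("0123456789. ") for p in prompts if p]

prompts = get_prompts("https://example.com/prompt-ideas",
                      example_soup_to_prompts, example_clean_prompts)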

If you have an idea I'm missing, lmk.

nshdesai commented 2 years ago

What you're saying makes sense, but I was also thinking of providing "tag supervision" to the scraping function (e.g. look for <li> tags), which might make it more reasonable. So this would be a function that looks similar to the one above, but with a tag_pattern arg.

def get_prompts(url, tag_pattern, soup_to_prompts, clean_prompts):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    list_of_prompts = soup_to_prompts(soup, tag_pattern)
    cleaned_prompts = clean_prompts(list_of_prompts)
    return cleaned_prompts

And in soup_to_prompts,


def soup_to_prompts(soup, tag_pattern):
    matches = soup.find_all(tag_pattern)  # bs4 + regex (if needed) magic
    return [tag.get_text(strip=True) for tag in matches]

The main benefit of this is that we could potentially have a service that just consumes URLs and updates prompt records. If you think the engineering + time cost is too high, though, we don't have to do it, but FWIW my rationale was:

automated scraping over a limited set of URLs > manual scraping over arbitrary URLs
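
To illustrate what that service could look like (the config format, URLs, and storage below are placeholders, not a real design), a minimal sketch:

import json

# hypothetical config: one (url, tag_pattern) entry per source we want to ingest
SOURCES = [
    {"url": "https://example.com/prompt-list", "tag_pattern": "li"},
    {"url": "https://example.org/writing-ideas", "tag_pattern": "p"},
]

def update_prompt_records(path="prompts.json"):
    records = []
    for source in SOURCES:
        prompts = get_prompts(
            source["url"],
            source["tag_pattern"],
            soup_to_prompts,
            lambda ps: [p.strip() for p in ps if p.strip()],
        )
        records.extend({"source": source["url"], "prompt": p} for p in prompts)
    with open(path, "w") as f:
        json.dump(records, f, indent=2)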

The primary upside I see to this is that it gives us more prompt data. After a few attempts at generating datasets using Big Sleep, I am realizing that most of the prompts we already have don't produce decent images. If we want a high-quality dataset, we might have to end up discarding a lot of the prompts anyway, so it might be easier to remove the bottleneck in sourcing prompts than to ease the bottleneck in prompt-to-(good)-image throughput.

baronet2 commented 2 years ago

> The main benefit of this is that we could potentially have a service that just consumes URLs and updates prompt records.

Not really, you still have to go digging through the HTML for the right tag for each page (best-case scenario). If there's more complicated parsing, you need to do this on a case-by-case basis anyways.

> but I was also thinking of providing "tag supervision" to the scraping function

This is sounding to me like we're basically wrapping the requests stuff and renaming soup.find(). Most of the work associated with manual scraping (finding the right tags/chain of tags and post-processing) will still be manual. Is that not your impression?

> I am realizing that most of the prompts we already have don't produce decent images.

Agreed, let's discuss this elsewhere. I have some ideas. 😄