Open havishak opened 1 month ago
Hi @havishak
Thanks for this! 🎉 Yes, you definitely could do something like this (and it is on my list of things to do), but there were a few reasons I haven't done it yet:
- The `extract_data()` function doesn't work for poetry. That data is formatted in a different way (and so the `extract_data()` function will be decomposed into two functions, `extract_play()` and `extract_poem()`, first).
- `extract_data()` doesn't capture e.g. in Romeo and Juliet. I'd prefer to fix this first before running it on all datasets to make it easier to debug.

Thanks for the suggestion and code - I'll likely come back to this issue when I'm ready to run it on everything so I'll leave it open 🎉
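
As a rough illustration of how a run over everything might stay easy to debug once the split exists, here is a minimal base R sketch. It rests on assumptions: `extract_play()` and `extract_poem()` are not written yet, so the versions below are hypothetical stand-ins with guessed arguments, and the `works` table and `extract_all()` helper are invented for the example.

```r
# Hypothetical stand-ins: the real extract_play() and extract_poem() do not
# exist yet, and their arguments are assumed here.
extract_play <- function(text) {
  data.frame(type = "play", n_chars = nchar(text), stringsAsFactors = FALSE)
}
extract_poem <- function(text) {
  data.frame(type = "poem", n_chars = nchar(text), stringsAsFactors = FALSE)
}

# Made-up catalogue of works; in practice the text would come from scraping.
works <- data.frame(
  title = c("Romeo and Juliet", "Sonnet 18"),
  type = c("play", "poem"),
  text = c("Two households, both alike in dignity",
           "Shall I compare thee to a summer's day?"),
  stringsAsFactors = FALSE
)

# Process each work separately so that a failure is reported with its title
# instead of stopping the whole run.
extract_all <- function(works) {
  results <- lapply(seq_len(nrow(works)), function(i) {
    # Pick the extractor by type; an unknown type yields NULL and is caught below.
    extractor <- switch(works$type[i], play = extract_play, poem = extract_poem)
    tryCatch(
      cbind(title = works$title[i], extractor(works$text[i])),
      error = function(e) {
        message("Failed on ", works$title[i], ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop the failed works, then stack the remaining data frames.
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

all_works <- extract_all(works)
```

Because each work is wrapped in its own `tryCatch()`, a problem with one play shows up as a single message naming that play, while the rest still come back in `all_works`.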
A couple of notes:
- `delim = "\n\n"` to separate on double spaces which gets rid of the problem with "" and "The"
- `scrape(start_session)` multiple times on the same URL in one function since the data is already there.

Thanks for the notes!
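
The two notes above can be sketched in base R. This is only an illustration under assumptions: `fetch_page()` is a hypothetical stand-in for a network call such as `scrape(start_session)`, the sample text is made up, and `strsplit()` is swapped in for whichever function takes the `delim` argument in the original code. It also reads the second note as advice to fetch a URL once and reuse the result.

```r
# Hypothetical stand-in for a network call such as scrape(start_session);
# it counts how often it is called so the reuse is visible.
n_calls <- 0
fetch_page <- function() {
  n_calls <<- n_calls + 1
  "ACT I\n\nSCENE I. A public place.\n\nEnter SAMPSON and GREGORY"
}

# Fetch once, then pass the stored text to every step that needs it.
page_text <- fetch_page()

# Split the text wherever two consecutive newlines (a blank line) occur.
# fixed = TRUE treats "\n\n" as a literal string rather than a regex.
split_blocks <- function(text) {
  strsplit(text, "\n\n", fixed = TRUE)[[1]]
}

blocks <- split_blocks(page_text)
first_block <- blocks[1]
n_blocks <- length(blocks)
```

Both `first_block` and `n_blocks` are derived from the single stored `page_text`, so `fetch_page()` runs only once however many values are pulled from the page.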
Hi @nrennie,
This is a cool dataset, good work! As I was reviewing the code, I wondered if you could loop through the plays to scrape all together. Something like this -