laughingclouds / Scrapia-World

A web scraper for wuxiaworld. Written in Python, using Selenium and Python's cmd module for an interactive shell experience, with a command-line utility to work with text and a database to store information.
MIT License

Create a novel "profile" instead #10

Closed laughingclouds closed 2 years ago

laughingclouds commented 2 years ago

Rather than going to the novel page and then searching for the chapter to click every time the script is run, we can save the links to every chapter of the novel.

We can go to the required chapter using that link after logging in.

From then, we can simply click the "next" button and keep track of the current_chapter.

For this, we need to first create a "profile" of the novel to scrape.
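
One way such a profile could be stored is as a small JSON file holding the chapter links in reading order. The sketch below is only an illustration: the function names and the file layout are assumptions, not what the repository actually does.

```python
import json
from pathlib import Path


def save_profile(path: Path, novel_name: str, chapter_links: list[str]) -> None:
    """Write a hypothetical novel profile: its name plus chapter links in reading order."""
    profile = {"novel": novel_name, "chapters": chapter_links}
    path.write_text(json.dumps(profile, indent=2), encoding="utf-8")


def load_profile(path: Path) -> dict:
    """Read a profile previously written by save_profile."""
    return json.loads(path.read_text(encoding="utf-8"))


def chapter_link(profile: dict, current_chapter: int) -> str:
    """Return the saved link for a chapter; index 0 is the prologue/chapter 0."""
    return profile["chapters"][current_chapter]
```

After logging in, something like `driver.get(chapter_link(profile, current_chapter))` would then jump straight to the required chapter.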

To do that, we first need to open the accordion for every chapter.

This can be done by finding all the accordion div elements.

Find them using

driver.find_elements(By.XPATH, "//div[contains(@class, 'grid') and contains(@class, 'grid-cols-1') and contains(@class, 'md:grid-cols-2') and contains(@class, 'w-full')]")

for a

<div class="grid grid-cols-1 md:grid-cols-2 w-full"></div>

element
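
Note that `contains(@class, 'grid')` is a substring test, so it also matches class names such as `grid-cols-1`. If that ever picks up unwanted elements, a stricter whole-token match can be built with the usual `concat`/`normalize-space` XPath idiom. The helper below is a hypothetical sketch (`xpath_for_classes` is not part of the project):

```python
def xpath_for_classes(tag: str, classes: list[str]) -> str:
    """Build an XPath matching <tag> elements that carry every class as a whole token."""
    conditions = [
        f"contains(concat(' ', normalize-space(@class), ' '), ' {cls} ')"
        for cls in classes
    ]
    return f"//{tag}[" + " and ".join(conditions) + "]"


# XPath for the chapter-list div shown above.
CHAPTER_GRID_XPATH = xpath_for_classes(
    "div", ["grid", "grid-cols-1", "md:grid-cols-2", "w-full"]
)
```

The resulting string can be passed to `driver.find_elements(By.XPATH, CHAPTER_GRID_XPATH)` in place of the hand-written expression.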

We might need to open the accordions as well

let spanList = document.getElementsByTagName("span");
for (let span of spanList){
  if (span.innerText.startsWith("Volume")){
    span.click();
  }
}
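
The same thing could probably be done from the Python side instead of running JavaScript in the page. A rough sketch, assuming the accordion headers are `span` elements whose visible text starts with "Volume" (as in the snippet above); `open_volume_accordions` is a made-up name, and `"tag name"` is the string value behind Selenium's `By.TAG_NAME`:

```python
def open_volume_accordions(driver) -> int:
    """Click every <span> whose visible text starts with "Volume".

    `driver` is expected to be a Selenium WebDriver (anything with a
    compatible find_elements method works). Returns the number of clicks.
    """
    clicked = 0
    # "tag name" is the value of selenium's By.TAG_NAME constant.
    for span in driver.find_elements("tag name", "span"):
        if span.text.startswith("Volume"):
            span.click()
            clicked += 1
    return clicked
```

If clicking re-renders the list, the remaining elements may go stale and would need to be looked up again.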

Each of these elements contains the hrefs to the chapters of the novel. Store them in order, so that each link's index matches its chapter.

laughingclouds commented 2 years ago

Alright, it seems we won't need to open the accordion. Once we find the div elements that have the chapter links, we can pass the list to this function

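from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement

def get_hrefList(divList: list[WebElement]) -> list[str]:
    # Collect the href of every <a> inside each chapter-list div, in page order.
    hrefList = []
    for divElement in divList:
        aList: list[WebElement] = divElement.find_elements(By.TAG_NAME, "a")
        hrefList.extend([a.get_attribute("href") for a in aList])
    return hrefList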

Also, we need to switch the chapter sort order to "Oldest", because then the links will be saved at the correct index, i.e., prologue/chapter 0 at index 0, chapter 1 at index 1, and so on. [Too much trouble; we can simply reverse the list.]
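
Reversing is a one-liner. A small sketch, assuming the page lists chapters newest first and that there are no gaps in the list; `to_reading_order` and the example links are illustrative only:

```python
def to_reading_order(hrefs_newest_first: list[str]) -> list[str]:
    """Return the links oldest first, so that list index == chapter number."""
    return hrefs_newest_first[::-1]


links = to_reading_order(["/novel/chapter-2", "/novel/chapter-1", "/novel/chapter-0"])
current_chapter = 1
print(links[current_chapter])  # /novel/chapter-1
```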

laughingclouds commented 2 years ago

Done https://github.com/laughingclouds/Scrapia-World/issues/4#issuecomment-1030679387

In the end, I had to open the accordions for the script to work. But yes, it's done. At least the part that creates the profile is done. There's more work to do with the profiler.

Done in commit 4c06df835ca5e424ae1bbcc2ab67e80f6a22c778.