laughingclouds / Scrapia-World

A web scraper for scraping wuxiaworld. Written in python, using selenium and python cmd for an interactive shell experience with a command line utility to work with text along with a database to store information.
MIT License

Formatting the saved pages #16

Open laughingclouds opened 2 years ago

laughingclouds commented 2 years ago

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context None

laughingclouds commented 2 years ago

I tried bs4 a little bit. There are many ways of separating the content from the rest of the html document.

One way might be

# p.text is the text within a paragraph tag
for p in soup.find_all("p"):
    print(p.text)

But when I ran this against a document, the output was garbled. I checked the doc and there were too many '\n' characters within the paragraphs.
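A small reproduction of the problem, as a sketch. The sample markup is made up for illustration and is not taken from a saved page. Printing the `repr` of `p.text` makes the embedded newlines and indentation visible, and splitting on whitespace is one way to collapse them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a paragraph that is hard-wrapped in the HTML source.
sample_html = """<p>The young master
    stepped forward and
    drew his sword.</p>"""

soup = BeautifulSoup(sample_html, "html.parser")
raw_text = soup.find("p").text

# repr() exposes the newlines and indentation kept inside the paragraph.
print(repr(raw_text))

# Splitting on whitespace and re-joining collapses them into single spaces.
clean_text = " ".join(raw_text.split())
print(clean_text)
```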

What we could do is format the text within every paragraph. We would keep only the tags we want and insert them all into an HTML template.

I was also thinking of storing that "template" HTML code, along with its style rules, in a separate place.
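A minimal sketch of the template idea, assuming the paragraphs have already been cleaned up. `STYLE_RULES`, `PAGE_TEMPLATE` and `build_page` are hypothetical names, and the CSS is only a placeholder. It uses the standard library's `string.Template` and `html.escape`, so nothing here depends on bs4.

```python
from html import escape
from string import Template
from typing import Iterable

# Hypothetical style rules and page template. In practice these could live in
# their own files and be read once at startup.
STYLE_RULES = "body { max-width: 40em; margin: auto; line-height: 1.6; }"
PAGE_TEMPLATE = Template(
    "<!DOCTYPE html>\n"
    "<html>\n<head>\n<meta charset=\"utf-8\">\n<title>$title</title>\n"
    "<style>$style</style>\n</head>\n<body>\n$body\n</body>\n</html>\n"
)


def build_page(title: str, paragraphs: Iterable[str]) -> str:
    """Wrap cleaned paragraph strings in <p> tags and fill the template."""
    # escape() keeps characters such as & and < from breaking the markup.
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return PAGE_TEMPLATE.substitute(
        title=escape(title), style=STYLE_RULES, body=body
    )


page = build_page("Chapter 1", ["First paragraph.", "Second & last paragraph."])
print(page)
```

Keeping the template and the style rules as separate values means either one can be changed without touching the scraping code.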

laughingclouds commented 2 years ago

This piece of code does a good job of fixing the text formatting, though it still needs improvement.

from bs4 import BeautifulSoup

def fixLine(lineText: str) -> str:
    """lineText is a single line of a paragraph"""
    # str.split() with no arguments splits on any run of whitespace and
    # never yields empty or blank strings, so no extra filtering is needed.
    return " ".join(lineText.split())

def fixPara(pText: str) -> str:
    """pText is text within a paragraph tag"""
    # Newlines count as whitespace, so this also joins wrapped lines.
    return " ".join(pText.split())

fName = "HTML_FILE_NAME"
with open(fName) as fp:
    soup = BeautifulSoup(fp, "html.parser")
# One cleaned paragraph per line, with no trailing newline
s = "\n".join(fixPara(p.text) for p in soup.find_all("p"))
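One possible improvement, sketched under assumptions: drop paragraphs that are empty once the whitespace is collapsed, so the output has no blank lines. `clean_paragraphs` is a hypothetical helper, and the input strings below only imitate what `p.text` might return.

```python
from typing import Iterable


def clean_paragraphs(raw_paragraphs: Iterable[str]) -> str:
    """Collapse whitespace in each paragraph and drop ones that end up empty."""
    cleaned = (" ".join(text.split()) for text in raw_paragraphs)
    return "\n".join(text for text in cleaned if text)


# Made-up strings shaped like the text of three <p> tags.
raw = ["He  nodded\n   slowly.", "\n   \n", "\u00a0Then he left.\n"]
result = clean_paragraphs(raw)
print(result)
```

With bs4 this could be called as `clean_paragraphs(p.text for p in soup.find_all("p"))`.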