Open laughingclouds opened 2 years ago
I tried bs4 a little bit. There are many ways of separating the content from the rest of the html document.
One way might be
# p.text is the text inside a paragraph tag
for p in soup.find_all("p"):
    print(p.text)
But when I ran this against a document, the output was garbled. I checked the document, and the paragraphs contained a lot of stray '\n' characters.
What we could do is clean up the text within every paragraph, save the tags we want, and insert them all into an HTML template.
I was also thinking of storing that "template" HTML code, along with its style rules, in a separate place.
This piece of code does a good job of dealing with the text formatting, though it still needs improvements.
from bs4 import BeautifulSoup


def fixLine(lineText: str) -> str:
    """lineText is a single line of a paragraph"""
    # split() with no arguments already drops every run of whitespace
    return " ".join(lineText.split())


def fixPara(pText: str) -> str:
    """pText is text within a paragraph tag"""
    # Collapse newlines and repeated spaces into single spaces
    return " ".join(pText.split())


fName = "HTML_FILE_NAME"
with open(fName) as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Build one cleaned paragraph per line
s = ""
for p in soup.find_all("p"):
    s += fixPara(p.text) + '\n'
s = s.rstrip('\n')
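The template idea mentioned earlier could be sketched roughly as follows. This is only an illustration using the standard library, not a settled design: the names `PAGE_TEMPLATE`, `STYLE_RULES`, and `render_page` are hypothetical, and the template and style rules are kept as module-level strings here, though they could just as well live in separate files. It assumes the paragraphs have already been cleaned into plain strings (for example by `fixPara`).

```python
from html import escape
from string import Template
from typing import Iterable

# Hypothetical style rules; in practice these might live in their own .css file.
STYLE_RULES = "p { line-height: 1.5; }"

# Hypothetical page template; $style and $body are filled in by render_page.
PAGE_TEMPLATE = Template(
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head><style>$style</style></head>\n"
    "<body>\n$body\n</body>\n"
    "</html>\n"
)


def render_page(paragraphs: Iterable[str]) -> str:
    """Wrap already-cleaned paragraph strings in <p> tags and fill the template."""
    # Escape each paragraph so characters like < and & stay valid HTML text,
    # and skip paragraphs that are empty after cleaning.
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs if text)
    return PAGE_TEMPLATE.substitute(style=STYLE_RULES, body=body)


if __name__ == "__main__":
    print(render_page(["First paragraph.", "Tom & Jerry"]))
```

Keeping the template and the style rules apart from the extraction code means either can be changed without touching the parsing logic.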