Open packetfocus opened 6 years ago
I think this is mostly fixed. Added a class and function to strip the tags and used them in place of the chained string.replace() calls.
```python
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.fed = []  # text chunks found between tags

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

html_string = 'The SQL Injection attacks against the hosts Host 1 .'
tag = strip_tags(html_string)
print(tag)
```
Need to revise the way the data is filtered. It comes from the Burp XML, so it has a bunch of HTML tags in it.
So it looks like this:
The <b> SQL Injection </b> attacks against the hosts </ul> Host 1 </ul>.
The current filtering was pieced together one replace at a time by looking at the output in the Word doc. Probably need a better approach, such as a class/function that filters out the HTML tags.
Then need to address spacing between paragraphs. It would save even more time if the paragraphs came out with blank lines between them.
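One possible way to get the paragraph spacing, sketched with the standard library only: treat block-level tags as paragraph boundaries while stripping everything else. The names `ParagraphStripper`, `split_paragraphs` and `BLOCK_TAGS` are made up for this example, and the set of tags is a guess at what the Burp output contains, not something confirmed from the real XML.

```python
from html.parser import HTMLParser

# Assumption: these tags should start a new paragraph; adjust to the real Burp output.
BLOCK_TAGS = {"p", "br", "ul", "ol", "li", "div"}

class ParagraphStripper(HTMLParser):
    """Strip tags, but treat block-level tags as paragraph boundaries."""

    def __init__(self):
        super().__init__()
        self.paragraphs = [[]]  # each paragraph is a list of text chunks

    def _maybe_break(self, tag):
        # Start a new paragraph only if the current one already has content.
        if tag in BLOCK_TAGS and self.paragraphs[-1]:
            self.paragraphs.append([])

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.paragraphs[-1].append(data)

    def get_paragraphs(self):
        # Collapse runs of whitespace and drop paragraphs that end up empty.
        joined = (" ".join("".join(chunks).split()) for chunks in self.paragraphs)
        return [text for text in joined if text]

def split_paragraphs(html):
    parser = ParagraphStripper()
    parser.feed(html)
    parser.close()
    return parser.get_paragraphs()

sample = "<p>The <b>SQL Injection</b> attacks.</p><ul><li>Host 1</li><li>Host 2</li></ul>"
print("\n\n".join(split_paragraphs(sample)))
```

Joining with `"\n\n"` gives a blank line between paragraphs in plain text; for the Word doc, each item of the list could instead become its own paragraph.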
To format the paragraphs once the HTML is stripped properly, logic needs to be integrated with python-docx to map the tags to formatting actions such as bolding, spacing, and links.
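A rough sketch of how the bolding part might work, assuming only `<b>` and `<strong>` need handling: instead of throwing the tags away, record each piece of text together with a bold flag. `RunCollector` and `html_to_runs` are hypothetical names for this example; spacing and links are not covered here.

```python
from html.parser import HTMLParser

class RunCollector(HTMLParser):
    """Collect (text, is_bold) pairs so bold formatting survives tag stripping."""

    def __init__(self):
        super().__init__()
        self.runs = []
        self.bold_depth = 0  # > 0 while inside <b> or <strong>

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags in messy input.
        if tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        if data:
            self.runs.append((data, self.bold_depth > 0))

def html_to_runs(html):
    parser = RunCollector()
    parser.feed(html)
    parser.close()
    return parser.runs

runs = html_to_runs("The <b>SQL Injection</b> attacks against Host 1.")
print(runs)
```

With python-docx, each pair could then be written into a paragraph with something like `run = paragraph.add_run(text)` followed by `run.bold = is_bold`, where `paragraph` comes from `document.add_paragraph()`.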