packetfocus commented 6 years ago

Need to revise the entire way the data is filtered. It comes from the Burp XML, so it has a bunch of HTML tags in it.

So looks like this:

The <b> SQL Injection </b> attacks against the hosts </ul> Host 1 </ul>.

The current filtering was pieced together one replace at a time by looking at the output in the Word doc. Probably need a better approach, like a class/function that filters out the HTML tags.

Then need to address spacing in between paragraphs. Would save even more time if the paragraphs had spaces between them.
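
A rough sketch of how the paragraph breaks could be kept while the tags are stripped. It assumes the Burp text marks paragraphs with block-level tags such as `<p>`, `<br>`, `<ul>` and `<li>`; the `BLOCK_TAGS` set, the `ParagraphSplitter` class and `split_paragraphs()` are hypothetical names for illustration, not part of BurpParser.

```python
from html.parser import HTMLParser

# Assumption: these tags mark paragraph boundaries in the Burp issue text.
BLOCK_TAGS = {"p", "br", "ul", "ol", "li", "div"}


class ParagraphSplitter(HTMLParser):
    """Hypothetical helper: strip tags but remember where paragraphs break."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs
        self._current = []     # text pieces of the paragraph being built

    def _flush(self):
        # Collapse runs of whitespace and keep the paragraph if it is non-empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        # Let the parser emit any buffered text, then store the last paragraph.
        super().close()
        self._flush()


def split_paragraphs(html):
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = "<p>The <b>SQL Injection</b> attacks.</p><p>Affected hosts:</p><ul><li>Host 1</li></ul>"
# Join with a blank line so the paragraphs are visibly separated.
print("\n\n".join(split_paragraphs(sample)))
```

Each entry in the returned list could then be written as its own paragraph in the Word doc, which would give the spacing for free.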

To format the paragraphs once the HTML is stripped properly, logic needs to be integrated with python-docx to identify formatting such as bold text, spacing, links, etc.
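
One possible way to carry the formatting over, sketched with the standard library only: instead of throwing the tags away, record for each piece of text whether it was inside `<b>`/`<strong>` and whether it was inside a link. `RunCollector` and `collect_runs()` are hypothetical names, and the example URL is made up.

```python
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Hypothetical helper: turn inline HTML into (text, bold, href) segments."""

    def __init__(self):
        super().__init__()
        self.runs = []      # list of (text, is_bold, href_or_None)
        self._bold = 0      # nesting depth of <b>/<strong>
        self._href = None   # target of the enclosing <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._bold += 1
        elif tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._bold:
            self._bold -= 1
        elif tag == "a":
            self._href = None

    def handle_data(self, data):
        if data:
            self.runs.append((data, self._bold > 0, self._href))


def collect_runs(html):
    parser = RunCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.runs


example = 'The <b>SQL Injection</b> attacks, see <a href="https://example.com">details</a>.'
for text, bold, href in collect_runs(example):
    print(repr(text), bold, href)
```

With python-docx, each segment could then become a run, e.g. `run = paragraph.add_run(text)` followed by `run.bold = bold`. As far as I know python-docx has no one-line call for creating a hyperlink, so the `href` is only carried along here and would need separate handling.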

packetfocus commented 6 years ago

Think this is mostly fixed. Added in the class/function to strip tags and replaced the string.replace().

```python
from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser


class MLStripper(HTMLParser):
    """Collect only the text content of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.reset()
        self.fed = []

    def handle_data(self, d):
        # Called by the parser for each run of text between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


html_string = 'The <b> SQL Injection </b> attacks against the hosts </ul> Host 1 </ul>.'
tag = strip_tags(html_string)
print(tag)
```

packetfocus commented 6 years ago

Merged this change into Master from Develop and created a new release.

packetfocus / BurpParser

Formatting of paragraphs. #1
