SGMLParser processing <tr> which include two <a> will have problem

56fc99a9-0679-42bb-a902-02cdc41bf2b9 commented 12 years ago

BPO	16513
Nosy	@ezio-melotti
Files	testbug.py: test python file issue16513.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['invalid', 'type-bug', 'library']
title = 'SGMLParser processing which include two will have problem'
updated_at =
user = 'https://bugs.python.org/moonflow'
```

bugs.python.org fields:
```python
activity =
actor = 'ezio.melotti'
assignee = 'none'
closed = True
closed_date =
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation =
creator = 'moonflow'
dependencies = []
files = ['28049', '28050']
hgrepos = []
issue_num = 16513
keywords = []
message_count = 5.0
messages = ['175990', '175992', '175993', '175994', '175995']
nosy_count = 2.0
nosy_names = ['ezio.melotti', 'moonflow']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue16513'
versions = ['Python 2.7']
```

56fc99a9-0679-42bb-a902-02cdc41bf2b9 commented 12 years ago

If a \<tr> includes two or more \<a> elements, SGMLParser has a problem processing it.

For example:
    <tr>
    <td align="center" valign="top" nowrap>
    <script language="Javascript">
    <!--
      if ( 4 == 4 ) document.write("<strong class=\"Critical small\">Critical</strong>");
      if ( 4 == 3 ) document.write("<strong class=\"High small\">High</strong>");
      if ( 4 == 2 ) document.write("<strong class=\"Medium small\">Medium</strong>");
      if ( 4 == 1 ) document.write("<strong class=\"Low small\">Low</strong>");
    //--> 
    </script>
    </td>
    <td valign="top" align="center" nowrap>
    <small><script type="text/javascript">document.write(FormatDate("%d-%b-%y", "2012", "11", "18"));</script></small>
    </td>
    <td valign="top" align="center" nowrap><small>
    <a title="CPAI-2012-809" style="text-transform:uppercase" href="2012/cpai-08-nov.html">
    CPAI-2012-809</a></small>
    </td>
    <td valign="top" nowrap align="center"><small>
    <a target="_blank" href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089">CVE-2011-2089</a><br /></small>
    </td>
    <td valign="top"><small>SCADA ICONICS WebHMI ActiveX Stack Overflow (2011-2089)</small></td>
    </tr>

    def start_a(self, attrs):
        if self.is_td:
            cve_href = [v for k, v in attrs if k == "target" and v == "_blank"]
            if cve_href:
                self.is_a = True
                self.is_cve = True

            # SGMLParser may have a bug: a <tr> with two <a> causes a problem
            vul_href = [v for k, v in attrs if k == "style"]
            print vul_href
            if vul_href:
                vul_href = "".join([v for k, v in attrs if k == "href"])
                if vul_href.find("cve") == -1:
                    self.href_name = vul_href
            else:
                self.href_name = ""

Here, print vul_href prints nothing. Is this expected?

ezio-melotti commented 12 years ago

Have you tried with HTMLParser? sgmllib is deprecated and has been removed in Python 3. HTMLParser is also much better at parsing (broken) HTML.
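As a rough illustration of that suggestion, here is a minimal sketch of an HTMLParser subclass fed a trimmed copy of the row from the report. The class name `AnchorCollector` and the `ROW` constant are hypothetical and not taken from the attached files; the import falls back to the Python 2.7 module name, since the issue targets 2.7.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

# Trimmed copy of the two link cells from the reported <tr> (assumption:
# the surrounding cells are not needed to show the behavior).
ROW = """
<tr>
<td><small>
<a title="CPAI-2012-809" style="text-transform:uppercase" href="2012/cpai-08-nov.html">
CPAI-2012-809</a></small></td>
<td><small>
<a target="_blank" href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089">CVE-2011-2089</a><br /></small></td>
</tr>
"""


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            self.anchors.append(dict(attrs))


collector = AnchorCollector()
collector.feed(ROW)
collector.close()
for anchor in collector.anchors:
    print(anchor.get("href"))
```

Both \<a> start tags in the row reach `handle_starttag`, each with its own attribute list, so the first carries "style" and the second carries "target".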

56fc99a9-0679-42bb-a902-02cdc41bf2b9 commented 12 years ago

I haven't tried it. Would the problem not occur with HTMLParser?

ezio-melotti commented 12 years ago

If what you are trying to do is extract the link(s) that contain 'cve', you can try the attached script.

ezio-melotti commented 12 years ago

Sorry, I misread your code; it looks like you want the href *without* 'cve'. In that case, change my code to use "'cve' not in attrs['href']" (also, avoid using s.find('cve') == -1, and use the more readable and idiomatic 'cve' not in s).
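The membership test can be sketched on its own, independent of any parser. The list name `hrefs` is a hypothetical stand-in holding the two href values from the reported row.

```python
# The two href values from the reported HTML row.
hrefs = [
    "2012/cpai-08-nov.html",
    "http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089",
]

# Same result as testing href.find("cve") == -1, written with the
# membership operator instead.
non_cve = [href for href in hrefs if "cve" not in href]
print(non_cve)
```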

I think your original script doesn't work for two reasons: 1) you are looking for a table with class="tablesorter", but in the HTML the table doesn't have that class, so self.is_table is never set to True; 2) you are finding the href of the \<a> with a "style" attribute and correctly assigning it to self.href_name, but the value is then replaced by "" when the following \<a> without "style" is found.
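The second reason can be reproduced in isolation. The sketch below is an assumption-laden reduction, not the original script: `LinkParser`, `parse`, and the `reset_on_plain_link` flag are hypothetical names, HTMLParser stands in for sgmllib, and only the two \<a> elements are fed to the parser.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

# Only the two links from the reported row, in document order.
LINKS = (
    '<a style="text-transform:uppercase" href="2012/cpai-08-nov.html">'
    'CPAI-2012-809</a>'
    '<a target="_blank" '
    'href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089">'
    'CVE-2011-2089</a>'
)


class LinkParser(HTMLParser):
    """Hypothetical parser; reset_on_plain_link toggles the reported logic."""

    def __init__(self, reset_on_plain_link):
        HTMLParser.__init__(self)
        self.reset_on_plain_link = reset_on_plain_link
        self.href_name = ""

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "style" in attrs:
            href = attrs.get("href", "")
            if "cve" not in href:
                self.href_name = href
        elif self.reset_on_plain_link:
            # Mirrors the else branch of the reported start_a: the next <a>
            # without "style" overwrites the value that was just stored.
            self.href_name = ""


def parse(reset_on_plain_link):
    parser = LinkParser(reset_on_plain_link)
    parser.feed(LINKS)
    parser.close()
    return parser.href_name


print(repr(parse(True)))   # with the reset, as in the reported code
print(repr(parse(False)))  # without the reset
```

With the reset in place the stored href is gone by the time parsing finishes; dropping the else branch (or resetting only at the start of each row) keeps it.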

That said, I still suggest that you abandon sgmllib and use HTMLParser, or possibly an external module like BeautifulSoup or lxml.

python / cpython

SGMLParser processing <tr> which include two <a> will have problem #60717