python / cpython

The Python programming language
https://www.python.org

SGMLParser processing <tr> which include two <a> will have problem #60717

Closed 56fc99a9-0679-42bb-a902-02cdc41bf2b9 closed 12 years ago

56fc99a9-0679-42bb-a902-02cdc41bf2b9 commented 12 years ago
BPO 16513
Nosy @ezio-melotti
Files
  • testbug.py: test python file
  • issue16513.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['invalid', 'type-bug', 'library']
    title = 'SGMLParser processing <tr> which include two <a> will have problem'
    updated_at =
    user = 'https://bugs.python.org/moonflow'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'ezio.melotti'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation =
    creator = 'moonflow'
    dependencies = []
    files = ['28049', '28050']
    hgrepos = []
    issue_num = 16513
    keywords = []
    message_count = 5.0
    messages = ['175990', '175992', '175993', '175994', '175995']
    nosy_count = 2.0
    nosy_names = ['ezio.melotti', 'moonflow']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue16513'
    versions = ['Python 2.7']
    ```

    56fc99a9-0679-42bb-a902-02cdc41bf2b9 commented 12 years ago

    If a \<tr> includes two or more \<a> elements, SGMLParser processing has a problem.

    For example:
        <tr>
        <td align="center" valign="top" nowrap>
        <script language="Javascript">
        <!--
          if ( 4 == 4 ) document.write("<strong class=\"Critical small\">Critical</strong>");
          if ( 4 == 3 ) document.write("<strong class=\"High small\">High</strong>");
          if ( 4 == 2 ) document.write("<strong class=\"Medium small\">Medium</strong>");
          if ( 4 == 1 ) document.write("<strong class=\"Low small\">Low</strong>");
        //--> 
        </script>
        </td>
        <td valign="top" align="center" nowrap>
        <small><script type="text/javascript">document.write(FormatDate("%d-%b-%y", "2012", "11", "18"));</script></small>
        </td>
        <td valign="top" align="center" nowrap><small>
        <a title="CPAI-2012-809" style="text-transform:uppercase" href="2012/cpai-08-nov.html">
        CPAI-2012-809</a></small>
        </td>
        <td valign="top" nowrap align="center"><small>
        <a target="_blank" href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089">CVE-2011-2089</a><br /></small>
        </td>
        <td valign="top"><small>SCADA ICONICS WebHMI ActiveX Stack Overflow (2011-2089)</small></td>
        </tr>
    
        def start_a(self, attrs):
            if self.is_td:       
                cve_href = [v for k, v in attrs if k == "target" and v == "_blank"]
                if cve_href:
                    self.is_a = True
                    self.is_cve = True
    
                # Possible SGMLParser bug: a <tr> with two <a> elements causes a problem
                vul_href = [v for k, v in attrs if k == "style"]
                print vul_href
                if vul_href:
                    vul_href = "".join([v for k, v in attrs if k == "href"])
                    if vul_href.find("cve") == -1:
                        self.href_name = vul_href     
                else:
                    self.href_name = ""

    Here, print vul_href prints nothing. Is that expected?

    ezio-melotti commented 12 years ago

    Have you tried with HTMLParser? sgmllib is deprecated and has been removed in Python 3. HTMLParser is also much better at parsing (broken) HTML.
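A minimal sketch of the HTMLParser approach suggested above, assuming Python 3, where the module is `html.parser` (in Python 2.7 it is named `HTMLParser`). The class name `LinkCollector` and the trimmed sample row are illustrative only and are not taken from the attached files.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Trimmed-down version of the row from the report: one <tr>, two <a>.
ROW = """
<tr>
<td><small><a title="CPAI-2012-809" style="text-transform:uppercase" href="2012/cpai-08-nov.html">
CPAI-2012-809</a></small></td>
<td><small><a target="_blank" href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089">CVE-2011-2089</a><br /></small></td>
</tr>
"""

collector = LinkCollector()
collector.feed(ROW)
collector.close()
print(collector.hrefs)
```

Both anchors in the row reach `handle_starttag`, so any filtering can then be done on the collected list.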

    56fc99a9-0679-42bb-a902-02cdc41bf2b9 commented 12 years ago

    I haven't tried it. Will the problem not occur with HTMLParser?

    ezio-melotti commented 12 years ago

    If what you are trying to do is extract the link(s) that contain 'cve', you can try the attached script.

    ezio-melotti commented 12 years ago

    Sorry, I misread your code; it looks like you want the href *without* 'cve'. In that case, change my code to use "'cve' not in attrs['href']" (also, avoid s.find('cve') == -1 and use the more readable and idiomatic 'cve' not in s).
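The membership test mentioned above could be applied roughly as follows. This is a sketch, not the attached script: `hrefs_without_cve` is a hypothetical helper, and it assumes the attributes arrive as a list of (name, value) pairs that is converted to a dict so that `attrs['href']` works.

```python
def hrefs_without_cve(attr_lists):
    """Return the href values that do not contain 'cve' (hypothetical helper)."""
    result = []
    for attrs in attr_lists:
        attrs = dict(attrs)  # [(name, value), ...] -> {name: value}
        # Idiomatic membership test instead of attrs["href"].find("cve") == -1
        if "href" in attrs and "cve" not in attrs["href"]:
            result.append(attrs["href"])
    return result


# Attribute lists for the two <a> tags in the reported row.
anchors = [
    [("title", "CPAI-2012-809"),
     ("style", "text-transform:uppercase"),
     ("href", "2012/cpai-08-nov.html")],
    [("target", "_blank"),
     ("href", "http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089")],
]

print(hrefs_without_cve(anchors))
```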

    I think your original script doesn't work for two reasons: 1) you are looking for a table with class="tablesorter", but in the HTML the table doesn't have that class, so self.is_table is never set to True; 2) you are finding the href of the \<a> with a "style" attribute and correctly assigning it to self.href_name, but the value is then replaced by "" when the following \<a> without "style" is found.
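The second reason can be reproduced without any parser. The sketch below models only the quoted `start_a` fragment; the `RowState` class and its `reset_on_other_links` flag are illustrative names, and dropping the reset is just one possible fix, assumed here for demonstration.

```python
class RowState:
    """Stand-in for the reporter's parser state (names are illustrative)."""

    def __init__(self, reset_on_other_links):
        self.reset_on_other_links = reset_on_other_links
        self.href_name = ""

    def start_a(self, attrs):
        attrs = dict(attrs)
        if "style" in attrs:
            href = attrs.get("href", "")
            if "cve" not in href:
                self.href_name = href
        elif self.reset_on_other_links:
            # Behaviour of the quoted code: an <a> without "style" clears the value.
            self.href_name = ""


STYLED = [("style", "text-transform:uppercase"), ("href", "2012/cpai-08-nov.html")]
CVE = [("target", "_blank"),
       ("href", "http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2011-2089")]

buggy = RowState(reset_on_other_links=True)
fixed = RowState(reset_on_other_links=False)
for state in (buggy, fixed):
    state.start_a(STYLED)  # first <a> in the row sets href_name
    state.start_a(CVE)     # second <a> in the row has no "style"

print(repr(buggy.href_name))
print(repr(fixed.href_name))
```

With the reset in place, the value stored for the first anchor is lost as soon as the second anchor in the same row is seen.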

    That said, I still suggest that you abandon sgmllib and use HTMLParser, or possibly an external module like BeautifulSoup or lxml.