Sorry, I misread your code; it looks like you want the href *without* 'cve'.
In that case, change my code to use "'cve' not in attrs['href']". Also, avoid s.find('cve') == -1 and use the more readable and idiomatic 'cve' not in s instead.
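As a small illustration of the two spellings, here is a sketch in which the href values are made up for the example. Both filters select the same items; the membership test just states the intent more directly.

```python
# Hypothetical sample values, not taken from the reporter's page.
hrefs = ["/vuln/cve-2012-0001", "/vuln/detail/1234"]

# Idiomatic membership test: keep hrefs that do not contain 'cve'.
without_cve = [h for h in hrefs if 'cve' not in h]

# Equivalent, but less readable: str.find returns -1 when nothing is found.
without_cve_find = [h for h in hrefs if h.find('cve') == -1]

print(without_cve)
```

Note that both tests are case-sensitive, so an href containing only 'CVE' in upper case would not be filtered out.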
I think your original script doesn't work for two reasons:
1) you are looking for a table with class="tablesorter", but in the HTML the table doesn't have that class, so self.is_table is never set to True;
2) you are finding the href of the \<a> with a "style" attribute and correctly setting it to self.href_name, but the value is then replaced by "" when the following \<a> without "style" is found.
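The second point can be reproduced in isolation. The following is only a sketch of the pattern, not the reporter's script: it uses Python 3's html.parser instead of sgmllib, and the HTML row and the class names (Overwriting, Keeping) are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical row with two <a> elements; only the first has "style".
ROW = '<tr><td><a style="x" href="/first">a</a><a href="/second">b</a></td></tr>'


class Overwriting(HTMLParser):
    """Mimics the described pattern: every <a> assigns href_name."""

    def __init__(self):
        super().__init__()
        self.href_name = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if "style" in attrs:
                self.href_name = attrs.get("href", "")
            else:
                # This wipes the value stored by the previous <a>.
                self.href_name = ""


class Keeping(HTMLParser):
    """Same logic, but an <a> without "style" leaves href_name alone."""

    def __init__(self):
        super().__init__()
        self.href_name = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if "style" in attrs:
                self.href_name = attrs.get("href", "")


def parse(parser_class):
    # Feed the sample row and return whatever href_name ends up holding.
    parser = parser_class()
    parser.feed(ROW)
    parser.close()
    return parser.href_name


print(repr(parse(Overwriting)))
print(repr(parse(Keeping)))
```

With the first class the stored href is empty by the end of the row, which matches the "prints nothing" symptom; with the second it survives the later \<a>.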
That said, I still suggest that you abandon sgmllib and use HTMLParser, or possibly an external module like BeautifulSoup or lxml.
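For reference, a minimal HTMLParser-based sketch of this kind of extraction could look like the following. It is not the attached script: the class name HrefCollector, the helper collect_hrefs, and the sample HTML are assumptions made for illustration. It is written for Python 3 (html.parser); on Python 2.7 the import would be "from HTMLParser import HTMLParser", and the base initializer would probably need to be called as HTMLParser.__init__(self) rather than through super().

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> whose href does not contain 'cve'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; a dict is easier to query.
        href = dict(attrs).get("href")
        if href and 'cve' not in href:
            self.hrefs.append(href)


def collect_hrefs(html):
    # Hypothetical helper: parse a string and return the collected hrefs.
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up sample: one row containing two <a> elements.
SAMPLE = """
<table>
  <tr>
    <td><a href="/cve/CVE-2012-0001">CVE-2012-0001</a></td>
    <td><a style="color:red" href="/vuln/1234">details</a></td>
  </tr>
</table>
"""

print(collect_hrefs(SAMPLE))
```

Appending to a list instead of assigning to a single attribute means that a row with two or more \<a> elements cannot overwrite an earlier match.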
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['invalid', 'type-bug', 'library']
title = 'SGMLParser processing which include two will have problem'
updated_at =
user = 'https://bugs.python.org/moonflow'
```
bugs.python.org fields:
```python
activity =
actor = 'ezio.melotti'
assignee = 'none'
closed = True
closed_date =
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation =
creator = 'moonflow'
dependencies = []
files = ['28049', '28050']
hgrepos = []
issue_num = 16513
keywords = []
message_count = 5.0
messages = ['175990', '175992', '175993', '175994', '175995']
nosy_count = 2.0
nosy_names = ['ezio.melotti', 'moonflow']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue16513'
versions = ['Python 2.7']
```
If a \<tr> includes two or more \<a> elements, SGMLParser processing has a problem.
Here I print vul_href, but nothing is printed. Is that expected?
Have you tried with HTMLParser? sgmllib is deprecated and has been removed in Python 3. HTMLParser is also much better at parsing (broken) HTML.
I haven't tried it. Will the problem not occur with HTMLParser?
If what you are trying to do is extract the link(s) that contain 'cve', you can try the attached script.