Closed jyn514 closed 6 years ago
Agghh I remember the problem now. lxml is returning None for elem.tail when it should return a string. I know it shouldn't be None because on line 11550 there is this HTML:
<SPAN class="fieldlabeltext">Registration Dates: </SPAN>Feb 01, 2018 to Aug 29, 2018
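For reference, a minimal sketch of where that text should end up in the ElementTree model. It uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that both expose the same .text/.tail attributes. The <td> wrapper is an assumption added only to make the snippet well-formed; it does not reproduce the bug, it only shows the expected behaviour.

```python
import xml.etree.ElementTree as ET

# The <td> wrapper is an assumption; the SPAN and the text after it come from the page.
snippet = (
    '<td><SPAN class="fieldlabeltext">Registration Dates: </SPAN>'
    'Feb 01, 2018 to Aug 29, 2018</td>'
)
cell = ET.fromstring(snippet)
span = cell.find("SPAN")

label = span.text   # text inside the element
dates = span.tail   # text between </SPAN> and the next tag

# For contrast: with nothing after the closing tag, tail really is None.
bare = ET.fromstring('<td><SPAN class="fieldlabeltext">Registration Dates: </SPAN></td>')
bare_tail = bare.find("SPAN").tail
```

So a None tail would be expected only if the parser saw no text after the closing tag, which is not what the HTML above shows.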
Output of parse.py:
./parse.py --sections < webpages/USC_all_sections.html > .sections.data || { rm -f .sections.data; exit 999; }
DEBUG: CRN 13710 does not have text following. elem: <Element span at 0x7ff13f5f4e48>, elem.text: Registration Dates: , elem.tail: None
make: *** [makefile:44: .sections.data] Error 231
Note: this is not fixed by updating to lxml 4.2.1 (2018-03-21)
Note: filtering the page through xmllint first has changed the output; I will update once I know whether this has actually fixed it or not.
Screw it, I'm switching to xpath. The main table can be accessed with //table[@class='datadisplaytable'][1]; each class has two rows: one with a header (th) and one with a body.
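A rough sketch of that header/body pairing, under some loud assumptions: the sample markup and course titles are invented, the rows are assumed to alternate strictly, and it uses the standard library's ElementTree (which only supports a subset of XPath) instead of lxml. The helper name section_pairs is hypothetical, not something in parse.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page; titles and CRN 13711 are invented.
SAMPLE = """
<html><body>
<table class="datadisplaytable">
  <tr><th>Example Course A - 13710</th></tr>
  <tr><td><span class="fieldlabeltext">Registration Dates: </span>Feb 01, 2018 to Aug 29, 2018</td></tr>
  <tr><th>Example Course B - 13711</th></tr>
  <tr><td><span class="fieldlabeltext">Registration Dates: </span>Feb 01, 2018 to Aug 29, 2018</td></tr>
</table>
</body></html>
"""

def section_pairs(root):
    """Return (header_text, body_row) tuples from the first datadisplaytable."""
    # find() returns the first matching table in document order.
    table = root.find(".//table[@class='datadisplaytable']")
    if table is None:
        return []
    rows = table.findall("tr")
    # Assumes rows strictly alternate: header row, then body row.
    return [
        (header.findtext("th", default=""), body)
        for header, body in zip(rows[0::2], rows[1::2])
    ]

root = ET.fromstring(SAMPLE.strip())
pairs = section_pairs(root)
titles = [title for title, _ in pairs]
```

With lxml the lookup would presumably go through root.xpath(...) with the expression above; the pairing step would stay the same.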
parse.py was hacked together in a very ugly way. This issue is not for rewriting it, although I should probably do that. This is for removing the (count them!) 5 try/except AttributeError blocks in parse_sections.
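One possible way to drop those blocks, sketched under assumptions: the body of parse_sections is not shown here, so this guesses that each try/except guards an access like elem.tail.strip() on something that may be None. The helper name tail_text is hypothetical, and stdlib ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

def tail_text(elem, default=""):
    """Return the stripped text following elem, or default if elem or its tail is missing."""
    if elem is None or elem.tail is None:
        return default
    return elem.tail.strip()

# The <td> wrapper is an assumption to keep the fragment well-formed.
cell = ET.fromstring(
    '<td><span class="fieldlabeltext">Registration Dates: </span>'
    'Feb 01, 2018 to Aug 29, 2018</td>'
)
found = tail_text(cell.find("span"))   # element exists and has a tail
missing = tail_text(cell.find("b"))    # find() returned None, so the default is used
```

Callers can then check for the default explicitly instead of catching AttributeError, which also avoids hiding unrelated attribute errors.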