Switch parsing to xpath

jyn514 commented 6 years ago

parse.py was hacked together in a very ugly way. This issue is not for rewriting it, although I should probably do that. This is for removing the (count them!) 5 try/except AttributeError blocks in parse_sections.

jyn514 commented 6 years ago

Agghh, I remember the problem now. lxml is returning None for elem.tail when it should return a string. I know it shouldn't be None because line 11550 contains this HTML: <SPAN class="fieldlabeltext">Registration Dates: </SPAN>Feb 01, 2018 to Aug 29, 2018

Output of parse.py:

./parse.py --sections < webpages/USC_all_sections.html > .sections.data || { rm -f .sections.data; exit 999; }
DEBUG: CRN 13710 does not have text following. elem: <Element span at 0x7ff13f5f4e48>, elem.text: Registration Dates: , elem.tail: None
make: *** [makefile:44: .sections.data] Error 231
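
For reference, a minimal sketch of how .text and .tail behave on a fragment like the one above. It uses the standard library's xml.etree.ElementTree as a stand-in so it runs without extra dependencies; lxml follows the same .text/.tail model. The helper name tail_text and the second fragment (the date wrapped in a <b>) are made up for illustration, and nothing here claims to be what the real page looks like at that point.

```python
import xml.etree.ElementTree as ET


def tail_text(elem):
    """Return the text that directly follows elem, or '' when .tail is None."""
    return (elem.tail or "").strip()


# The fragment quoted in the comment, lowercased and wrapped in a <td>
# so that it is a well-formed document on its own.
cell = ET.fromstring(
    '<td><span class="fieldlabeltext">Registration Dates: </span>'
    'Feb 01, 2018 to Aug 29, 2018</td>'
)
label = cell.find("span")
print(repr(label.text))  # the label inside the span
print(repr(label.tail))  # the text right after </span>

# Hypothetical variant: the date sits inside another element, so no text
# directly follows the span and .tail is None.
wrapped = ET.fromstring(
    '<td><span class="fieldlabeltext">Registration Dates: </span>'
    '<b>Feb 01, 2018 to Aug 29, 2018</b></td>'
)
wrapped_label = wrapped.find("span")
print(repr(wrapped_label.tail))
print(repr(tail_text(wrapped_label)))
```

A helper like this would let one check replace a try/except AttributeError around elem.tail.strip(), though it only hides a None tail and does not explain why the parser produced one.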

jyn514 commented 6 years ago

Note: this is not fixed by updating to lxml 4.2.1 (2018-03-21)

jyn514 commented 6 years ago

Note: filtering the page through xmllint first changed the output. Will update once I know whether this actually fixed it.

jyn514 commented 6 years ago

Screw it, I'm switching to xpath. The main table can be accessed with //table[@class='datadisplaytable'][1]; each class has two rows: one with a header (th) and one with a body.
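
A rough sketch of that layout, assuming the rows strictly alternate header, body. This is not the code that ended up in parse.py. It uses the standard library's xml.etree.ElementTree, which only supports a subset of XPath, so the first matching table is picked with find() rather than a [1] predicate; with lxml the same expression could be passed to .xpath() instead. The PAGE string, the section titles, and the function name section_pairs are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real sections page.
PAGE = """
<html><body>
<table class="datadisplaytable">
<tr><th>Section A - 13710</th></tr>
<tr><td><span class="fieldlabeltext">Registration Dates: </span>Feb 01, 2018 to Aug 29, 2018</td></tr>
<tr><th>Section B - 13711</th></tr>
<tr><td><span class="fieldlabeltext">Registration Dates: </span>Feb 01, 2018 to Aug 29, 2018</td></tr>
</table>
</body></html>
"""


def section_pairs(root):
    """Return (header text, body text) for each class in the main table."""
    # find() returns the first match in document order, or None.
    table = root.find(".//table[@class='datadisplaytable']")
    if table is None:
        return []
    rows = table.findall("tr")
    pairs = []
    # Rows come in (header, body) pairs, so walk them two at a time.
    for header_row, body_row in zip(rows[0::2], rows[1::2]):
        # itertext() yields both .text and .tail of nested elements,
        # so the date after </span> is included without touching .tail.
        title = "".join(header_row.itertext()).strip()
        details = "".join(body_row.itertext()).strip()
        pairs.append((title, details))
    return pairs


pairs = section_pairs(ET.fromstring(PAGE.strip()))
for title, details in pairs:
    print(title, "|", details)
```

One caveat when moving this to full XPath: //table[@class='datadisplaytable'][1] selects the first matching table under each parent, while (//table[@class='datadisplaytable'])[1] selects the first one in the whole document. The two only differ if matching tables appear under more than one parent.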

jyn514 commented 6 years ago

Done.

jyn514 / GradeForge

Switch parsing to xpath #12