sandboxnu / major-scraper

Scraping Northeastern's Academic Catalog for use in GraduateNU.
GNU General Public License v3.0

Address comments that are not tokenized as `XOM` #16

Open rael346 opened 8 months ago

rael346 commented 8 months ago

Currently, any HTML element that the tokenizer does not recognize is tokenized into a `COMMENT` token. Because the catalog is inconsistent in its phrasing, our tokenizer can miss some cases that it should handle. This ticket mostly involves going through `comments.json` to see which cases the tokenizer can address, and writing tests for them to ensure backwards compatibility with older majors (especially the `XOM` phrasing).
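
As a rough illustration of the workflow described above, here is a minimal sketch in Python (not the project's own code). It assumes that `comments.json` can be read as a list of comment strings and that `XOM` refers to "X of many" phrasing such as "Complete two of the following"; both are assumptions, and `XOM_PATTERN`, `tokenize_text`, and `summarize_comments` are hypothetical names, not the scraper's actual API.

```python
import re
from collections import Counter

# Hypothetical pattern for "X of many" phrasing (an assumption, not the
# scraper's real rule), e.g. "Complete two of the following".
XOM_PATTERN = re.compile(
    r"^complete\s+(?:at\s+least\s+)?(one|two|three|four|five|six|\d+)\b.*\bfollowing\b",
    re.IGNORECASE,
)


def tokenize_text(text):
    """Return an XOM token when the text matches, otherwise fall back to COMMENT."""
    if XOM_PATTERN.search(text.strip()):
        return {"type": "XOM", "text": text}
    return {"type": "COMMENT", "text": text}


def summarize_comments(comments):
    """Count texts that still fall back to COMMENT, most frequent first.

    Whitespace and case are normalized so near-identical phrasings group together.
    """
    counts = Counter()
    for text in comments:
        if tokenize_text(text)["type"] == "COMMENT":
            normalized = re.sub(r"\s+", " ", text).strip().lower()
            counts[normalized] += 1
    return counts.most_common()
```

Running `summarize_comments` over the loaded comment strings would surface the most frequent unrecognized phrasings first, and each phrasing that gets a new tokenizer rule could be pinned with a small test so that older majors keep parsing the same way.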