Hi, @mikzolot
I'm just checking the parser part you built.
There seem to be some edge cases:
Canadian-American is parsed into 'canadianamerican'.
Also, 'stextstylefracnsumyibary' looks like the result of deleting hyphens, though I couldn't trace it back to the source text.
etc.
I guess it all boils down to this line:
article = re.sub(r"[^a-z\s]", "", article)
I couldn't find exactly what happens with instances like "California." or "opposed;", where the word is followed by another character rather than a space, but it seems that the parser will just skip them.
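To make the behaviour concrete, here is a minimal sketch of what that `re.sub` call does on its own, next to one possible alternative. It assumes the text is lowercased before the substitution (I haven't confirmed that in the parser), and the names `strip_chars` and `space_chars` are made up for illustration, not taken from the code base.

```python
import re


def strip_chars(article: str) -> str:
    # The line quoted above: delete every character that is not a-z or whitespace.
    return re.sub(r"[^a-z\s]", "", article)


def space_chars(article: str) -> str:
    # Hypothetical alternative: turn each run of other characters into a space,
    # then collapse repeated whitespace, so hyphenated words stay separate.
    spaced = re.sub(r"[^a-z\s]+", " ", article)
    return re.sub(r"\s+", " ", spaced).strip()


# Assumption: the article is lowercased before this step.
sample = "Canadian-American, California. opposed;".lower()

print(strip_chars(sample))  # canadianamerican california opposed
print(space_chars(sample))  # canadian american california opposed
```

In isolation, the quoted pattern removes the trailing "." and ";" and keeps the word, so if those words really are missing from the output, the cause is probably in another step. Replacing with a space instead of an empty string would keep "canadian" and "american" as two tokens, at the cost of also splitting words like "don't" into "don" and "t".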