timothywangdev / ConversationalAgent

Conversational Agent

URL links in reddit comments should be replaced with token or filtered out #8

Open bwuu opened 9 years ago

bwuu commented 9 years ago

URL links look something like "httpvideo nationalgeographic comvideoplayeranimalsinvertebratesanimalsoctopusandsquidoctopuscyanealocomotion html" after preprocessing, which is obviously not good for training. We should replace them with a placeholder token or filter them out.

related: maybe need to tokenize/filter gibberish words in general to limit vocab size?
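One common way to cap vocabulary size is a frequency cutoff: words seen fewer than some number of times are mapped to a single unknown-word token. Here is a rough sketch of that idea. The token name `<unk>`, the function names, and the `min_count` threshold are all hypothetical and not taken from this repo.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical token name, not from this repo


def build_vocab(sentences, min_count=2, max_size=None):
    """Keep words seen at least min_count times, capped at max_size words."""
    counts = Counter(word for s in sentences for word in s.split())
    # most_common(None) returns every word, ordered by descending count
    return {w for w, c in counts.most_common(max_size) if c >= min_count}


def replace_rare(sentence, vocab):
    """Map every out-of-vocabulary word to the unknown-word token."""
    return " ".join(w if w in vocab else UNK for w in sentence.split())


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran", "httpvideo nationalgeographic the"]
    vocab = build_vocab(corpus, min_count=2)
    print(replace_rare("the cat httpvideo", vocab))
```

This would also catch mangled URLs like the one above, since each one tends to appear only once, but it throws away legitimate rare words too.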

timothywangdev commented 9 years ago

Currently DataGenerator only converts numbers to a placeholder token. Feel free to add a few lines to convert URLs to a token as well. We may need to consider phone numbers too (xxx-xxx-xxxx).
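A minimal sketch of what those few lines could look like, assuming the replacement runs on the raw comment text before punctuation is stripped (once the "://" and dots are gone, a URL is hard to recognize). The token names `<url>`, `<phone>`, and `<number>` and the `normalize` function are hypothetical. This is not DataGenerator's actual code.

```python
import re

# Hypothetical token names; the tokens this repo actually uses may differ.
URL_TOKEN = "<url>"
PHONE_TOKEN = "<phone>"
NUMBER_TOKEN = "<number>"

# Anything starting with http://, https:// or www. up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Phone numbers in the xxx-xxx-xxxx form only.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Integers and simple decimals.
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def normalize(text):
    """Replace URLs, phone numbers, then remaining numbers with tokens."""
    # Order matters: URLs and phone numbers contain digits, so they must be
    # replaced before the generic number pattern runs.
    text = URL_RE.sub(URL_TOKEN, text)
    text = PHONE_RE.sub(PHONE_TOKEN, text)
    text = NUMBER_RE.sub(NUMBER_TOKEN, text)
    return text


if __name__ == "__main__":
    print(normalize("see http://video.nationalgeographic.com/video/player.html now"))
    print(normalize("call 555-123-4567 or 42 times"))
```

The URL pattern is deliberately loose: it will swallow trailing punctuation attached to a link and will miss bare domains without a scheme or "www." prefix.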