Should we preprocess link text?
For example, we may have:
You may find the [How to code in Java](http://www.randomlink.com) article useful
After removing the HTML tags, the sentence text we get is:
You may find the How to code in Java article useful
I think the above sentence would confuse the CoreNLP parser, right?
My suggestion would be to preprocess the link markup so that the sentence becomes:
You may find the LINK article useful
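A minimal sketch of what I have in mind, assuming we handle both Markdown links and HTML anchors (the `LINK` token and the regexes here are just my suggestion, not something already in the pipeline):

```python
import re

def preprocess_links(text):
    """Replace link markup with a LINK placeholder token."""
    # Markdown links: [anchor text](url) -> LINK
    text = re.sub(r"\[[^\]]*\]\([^)]*\)", "LINK", text)
    # HTML anchors: <a href="...">anchor text</a> -> LINK
    text = re.sub(r"<a\b[^>]*>.*?</a>", "LINK", text,
                  flags=re.IGNORECASE | re.DOTALL)
    return text

sentence = "You may find the [How to code in Java](http://www.randomlink.com) article useful"
print(preprocess_links(sentence))
# -> You may find the LINK article useful
```

This way the parser sees a single token in place of the anchor text instead of an embedded sentence fragment.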
I played around with this on a very small data sample; you can check the following Google Sheet. Compare the `output_nolemm_preproc_code_and_link` and `output_nolemm_nopreproc` sheets. The final output for this small data set is the same, but I think it may make a difference in other cases. I'll try bigger data sets to see how it affects things.

@ctreude what do you think?