Should we preprocess link text?
For example, we may have:
You may find the [How to code in Java](http://www.randomlink.com) article useful
After removing the HTML tags, the sentence text we get is:
You may find the How to code in Java article useful
I think the above sentence would confuse the CoreNLP parser, right?
My suggestion would be to preprocess the link markup so that the sentence becomes:
You may find the LINK article useful
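A minimal sketch of what I have in mind, assuming we handle both Markdown links and HTML anchors (the `LINK` token and the regexes here are just my suggestion, not something already in the pipeline):

```python
import re

def preprocess_links(text):
    """Replace link markup with a LINK placeholder token."""
    # Markdown links: [anchor text](url) -> LINK
    text = re.sub(r"\[[^\]]*\]\([^)]*\)", "LINK", text)
    # HTML anchors: <a href="...">anchor text</a> -> LINK
    text = re.sub(r"<a\b[^>]*>.*?</a>", "LINK", text,
                  flags=re.IGNORECASE | re.DOTALL)
    return text

sentence = "You may find the [How to code in Java](http://www.randomlink.com) article useful"
print(preprocess_links(sentence))
# -> You may find the LINK article useful
```

This way the parser sees a single token in place of the anchor text instead of an embedded sentence fragment.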
I played around with this on a very small data sample; you can check the following Google Sheet. Compare the `output_nolemm_preproc_code_and_link` and `output_nolemm_nopreproc` sheets. The final output for this small data set is the same, but I think it may make a difference in other cases. I'll try bigger data sets to see how it affects things.

@ctreude what do you think?