ualberta-smr / StackOverflowNavCues

This repository holds the code related to extracting essential sentences for navigating Stack Overflow answers

Preprocess Link Text? #6

Closed snadi closed 5 years ago

snadi commented 5 years ago

Should we preprocess link text?

For example, we may have:

You may find the [How to code in Java](http://www.randomlink.com) article useful

When we get the sentence text after removing HTML tags, we will get:

You may find the How to code in Java article useful

I think the above sentence would confuse the CoreNLP parser, right?

My suggestion would be to preprocess links so that the link text is replaced by a placeholder and the sentence becomes:

You may find the LINK article useful
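A minimal sketch of what this preprocessing could look like, assuming the answer body is available as HTML and that Python's standard `html.parser` is acceptable for the job. The names `LinkPlaceholderParser` and `replace_links` are hypothetical and not taken from this repository; the placeholder defaults to `LINK` as in the example above.

```python
from html.parser import HTMLParser


class LinkPlaceholderParser(HTMLParser):
    """Collects text from HTML, swapping each <a>...</a> element for a placeholder.

    Hypothetical helper for illustration only; not part of the repository.
    """

    def __init__(self, placeholder="LINK"):
        super().__init__(convert_charrefs=True)
        self.placeholder = placeholder
        self.parts = []        # pieces of the output sentence text
        self.anchor_depth = 0  # > 0 while we are inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Emit the placeholder once, when the outermost anchor opens.
            if self.anchor_depth == 0:
                self.parts.append(self.placeholder)
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Keep text outside links; drop the link text itself.
        if self.anchor_depth == 0:
            self.parts.append(data)


def replace_links(html, placeholder="LINK"):
    """Return the text of `html` with every link replaced by `placeholder`."""
    parser = LinkPlaceholderParser(placeholder)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


html = ('You may find the <a href="http://www.randomlink.com">'
        'How to code in Java</a> article useful')
print(replace_links(html))
```

All other tags are dropped and their text is kept, so this sketch also covers the "remove html tags" step for the sentence; whether that matches the existing pipeline is an assumption.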

I tried this on a very small data sample. You can check the following Google Sheet and compare the output_nolemm_preproc_code_and_link and output_nolemm_nopreproc sheets. The final output for this small data set is the same, but I think the preprocessing may change the results in other cases. I'll try bigger data sets to see how it affects things.

@ctreude what do you think?

ctreude commented 5 years ago

Hi @snadi, yes, I think it's a good idea to preprocess links in the way that you propose.