laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Make sure that #anchors in URLs don't trigger the crawl of a new page #5

Closed by laurentprudhon 5 years ago

laurentprudhon commented 5 years ago

For example, if the following page has already been extracted: https://www.cbanque.com/forums/fil/offre-de-pret-renvoyee-et-appel-de-fonds.36043/page-2

This URL shouldn't trigger a new page extraction: https://www.cbanque.com/forums/fil/offre-de-pret-renvoyee-et-appel-de-fonds.36043/page-2#post-305439

This should hold even if the URL with an anchor is hidden behind a redirect (see issue #4).

laurentprudhon commented 5 years ago

Checked in HyperlinkParser.cs, line 104:

                // Remove the url fragment part of the url if needed.
                // This is the part after the # and is often not useful.
                href = _config.IsRespectUrlNamedAnchorOrHashbangEnabled
                    ? hrefValue
                    : hrefValue.Split('#')[0];
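To illustrate the expected behavior on the two URLs from this issue, here is a minimal standalone sketch. It is not the crawler's actual code: the class `AnchorDedup`, the helper `StripFragment`, and the `Visited` set are hypothetical names introduced for the example. The same check could presumably be applied to the final URL once a redirect has been followed, but that part is an assumption and is not shown.

```csharp
using System;
using System.Collections.Generic;

public static class AnchorDedup
{
    // Hypothetical visited set, keyed by the URL without its fragment.
    private static readonly HashSet<string> Visited =
        new HashSet<string>(StringComparer.Ordinal);

    // Hypothetical helper: returns the URL without its "#fragment" part.
    public static string StripFragment(string url)
    {
        int hashIndex = url.IndexOf('#');
        return hashIndex < 0 ? url : url.Substring(0, hashIndex);
    }

    // Returns true only the first time a given page (ignoring anchors) is seen.
    public static bool ShouldExtract(string url)
    {
        return Visited.Add(StripFragment(url));
    }

    public static void Main()
    {
        const string page =
            "https://www.cbanque.com/forums/fil/offre-de-pret-renvoyee-et-appel-de-fonds.36043/page-2";

        Console.WriteLine(ShouldExtract(page));                  // True
        Console.WriteLine(ShouldExtract(page + "#post-305439")); // False
    }
}
```

Keying the set on the fragment-free URL means any number of `#post-...` links to the same forum page collapse to a single extraction.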