iipc/webarchive-commons
Common web archive utility code.
Apache License 2.0
Improve HTML link extraction #72
Closed
sebastian-nagel closed 7 years ago
sebastian-nagel commented 7 years ago
- add extractors for more elements which can take URLs as attribute values, add missing attributes, see commoncrawl/ia-web-commons#9 for details
- generalize extraction of "global" attributes (`background`)
- add custom data attributes frequently used for linking (`data-href`, `data-uri`), cf. commoncrawl/ia-web-commons#7
- extend unit tests to cover link extraction
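
To illustrate the idea behind the items above, here is a minimal sketch in Python (not the Java code of webarchive-commons). It combines a per-element table of URL-bearing attributes with a set of attributes checked on every element. The names `ELEMENT_URL_ATTRS`, `GLOBAL_URL_ATTRS`, `LinkCollector` and `extract_links` are hypothetical, and the element table is a small illustrative subset, not the list used by the library.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete table: element -> attributes holding URLs.
ELEMENT_URL_ATTRS = {
    "a": {"href"},
    "img": {"src"},
    "link": {"href"},
    "script": {"src"},
    "iframe": {"src"},
    "video": {"src", "poster"},
    "blockquote": {"cite"},
}

# Attributes looked up on every element, regardless of tag name:
# the "global" background attribute plus the custom data attributes.
GLOBAL_URL_ATTRS = {"background", "data-href", "data-uri"}


class LinkCollector(HTMLParser):
    """Collects (tag, attribute, url) triples from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        wanted = ELEMENT_URL_ATTRS.get(tag, set()) | GLOBAL_URL_ATTRS
        for name, value in attrs:
            if name in wanted and value:  # skip missing or empty values
                self.links.append((tag, name, value.strip()))


def extract_links(html):
    """Return all (tag, attribute, url) triples found in the given HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

A unit test for link extraction could then feed a small HTML snippet to `extract_links` and compare the returned triples against the expected list, with one case per element or attribute being added.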