Open dportabella opened 7 years ago
I am currently doing the following:
val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  .map(record => (record.getUrl, record.getContentString))
  // Keep a single content string per URL and drop the others.
  .reduceByKey { case (contentString1, _) => contentString1 }
...
I am trying to get rid of duplicate pages as follows:
Any ideas? Or is there another way to achieve the same objective?
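
To make the intent of the reduceByKey step concrete, here is a minimal sketch of the same "keep one content string per URL" logic on a plain Scala collection, so it runs without Spark. The sample URLs and content strings are made up for illustration and stand in for the (r.getUrl, r.getContentString) pairs; groupMapReduce assumes Scala 2.13 or later.

```scala
// Hypothetical sample data standing in for (url, contentString) pairs.
val pages: Seq[(String, String)] = Seq(
  ("http://example.com/a", "first capture of a"),
  ("http://example.com/b", "only capture of b"),
  ("http://example.com/a", "second capture of a")
)

// Group by URL and keep the first content string seen for each URL,
// mirroring reduceByKey { case (contentString1, _) => contentString1 }.
val deduped: Map[String, String] =
  pages.groupMapReduce(_._1)(_._2)((first, _) => first)

deduped.foreach { case (url, content) => println(s"$url -> $content") }
```

On a local Seq the first capture in iteration order is the one kept. On an RDD, reduceByKey expects a commutative and associative function, so which duplicate survives should be treated as arbitrary.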