Open · acruise opened this issue 9 months ago
Hi Alex,
you're right, Sparkling wasn't designed for S3 in the first place, but for HDFS. It might in fact work with the S3 adapters set up properly in Hadoop, but I'm not sure and this would be untested. According to your recent edit, it sounds like it actually did, though?
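(For reference, a rough, untested sketch of that adapter setup; it assumes hadoop-aws and a matching aws-java-sdk are on the classpath, that sc is your SparkContext, and that accessKey/secretKey are placeholders for your credentials:)

// Hadoop's s3a connector, configured on the SparkContext; the property names
// are standard hadoop-aws settings, the credential values are placeholders
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

// With this in place, s3a:// URIs such as
//   s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/...
// should be usable wherever an HDFS path is expected.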
However, Sparkling also has an S3 client built in, and the pattern for loading WARCs with it is slightly different. (Also untested, but it should work this way; here's an example that prints all URLs in a WARC file.)
// You'll need Amazon's AWS SDK 1.7.4 on your classpath; this is the directive
// if you run it in a Jupyter notebook with Almond, as I usually do:
import $ivy.`com.amazonaws:aws-java-sdk:1.7.4`

import org.archive.webservices.sparkling.io._
import org.archive.webservices.sparkling.warc._

// Open the file directly from the commoncrawl bucket and print every URL
S3Client(accessKey, secretKey).access { s3 =>
  s3.open("commoncrawl", "crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz") { in =>
    WarcLoader.load(in).flatMap(_.url).foreach(println)
  }
}
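To scale this beyond a single file, one option (a hedged, untested sketch; the key list is illustrative and sc is assumed to be your SparkContext) is to run the same pattern inside Spark executors:

// Illustrative keys; in practice you'd take them from the crawl's wat.paths.gz listing
val keys = Seq(
  "crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz"
)

// Each executor opens its own S3 connection and streams the records it is given
sc.parallelize(keys).foreach { key =>
  S3Client(accessKey, secretKey).access { s3 =>
    s3.open("commoncrawl", key) { in =>
      WarcLoader.load(in).flatMap(_.url).foreach(println)
    }
  }
}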
Also, please note that you're using Sparkling's WARC loader for WAT files here. This works because WAT uses WARC as its container format, but the payload is JSON metadata rather than the HTTP message you'd expect in "regular" WARC files.
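To make that concrete: a WAT record's payload is a JSON document describing the capture, so once you have the payload as a string (not shown here; check Sparkling's WarcRecord API for the right accessor) you'd parse it as JSON rather than as an HTTP response. A minimal sketch, using ujson and a heavily abbreviated, illustrative payload:

import $ivy.`com.lihaoyi::ujson:3.0.0`

// Abbreviated stand-in for one WAT record's JSON payload
val watJson = """{"Envelope":{"WARC-Header-Metadata":{"WARC-Target-URI":"https://example.com/"}}}"""

// Navigate the metadata as JSON instead of trying to parse an HTTP response
val uri = ujson.read(watJson)("Envelope")("WARC-Header-Metadata")("WARC-Target-URI").str
println(uri)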
EDIT: this helped with the "Wrong FS" error, more tickets incoming ;)
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
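(For context: Hadoop raises "Wrong FS" when a path's scheme doesn't match the filesystem it's handed, which is why repointing fs.defaultFS at the bucket helps. An alternative, untested sketch is to resolve the filesystem from the path itself using the standard Hadoop API:)

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the FileSystem from the path's own scheme/authority rather than
// from fs.defaultFS, so s3a:// paths are never checked against the HDFS default
val path = new Path("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/")
val fs = FileSystem.get(new URI(path.toString), sc.hadoopConfiguration)
println(fs.exists(path))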
Hey folks, I'm trying to read some common crawl data from S3... See https://github.com/archivesunleashed/aut/issues/556 where I'm using the aut pattern, but I get the same symptom using Sparkling by itself: