internetarchive / Sparkling

Internet Archive's Sparkling Data Processing Library
MIT License

`s3a` URLs don't work in `WarcLoader` (`Wrong FS: s3a://...`) #3

Open acruise opened 6 months ago

acruise commented 6 months ago

EDIT: this helped with Wrong FS, more tickets incoming ;)

sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
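
For completeness, the full set of settings I'd expect to need (untested beyond the line above; the second key is a standard hadoop-aws setting, and the anonymous provider assumes the commoncrawl bucket allows unauthenticated reads):

sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
// assumption: the public bucket can be read without AWS credentials
sc.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")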

Hey folks, I'm trying to read some Common Crawl data from S3. See https://github.com/archivesunleashed/aut/issues/556, where I'm using the aut pattern, but I get the same symptom using Sparkling by itself:

bin/spark-shell --jars ~/dev/Sparkling/target/scala-2.12/sparkling-assembly-0.3.8-SNAPSHOT.jar --packages com.amazonaws:aws-java-sdk:1.12.662,org.apache.hadoop:hadoop-aws:3.3.4

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.73:4040
Spark context available as 'sc' (master = local[*], app id = local-1708561219560).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.2
      /_/

Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.21)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.archive.webservices.sparkling._, org.archive.webservices.sparkling.warc._, org.archive.webservices.sparkling.io._
import org.archive.webservices.sparkling._
import org.archive.webservices.sparkling.warc._
import org.archive.webservices.sparkling.io._

scala> val warcs = WarcLoader.load("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz")
24/02/21 16:20:25 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
java.lang.IllegalArgumentException: Wrong FS: s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:807)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:105)
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:774)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
  at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:115)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:349)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)
  at org.archive.webservices.sparkling.io.HdfsIO.files(HdfsIO.scala:163)
  at org.archive.webservices.sparkling.util.RddUtil$.loadFilesLocality(RddUtil.scala:74)
  at org.archive.webservices.sparkling.util.RddUtil$.loadBinary(RddUtil.scala:125)
  at org.archive.webservices.sparkling.warc.WarcLoader$.load(WarcLoader.scala:56)
  ... 53 elided
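
For what it's worth, the Wrong FS error seems to come from Hadoop rather than Sparkling: HdfsIO.files globs through the FileSystem resolved from fs.defaultFS (file:/// in local mode, hence the RawLocalFileSystem in the trace), and that instance's checkPath rejects any path with a different scheme. A minimal sketch of the two resolution behaviors (plain Hadoop API, nothing Sparkling-specific; the path is just an example):

import org.apache.hadoop.fs.{FileSystem, Path}

val conf = sc.hadoopConfiguration
val p = new Path("s3a://commoncrawl/crawl-data/...") // hypothetical path

// Resolved from fs.defaultFS: a local filesystem whose checkPath throws
// "Wrong FS" for any non-file:// path, as in the trace above.
val defaultFs = FileSystem.get(conf)

// Resolved from the path's own scheme: an S3AFileSystem (needs hadoop-aws).
val s3Fs = p.getFileSystem(conf)

which is presumably why pointing fs.defaultFS at s3a://commoncrawl/ (see the EDIT above) makes the glob go to S3.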
helgeho commented 6 months ago

Hi Alex,

you're right, Sparkling was not designed for S3 in the first place, but for HDFS. It might in fact work with S3 adapters set up properly in Hadoop, but I'm not sure and it would be untested. According to your recent edit, it sounds like it actually did work, though?
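
If you do stay on the Hadoop route, the usual way to set the adapters up would be to pass those same settings at launch through Spark's spark.hadoop.* passthrough, e.g. (untested sketch; the credentials provider assumes anonymous access to the bucket, as in your edit):

bin/spark-shell --jars sparkling-assembly-0.3.8-SNAPSHOT.jar \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.defaultFS=s3a://commoncrawl/ \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider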

However, there's also an S3 client built into Sparkling; the pattern for loading WARCs with it is slightly different (also untested, but it should work this way). Here's an example that prints all URLs in a WARC file:

// You'll need Amazon's AWS SDK 1.7.4 in your classpath; this $ivy directive is
// the way to pull it in if you run this in a Jupyter notebook with Almond, as I usually do.
import $ivy.`com.amazonaws:aws-java-sdk:1.7.4`

import org.archive.webservices.sparkling.io._
import org.archive.webservices.sparkling.warc._

val accessKey = "..." // fill in your AWS credentials
val secretKey = "..."

S3Client(accessKey, secretKey).access { s3 =>
    // open the object by bucket and key, then stream the WARC records it contains
    s3.open("commoncrawl", "crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/wat/CC-MAIN-20231211210408-20231212000408-00881.warc.wat.gz") { in =>
        WarcLoader.load(in).flatMap(_.url).foreach(println)
    }
}
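
Note that with this pattern the object is read as a single stream, so (as far as I can tell) WarcLoader.load(in) gives you a local iterator rather than an RDD; for distributed processing across many WARC files you'd still go through Hadoop's s3a support as in your edit.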

Also, please note that you're using Sparkling's WARC loader for WAT files here. This works because WAT uses WARC as its container format, but the payload is not an HTTP message as you'd expect in "regular" WARC files.