lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

How to load an input from S3? #247


dportabella commented 7 years ago

This loads a WARC file from the local file system:

    val r: RDD[ArchiveRecord] = RecordLoader.loadArchives(path, sc)

How do I load a WARC file from Amazon S3?

I found this gist: https://gist.github.com/afrad/0cb20afa8c76b24a768d6d0acdac6c1d. Is this the correct way to do it? Is it explained anywhere in the Warcbase documentation?

    // Imports assumed by this snippet (from spark-core, hadoop-core and warcbase-core):
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.LongWritable
    import org.apache.spark.rdd.RDD
    import org.warcbase.io.WarcRecordWritable
    import org.warcbase.mapreduce.WacWarcInputFormat

    val conf = new Configuration()
    // These constants presumably hold the standard Hadoop configuration keys
    // "fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey" and "fs.defaultFS".
    conf.set(AWS_ACCESS_KEY_ID, "...")
    conf.set(AWS_SECRET_ACCESS_KEY, "...")
    conf.set(DEFAULT_FS, "s3n://aws-publicdatasets")

    val in = "/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00348-ip-10-236-182-209.ec2.internal.warc.gz"

    // newAPIHadoopFile returns a pair RDD keyed by byte offset, not
    // RDD[ArchiveRecord]; the gist presumably converts the values afterwards.
    val r: RDD[(LongWritable, WarcRecordWritable)] = sc.newAPIHadoopFile(
      in,
      classOf[WacWarcInputFormat],
      classOf[LongWritable],
      classOf[WarcRecordWritable],
      conf)

This works: it connects to Amazon S3 with my access key and loads the file. But how does it work? How does it know that it needs to connect to Amazon S3, in region us-east-1, where the Common Crawl dataset is hosted?
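My current understanding is that Hadoop picks the FileSystem implementation from the URI scheme: an s3n:// path resolves to whatever class is registered under fs.s3n.impl, and the bucket name comes from the URI authority (there is no region to configure; S3 routes requests to the bucket's region itself). Is that right? A minimal sketch of that resolution, using only the standard Hadoop API:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem

    val conf = new Configuration()
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    conf.set("fs.s3n.awsAccessKeyId", "...")
    conf.set("fs.s3n.awsSecretAccessKey", "...")

    // Hadoop looks up the class registered for the "s3n" scheme and hands it
    // the bucket ("aws-publicdatasets") parsed from the URI authority.
    val fs = FileSystem.get(URI.create("s3n://aws-publicdatasets/"), conf)
    println(fs.getClass.getName) // org.apache.hadoop.fs.s3native.NativeS3FileSystem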

If we have a local mirror of the Common Crawl dataset in local S3-compatible storage, how do I change the program above to point it at our storage?
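For example, would something along these lines work with the newer s3a filesystem, which has an endpoint setting? (Untested sketch: the endpoint URL and credentials are placeholders, and fs.s3a.path.style.access needs Hadoop 2.8+.)

    // Untested: point the s3a filesystem at a local S3-compatible mirror.
    // "s3.mirror.example.org:9000" is a placeholder for our local endpoint.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.mirror.example.org:9000")
    sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true") // most local mirrors need path-style URLs
    sc.hadoopConfiguration.set("fs.s3a.access.key", "...")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")

    // Then load with an s3a:// URI instead of s3n://
    val r: RDD[ArchiveRecord] = RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/...", sc)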

dportabella commented 7 years ago

This works on Amazon EC2 :)

// A glob over Common Crawl WARC files in the public commoncrawl bucket:
val in = "s3n://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/warc/CC-MAIN-20160823195810-0000*"
val awsAccessKeyId = "AKIAJ6RUSSNHLZWAAAAA"
val awsSecretAccessKey = "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"

// Register the S3 filesystem implementations and credentials with Hadoop:
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)

val archives = RecordLoader.loadArchives(in, sc)
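A quick sanity check that records actually come back from S3 (just the standard RDD count, which forces the read):

println(archives.count())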

build.sbt:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.warcbase" % "warcbase-core" % "0.1.0-SNAPSHOT"
    excludeAll(
      ExclusionRule(organization = "org.apache.spark"),
      ExclusionRule(organization = "org.apache.hadoop"))
)
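Note: on Hadoop 2.6+ the S3 filesystems were split out of hadoop-core into the separate hadoop-aws module, so outside of EMR it may also be needed on the classpath (the version here is just an example):

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"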

Yet I still don't know how to connect to a local Hadoop/S3 mirror of the Common Crawl dataset.