lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

How to load an input from S3? #247


dportabella commented 7 years ago

This loads a WARC file from the local file system:

    val r: RDD[ArchiveRecord] = RecordLoader.loadArchives(path, sc)

How do I load a WARC file from Amazon S3?

I found this gist: https://gist.github.com/afrad/0cb20afa8c76b24a768d6d0acdac6c1d. Is this the correct way to do it? Is it explained anywhere in the Warcbase documentation?

    // Imports assumed by this snippet (from spark-core, hadoop-core and warcbase-core):
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.LongWritable
    import org.apache.spark.rdd.RDD
    import org.warcbase.io.WarcRecordWritable
    import org.warcbase.mapreduce.WacWarcInputFormat

    val conf = new Configuration()
    // These constants presumably hold the standard Hadoop configuration keys
    // "fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey" and "fs.defaultFS".
    conf.set(AWS_ACCESS_KEY_ID, "...")
    conf.set(AWS_SECRET_ACCESS_KEY, "...")
    conf.set(DEFAULT_FS, "s3n://aws-publicdatasets")

    val in = "/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00348-ip-10-236-182-209.ec2.internal.warc.gz"

    // newAPIHadoopFile returns a pair RDD keyed by byte offset, not
    // RDD[ArchiveRecord]; the gist presumably converts the values afterwards.
    val r: RDD[(LongWritable, WarcRecordWritable)] = sc.newAPIHadoopFile(
      in,
      classOf[WacWarcInputFormat],
      classOf[LongWritable],
      classOf[WarcRecordWritable],
      conf)

This works: it connects to Amazon S3 with my access key and loads the file. But how does it work? How does it know that it needs to connect to Amazon S3, in region us-east-1, where the Common Crawl dataset is hosted?
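My current understanding is that Hadoop picks the FileSystem implementation from the URI scheme: an s3n:// path resolves to whatever class is registered under fs.s3n.impl, and the bucket name comes from the URI authority (there is no region to configure; S3 routes requests to the bucket's region itself). Is that right? A minimal sketch of that resolution, using only the standard Hadoop API:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem

    val conf = new Configuration()
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    conf.set("fs.s3n.awsAccessKeyId", "...")
    conf.set("fs.s3n.awsSecretAccessKey", "...")

    // Hadoop looks up the class registered for the "s3n" scheme and hands it
    // the bucket ("aws-publicdatasets") parsed from the URI authority.
    val fs = FileSystem.get(URI.create("s3n://aws-publicdatasets/"), conf)
    println(fs.getClass.getName) // org.apache.hadoop.fs.s3native.NativeS3FileSystem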

If we have a local mirror of the Common Crawl dataset in local S3-compatible storage, how do I change the program above to point it at our storage?
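For example, would something along these lines work with the newer s3a filesystem, which has an endpoint setting? (Untested sketch: the endpoint URL and credentials are placeholders, and fs.s3a.path.style.access needs Hadoop 2.8+.)

    // Untested: point the s3a filesystem at a local S3-compatible mirror.
    // "s3.mirror.example.org:9000" is a placeholder for our local endpoint.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.mirror.example.org:9000")
    sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true") // most local mirrors need path-style URLs
    sc.hadoopConfiguration.set("fs.s3a.access.key", "...")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")

    // Then load with an s3a:// URI instead of s3n://
    val r: RDD[ArchiveRecord] = RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/...", sc)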

dportabella commented 7 years ago

This works on Amazon EC2 :)

// A glob over Common Crawl WARC files in the public commoncrawl bucket:
val in = "s3n://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/warc/CC-MAIN-20160823195810-0000*"
val awsAccessKeyId = "AKIAJ6RUSSNHLZWAAAAA"
val awsSecretAccessKey = "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"

// Register the S3 filesystem implementations and credentials with Hadoop:
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)

val archives = RecordLoader.loadArchives(in, sc)
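A quick sanity check that records actually come back from S3 (just the standard RDD count, which forces the read):

println(archives.count())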

build.sbt:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.warcbase" % "warcbase-core" % "0.1.0-SNAPSHOT"
    excludeAll(
      ExclusionRule(organization = "org.apache.spark"),
      ExclusionRule(organization = "org.apache.hadoop"))
)
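Note: on Hadoop 2.6+ the S3 filesystems were split out of hadoop-core into the separate hadoop-aws module, so outside of EMR it may also be needed on the classpath (the version here is just an example):

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"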

Yet I still don't know how to connect to a local Hadoop/S3 mirror of the Common Crawl dataset.