rapid7 / godap

The Data Analysis Pipeline
MIT License

Add warc support to godap? #14

Open ssikdar1 opened 5 years ago

ssikdar1 commented 5 years ago

Currently, if someone were to download this WARC file from dap:

https://github.com/rapid7/dap/blob/master/samples/iawide.warc.bz2

They could parse it with dap:

$ bzcat iawide.warc.bz2 | dap warc  + json | head -n 1 | jq  'keys'
[
  "content",
  "content_length",
  "content_type",
  "warc_date",
  "warc_ip_address",
  "warc_payload_digest",
  "warc_record_id",
  "warc_target_uri",
  "warc_type"
]

However, if they were to try the same thing with godap:

$ bzcat iawide.warc.bz2 | ./dappy warc  + json | head -n 1 | jq  'keys'
bzcat: Can't open input file iawide.warc.bz2: No such file or directory.
Error: Invalid input plugin: warc

  Usage: ./dappy [input] + [filter] + [output]
       --inputs
       --outputs
       --filters

Example: echo world | ./dappy lines stdin + rename line=hello + json stdout

Looking at the supported input types:

$ ./dappy --inputs
Inputs:
 * json
 * lines

warc is not in the list.

We could use this Go library: https://github.com/slyrz/warc

It also supports reading compressed WARC files.

A sample script:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"

	"github.com/slyrz/warc"
)

// godapWarc holds one WARC record using the same JSON keys that dap emits.
type godapWarc struct {
	Type          string `json:"warc_type"`
	TargetURI     string `json:"warc_target_uri"`
	ID            string `json:"warc_record_id"`
	ContentLength string `json:"content_length"`
	Date          string `json:"warc_date"`
	ContentType   string `json:"content_type"`
	PayloadDigest string `json:"warc_payload_digest"`
	IPAddress     string `json:"warc_ip_address"`
	Content       string `json:"content"`
}

func main() {
	reader, err := warc.NewReader(os.Stdin)
	if err != nil {
		panic(err)
	}
	defer reader.Close()

	for {
		record, err := reader.ReadRecord()
		if err != nil {
			// Any error ends the loop; report it unless it is a plain end of input.
			if err != io.EOF {
				fmt.Fprintln(os.Stderr, "warc:", err)
			}
			break
		}

		// Read the whole record body into memory.
		buf := new(bytes.Buffer)
		if _, err := buf.ReadFrom(record.Content); err != nil {
			panic(err)
		}

		rec := &godapWarc{
			Type:          record.Header["warc-type"],
			TargetURI:     record.Header["warc-target-uri"],
			ID:            record.Header["warc-record-id"],
			ContentLength: record.Header["content-length"],
			ContentType:   record.Header["content-type"],
			Date:          record.Header["warc-date"],
			PayloadDigest: record.Header["warc-payload-digest"],
			IPAddress:     record.Header["warc-ip-address"],
			Content:       buf.String(),
		}

		// Emit one JSON object per line, like dap's json output.
		out, err := json.Marshal(rec)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}

Running it gives the same keys as dap:

$ cat ~/Downloads/iawide.warc.bz2  | go run warc_reader2.go  | head -n 1 | jq 'keys'
[
  "content",
  "content_length",
  "content_type",
  "warc_date",
  "warc_ip_address",
  "warc_payload_digest",
  "warc_record_id",
  "warc_target_uri",
  "warc_type"
]

$ cat ~/Downloads/iawide.warc.bz2  | go run warc_reader2.go  | head -n 1
{"warc_type":"warcinfo","warc_target_uri":"","warc_record_id":"\u003curn:uuid:88fbcbee-f24e-47c1-b0c4-f7a9530ceb74\u003e","content_length":"442","warc_date":"2011-02-25T18:32:19Z","content_type":"application/warc-fields","warc_payload_digest":"","warc_ip_address":"","content":"software: Heritrix/3.0.1-SNAPSHOT-20110127.213729 http://crawler.archive.org\r\nip: 207.241.232.79\r\nhostname: crawl301.us.archive.org\r\nformat: WARC File Format 1.0\r\nconformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\noperator: kenji@archive.org\r\nisPartOf: wide\r\ndescription: seeds.txt\r\nrobots: obey\r\nhttp-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)\r\n\r\n"}

So this can produce output similar to dap's.

Maybe we could add this to the filters part of the factory?