lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

NER Workflow & Documentation #168

Closed by ianmilligan1 6 years ago

ianmilligan1 commented 8 years ago

Can we write one single Spark script that generates:

Then we can use the HTML injection trick to embed it straight into a Spark notebook.

Now that this is all in our codebase, let's make sure to get it documented in the Wiki too.

jrwiebe commented 8 years ago

It's at the top of my to-do list. I hope to complete it today.

jrwiebe commented 8 years ago

I didn't finish it yesterday, but the Spark side of this is basically done. See my script here. I see some inefficiencies that I hope to fix, but it works.

I still have to modify the JavaScript in vis/ner/index.html to take JSON instead of CSV input.

lintool commented 8 years ago

Probably want to move the script to 'src/'?

ianmilligan1 commented 8 years ago

Just following up as @jrwiebe writes the documentation: this is the script we cooked up today. It runs NER extraction on one or more WARC/ARC files.

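import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// `arc` is assumed to hold the path to the ARC file(s) to process
val r = RecordLoader.loadArc(arc, sc)
  .keepMimeTypes(Set("text/html"))
  .discardDate(null)
  .map(r => {
    val t = ExtractRawText(r.getBodyContent)
    // Load the Stanford NER model (this path is specific to my machine)
    NER3Classifier("/Users/ianmilligan1/dropbox/ner/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz")
    val entities = NER3Classifier.classify(t)
    val len = 100
    // Keep only the first `len` characters of the extracted text as a preview
    (r.getCrawldate, r.getMimeType, entities, r.getUrl, if (t.length > len) t.substring(0, len) else t)
  })
  .collect()

jrwiebe commented 8 years ago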

This is still a work in progress, but I moved the script into src/, made it into a class, and added it to the matchbox package. The output format is still going to change: at this point, each line of the saved results is a separate JSON object, whereas I think all the data should be one huge blob of JSON.
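
For anyone following along, here is a minimal sketch of the difference between the two output shapes mentioned above: one JSON object per line versus a single combined JSON document. It is plain Scala and is not the NERCombinedJson implementation; the `CombineJsonLines` object, its `combine` helper, and the sample records (including their field names) are hypothetical and only illustrate the idea.

```scala
object CombineJsonLines {
  // Hypothetical helper: wraps already-serialized JSON objects (one per line)
  // into a single JSON array. It does not parse or validate the records;
  // it only trims whitespace and skips blank lines.
  def combine(lines: Seq[String]): String =
    lines.map(_.trim).filter(_.nonEmpty).mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    // Made-up records standing in for per-line NER results.
    val perLine = Seq(
      """{"domain":"example.com","entities":{"PERSON":["Ada Lovelace"]}}""",
      """{"domain":"example.org","entities":{"LOCATION":["Waterloo"]}}"""
    )
    // Prints one JSON array containing both records.
    println(combine(perLine))
  }
}
```

A single array like this can be read by a visualizer in one parse call, whereas line-delimited output has to be split and parsed record by record. Which shape is preferable probably depends on how large the combined result gets.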

This script invokes the method to classify a collection of (url, domain, content_text) files and save the results as JSON:

import org.warcbase.spark.matchbox.NERCombinedJson

sc.addFile("/path/to/english.all.3class.distsim.crf.ser.gz")

val nerJson = new NERCombinedJson

nerJson.classify("english.all.3class.distsim.crf.ser.gz", "/path/to/plaintext/collection", "nerjson/", sc)

jrwiebe commented 8 years ago

The visualizer now works with JSON; I've documented it here. The HTML injection part of this issue won't work without further modification to the visualizer script, which I won't have time to do until next week at the earliest.

ianmilligan1 commented 8 years ago

Fantastic, thanks @jrwiebe - keep us posted on the HTML injection part. I see @aliceranzhou has been running into some issues over on #177.

ianmilligan1 commented 8 years ago

Is this still worth pursuing, @jrwiebe? I know we ran into issues with the HTML injection part elsewhere.