ianmilligan1 closed this issue 6 years ago
It's at the top of my to-do list. I hope to complete it today.
I didn't finish it yesterday, but the Spark side of this is basically done. See my script here. I see some inefficiencies that I hope to fix, but it works.
I still have to modify the JavaScript in vis/ner/index.html to take JSON instead of CSV input.
Probably want to move the script to 'src/'?
Just to follow up as @jrwiebe writes documentation. This is the script we cooked up today that did NER extraction on single or multiple WARC/ARCs.
val r =
  RecordLoader.loadArc(arc, sc)
    .keepMimeTypes(Set("text/html"))
    .discardDate(null)
    .map(r => {
      val t = ExtractRawText(r.getBodyContent)
      NER3Classifier("/Users/ianmilligan1/dropbox/ner/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz")
      val entities = NER3Classifier.classify(t)
      val len = 100
      (r.getCrawldate, r.getMimeType, entities, r.getUrl, if (t.length > len) t.substring(0, len) else t)
    })
    .collect()
This is still a work in progress, but I moved the script into src/, made it into a class, and added it to the matchbox package. The output format will still change: at this point, each line of the saved results is a separate JSON object, whereas I think all the data should be one huge blob of JSON.
This script invokes the method to classify a collection of (url,domain,content_text) files and save the results as JSON:
import org.warcbase.spark.matchbox.NERCombinedJson
sc.addFile("/path/to/english.all.3class.distsim.crf.ser.gz")
val nerJson = new NERCombinedJson
nerJson.classify("english.all.3class.distsim.crf.ser.gz", "/path/to/plaintext/collection", "nerjson/", sc)
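On the "one huge blob of JSON" point above: a minimal sketch of how the per-line output could be wrapped into a single JSON array, assuming each saved line is already a complete, valid JSON object. The object name CombineJsonLines, the method combine, and the sample records (including their field names) are made up for illustration and are not part of NERCombinedJson.

```scala
// Hypothetical helper, not part of warcbase: joins per-record JSON lines
// into one JSON array string.
object CombineJsonLines {
  // Assumes every non-empty line is already a complete, valid JSON value.
  // Blank lines are dropped; an empty input yields "[]".
  def combine(lines: Seq[String]): String =
    lines.map(_.trim).filter(_.nonEmpty).mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    // Made-up sample records; the real output schema may differ.
    val perRecord = Seq(
      """{"url":"http://example.com/a","entities":["Ottawa"]}""",
      """{"url":"http://example.com/b","entities":[]}"""
    )
    println(combine(perRecord))
  }
}
```

In a Spark shell the input lines could come from something like sc.textFile("nerjson/").collect(), assuming the results are small enough to fit in driver memory.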
The visualizer now works with JSON; I've documented it here. The HTML injection part of this issue won't work without further modification to the visualizer script, which I won't have time to do until at least next week.
Fantastic, thanks @jrwiebe - keep us posted on the HTML injection part. I see @aliceranzhou has been running into some issues over on #177.
Is this still worth pursuing, @jrwiebe? I know we ran into issues with the HTML injection part elsewhere.
Can we write one single Spark script that generates:
Then we can use the HTML injection trick to embed it straight into a Spark notebook.
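The thread doesn't spell out what the HTML injection step looks like, so this is only a rough sketch under one assumption: that it means wrapping the generated JSON in an HTML fragment which the notebook then renders and the visualizer script reads. The names EmbedJson, escapeForScript, and toHtml are hypothetical, and how the fragment gets displayed depends on the notebook in use.

```scala
// Hypothetical helper, not part of warcbase: wraps a JSON string in a
// <script type="application/json"> element that page scripts can read.
object EmbedJson {
  // Rewrite "</" as "<\/" so the JSON cannot close the <script> element early.
  // "\/" is a valid JSON escape for "/", so the parsed data is unchanged.
  def escapeForScript(json: String): String = json.replace("</", "<\\/")

  // elementId is the id a visualizer script would use to look up the data.
  def toHtml(json: String, elementId: String): String =
    s"""<script type="application/json" id="$elementId">${escapeForScript(json)}</script>"""
}
```

A visualizer could then fetch the element by id and parse its text content as JSON, rather than loading a separate file.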
Now that this is all in our codebase, let's make sure to get it documented in the Wiki too.