Closed ndushay closed 9 years ago
I'm wondering if this is where some of it comes from. See also https://github.com/sul-dlss/salt/blob/master/lib/tasks/salt.rake
https://github.com/sul-dlss/salt/blob/master/script/background/zotero_directory_watcher.rb: this script slurps up Zotero files placed in a directory (see the Consul page). It does the following:
zotero = ZoteroIngest.new(:filename => f)
zotero.process_file
zotero.save
https://github.com/sul-dlss/salt/blob/master/app/models/zotero_ingest.rb (Why is a script using a model in the Rails app? The model carries the comment "# this is used to track ingested zotero export files. The ZoteroDirectoryWatcher (script/background/zotero_directory_watcher.rb) deamon kicks off the processes.")
zotero_parser = Stanford::ZoteroParser.new(@processing_file, self)
zotero_parser.process_document
update_index(zotero_parser.processed_druids)
check_data()
def update_index(druids=[])
  index = Stanford::Indexer.new(druids, self)
  index.process_queue
end
https://github.com/sul-dlss/salt/blob/master/lib/stanford/indexer.rb#L40
def process_item(pid)
  log_message("Indexing item #{pid}")
  salt_doc = Stanford::SaltDocument.new(pid, { :repository => @repository })
  index(salt_doc)
end
https://github.com/sul-dlss/salt/blob/master/lib/stanford/salt_document.rb#L48
@datastreams = {}
if options[:datastreams].nil? or options[:datastreams] == :default
  get_datastreams(@pid,["extracted_entities", "zotero"])
else
  get_datastreams(@pid, options[:datastreams])
end
called by https://github.com/sul-dlss/salt/blob/master/lib/stanford/salt_document.rb#L239
called by https://github.com/sul-dlss/salt/blob/master/lib/stanford/salt_document.rb#L123-L127
2.1.0 :013 > puts repo.get_datastream('druid:rm893cm9160', 'extracted_entities')
<document>
<facets>
<facet type="city" id="116">Austin</facet>
<facet type="city" id="1811">San Diego</facet>
<facet type="city" id="1396">New York</facet>
<facet type="city" id="2287">Carnegie Mellon University</facet>
<facet type="company" id="10114">Teknowledge Inc.</facet>
...
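To see which facet types an extracted_entities datastream actually carries, something like the following could be run against its XML. This is only a sketch: `facet_counts` is a made-up helper, not part of the salt codebase, and the sample string is a shortened copy of the output above with the closing tags assumed (the real output is truncated).

```ruby
require 'rexml/document'

# Hypothetical helper (not in the salt repo): tally <facet> elements
# by the value of their "type" attribute.
def facet_counts(xml)
  doc = REXML::Document.new(xml)
  counts = Hash.new(0)
  REXML::XPath.each(doc, '//facet') do |facet|
    counts[facet.attributes['type']] += 1
  end
  counts
end

# Shortened sample modeled on the console output above; the closing
# </facets> and </document> tags are assumed.
sample = <<~XML
  <document>
    <facets>
      <facet type="city" id="116">Austin</facet>
      <facet type="city" id="1811">San Diego</facet>
      <facet type="company" id="10114">Teknowledge Inc.</facet>
    </facets>
  </document>
XML

p facet_counts(sample)
```

In the console session above, the string returned by repo.get_datastream(pid, 'extracted_entities') could presumably be passed in place of `sample`.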
2.1.0 :010 > puts repo.get_datastream('druid:rm893cm9160', 'zotero')
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:bib="http://purl.org/net/biblio#" xmlns:z="http://www.zotero.org/namespaces/export#" xmlns:link="http://purl.org/rss/1.0/modules/link/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:vcard="http://nwalsh.com/rdf/vCard#" xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/">
<bib:Manuscript rdf:about="https://saltworks.stanford.edu/documents/druid:rm893cm9160/downloads?download_id=document.pdf">
<z:itemType>manuscript</z:itemType><dcterms:isReferencedBy rdf:resource="#rm893cm9160"/>
<dc:subject>PUBLIC</dc:subject>
<dc:title>In Memoriam: Robert Engelmore</dc:title>
<dc:date>2003</dc:date>
<dc:identifier>
<dcterms:URI>
<rdf:value>https://saltworks.stanford.edu/documents/druid:rm893cm9160/downloads?download_id=document.pdf</rdf:value>
</dcterms:URI>
</dc:identifier>
<dc:coverage>Box: 28, Folder: 15, Title: Artificial Intelligence (AI) Magazine: In Memory of Robert Englemore2003</dc:coverage>
<dc:subject>
<dcterms:LCC><rdf:value>00008354</rdf:value></dcterms:LCC>
</dc:subject>
</bib:Manuscript>
<bib:Memo about="#rm893cm9160"><rdf:value>Manuscript (2003)</rdf:value></bib:Memo></rdf:RDF>
Per Scott's email to Peter on October 15, 2015 (copied below), we do not need to worry about the Person, Corporate Entity, City and Company facets.
Hi Peter,
Those facets ("Person", "City", "Corporate Entity", "Company") were derived algorithmically based on a small subset of the data; it was based on only about 4000 of the 16000 records. It was intended as an experiment, both in the auto-tagging process, and in the Saltworks GUI. In the end we did not think they were performing very well, as there were lots of near duplicates that needed to be merged, and it did not seem to accomplish much more than a full text search would. So it was never Ed's thinking that these facets would be ported over to the real archive. They can be dropped. The only meaningful keywords are the so-called donor keywords (which may be referred to as "tags" in the spreadsheet).
thanks,
Scott
As these four facets (Person, City, Corporate Entity and Company) are the only known use of the extracted_entities data, I believe we can safely ignore this datastream, and I am closing this ticket.
So it appears that some of the SALT index data comes from an 'extracted_entities' datastream in the SALT fedora objects.
Where does this information come from, and is this information in the MODS for the items being ingested into DOR? @peetucket