sul-dlss / exhibits

Stanford University Libraries online exhibits showcase
https://exhibits.stanford.edu

SALT indexing: extracted_entities datastream data #43

Closed ndushay closed 9 years ago

ndushay commented 9 years ago

So it appears that some of the SALT index data comes from an 'extracted_entities' datastream in the SALT fedora objects.

Where does this information come from, and is this information in the MODS for the items being ingested into DOR? @peetucket

ndushay commented 9 years ago

I'm wondering if this is where some of it comes from. See also https://github.com/sul-dlss/salt/blob/master/lib/tasks/salt.rake

  1. https://github.com/sul-dlss/salt/blob/master/script/background/zotero_directory_watcher.rb this script slurps up zotero files put in a directory (see consul page). It does the following:

    zotero = ZoteroIngest.new(:filename => f)
    zotero.process_file
    zotero.save

  2. https://github.com/sul-dlss/salt/blob/master/app/models/zotero_ingest.rb (why is a script using a model in the Rails app? The model carries the comment "# this is used to track ingested zotero export files. The ZoteroDirectoryWatcher (script/background/zotero_directory_watcher.rb) deamon kicks off the processes.")

    zotero_parser = Stanford::ZoteroParser.new(@processing_file, self)
    zotero_parser.process_document

    update_index(zotero_parser.processed_druids)
    check_data()

    def update_index(druids=[])
      index = Stanford::Indexer.new(druids, self)
      index.process_queue
    end

  3. https://github.com/sul-dlss/salt/blob/master/lib/stanford/zotero_parser.rb
  4. https://github.com/sul-dlss/salt/blob/master/lib/stanford/indexer.rb#L40

    def process_item(pid)
      log_message("Indexing item #{pid}")
      salt_doc = Stanford::SaltDocument.new(pid, { :repository => @repository })
      index(salt_doc)
    end

  5. https://github.com/sul-dlss/salt/blob/master/lib/stanford/salt_document.rb#L48

    @datastreams = {}
    if options[:datastreams].nil? or options[:datastreams] == :default
      get_datastreams(@pid,["extracted_entities", "zotero"])
    else
      get_datastreams(@pid, options[:datastreams])
    end

  6. https://github.com/sul-dlss/salt/blob/master/lib/stanford/zotero_to_json.php --> used to create a hash that is turned into a solr doc:

called by https://github.com/sul-dlss/salt/blob/master/lib/stanford/salt_document.rb#L239

called by https://github.com/sul-dlss/salt/blob/master/lib/stanford/salt_document.rb#L123-L127
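
To make the default in step 5 concrete, here is a minimal standalone sketch of the same selection logic. The constant `DEFAULT_DATASTREAMS` and the method `datastreams_to_fetch` are hypothetical names used only for illustration; this is not the actual SALT code, just a reading of the excerpt above.

```ruby
# Hypothetical sketch mirroring the default handling quoted from
# salt_document.rb; the names below are illustrative, not from SALT.
DEFAULT_DATASTREAMS = %w[extracted_entities zotero].freeze

# Returns the datastream names that would be fetched for the given options.
def datastreams_to_fetch(options = {})
  requested = options[:datastreams]
  return DEFAULT_DATASTREAMS if requested.nil? || requested == :default

  Array(requested)
end

p datastreams_to_fetch                              # default: both datastreams
p datastreams_to_fetch(:datastreams => ["zotero"])  # skips extracted_entities
```

If the excerpt is read correctly, extracted_entities is pulled in only because it is part of the default list, so passing an explicit `:datastreams` option would presumably leave it out.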

ndushay commented 9 years ago

2.1.0 :013 > puts repo.get_datastream('druid:rm893cm9160', 'extracted_entities')

  <document>
    <facets>
      <facet type="city" id="116">Austin</facet>
      <facet type="city" id="1811">San Diego</facet>
      <facet type="city" id="1396">New York</facet>
      <facet type="city" id="2287">Carnegie Mellon University</facet>
      <facet type="company" id="10114">Teknowledge Inc.</facet>
   ...
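
A rough sketch of how facet values like these could be grouped by type, for anyone who wants to inspect the datastream further. It uses REXML; the method name `facets_by_type` is hypothetical, and the closing `</facets>` and `</document>` tags in the sample are assumed, since the output above is truncated.

```ruby
require "rexml/document"

# Sample trimmed from the datastream output in this comment. The closing
# tags are an assumption: the real output is truncated in the thread.
SAMPLE_XML = <<~XML
  <document>
    <facets>
      <facet type="city" id="116">Austin</facet>
      <facet type="city" id="1811">San Diego</facet>
      <facet type="company" id="10114">Teknowledge Inc.</facet>
    </facets>
  </document>
XML

# Groups facet values by their type attribute (hypothetical helper).
def facets_by_type(xml)
  doc = REXML::Document.new(xml)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  doc.elements.each("document/facets/facet") do |facet|
    grouped[facet.attributes["type"]] << facet.text
  end
  grouped
end

p facets_by_type(SAMPLE_XML)
```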

2.1.0 :010 > puts repo.get_datastream('druid:rm893cm9160', 'zotero')

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:bib="http://purl.org/net/biblio#" xmlns:z="http://www.zotero.org/namespaces/export#" xmlns:link="http://purl.org/rss/1.0/modules/link/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:vcard="http://nwalsh.com/rdf/vCard#" xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/">
           <bib:Manuscript rdf:about="https://saltworks.stanford.edu/documents/druid:rm893cm9160/downloads?download_id=document.pdf">
        <z:itemType>manuscript</z:itemType><dcterms:isReferencedBy rdf:resource="#rm893cm9160"/>
        <dc:subject>PUBLIC</dc:subject>
        <dc:title>In Memoriam: Robert Engelmore</dc:title>
        <dc:date>2003</dc:date>
        <dc:identifier>
            <dcterms:URI>
                <rdf:value>https://saltworks.stanford.edu/documents/druid:rm893cm9160/downloads?download_id=document.pdf</rdf:value>
            </dcterms:URI>
        </dc:identifier>
        <dc:coverage>Box: 28, Folder: 15, Title: Artificial Intelligence (AI) Magazine: In Memory of       Robert Englemore2003</dc:coverage>
        <dc:subject>
           <dcterms:LCC><rdf:value>00008354</rdf:value></dcterms:LCC>
        </dc:subject>
    </bib:Manuscript>
           <bib:Memo about="#rm893cm9160"><rdf:value>Manuscript (2003)</rdf:value></bib:Memo></rdf:RDF>

ndushay commented 9 years ago

Per Scott's email to Peter on October 15, 2015 (copied below), we do not need to worry about the Person, Corporate Entity, City and Company facets.

Hi Peter,

Those facets ("Person", "City", "Corporate Entity", "Company") were derived algorithmically based on a small subset of the data; it was only based on about 4000 of the 16000 records. It was intended as an experiment, both in the auto-tagging process, and in the Saltworks GUI. In the end we did not think they were performing very well, as there were lots of near duplicates that needed to be merged, and it did not seem to accomplish much more than a full text search would. So it was never Ed's thinking that these facets would be ported over to the real archive. They can be dropped. The only meaningful keywords are the so-called donor keywords (which may be referred to as "tags" in the spreadsheet).

thanks,

Scott

ndushay commented 9 years ago

As these four facets (Person, City, Corporate Entity and Company) are the only known use of the extracted_entities data, I believe we can safely ignore this datastream, and I am closing this ticket.