Replace `pstore` with Relaton XML storage

ronaldtse commented 6 years ago

@opoudjis @andrew2net

We wish to replace the current pstore caches used in all Relaton data sources (ISO, IEC, IETF etc) with Relaton XML storage.

The current pstore cache is simply not portable and hard to maintain in source code repositories. I want to maintain each Relaton entry as a separate file.

Each Relaton entry should be stored in a separate file, i.e. if we found "ISO 9001:2015", then within ~/.relaton/iso/ there should be ~/.relaton/iso/iso-9001-2015.xml (whether the ISO "type" is a subdirectory is subject to consideration).
Metadata about each entry (such as when it was last fetched) still needs to be stored somewhere else, which could be ~/.relaton/iso/index.xml.
Moving entries between the Global cache and the Local cache is easy: it is just a file replacement. If we disable the Global cache, all entries will be stored in the Local cache. If the Local cache is disabled, we store everything in the Global cache. If both caches are active, both should be updated to the latest entries.
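
The layout described above could be sketched roughly as follows. This is only an illustration of the one-file-per-entry idea, not the Relaton implementation: the helper names `entry_filename` and `store_entry`, the file-name normalisation rule, and the use of `nil` to mean "cache disabled" are all assumptions made for the example.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: derive a cache file name from a document identifier,
# e.g. "ISO 9001:2015" becomes "iso-9001-2015.xml".
def entry_filename(docid)
  docid.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "") + ".xml"
end

# Hypothetical helper: write one entry into every enabled cache directory.
# A nil directory stands for a disabled cache (an assumption of this sketch).
# Returns the list of paths that were written.
def store_entry(docid, xml, type:, global_dir:, local_dir:)
  [global_dir, local_dir].compact.map do |dir|
    path = File.join(dir, type, entry_filename(docid))
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, xml)
    path
  end
end

# Usage: both caches enabled, so the same file lands in both directories.
Dir.mktmpdir do |tmp|
  written = store_entry("ISO 9001:2015", "<bibitem/>",
                        type: "iso",
                        global_dir: File.join(tmp, "global"),
                        local_dir: File.join(tmp, "local"))
  puts written.map { |p| p.sub(tmp, "") }
end
```

Because each entry is a plain file, moving it between caches is an ordinary file copy, and a disabled cache is simply skipped.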

Please help plan this out. Thanks.

andrew2net commented 6 years ago

@ronaldtse OK, I will get to it in a few days. I have to finish some other work first.

opoudjis commented 6 years ago

As a consequence, the from_xml routine needs to be robust and tested for each biblio class.

andrew2net commented 6 years ago

If we need to recreate an item from XML, then we should store the class name of the item in the cache. I see two ways to do this:

When a processor registers with Relaton, it should store its class name:

module Relaton
  module Isobib
    class Processor < Relaton::Processor
      def initialize
        @short = :isobib
        # Class used to rebuild an item from cached XML (the proposed addition)
        @class_name = 'IsoBibItem'
        @prefix = "ISO"
        @defaultprefix = %r{^(ISO)[ /]|^IEV($| )|^IEC 60050}
        @idtype = "ISO"
      end
    end
  end
end

So in Relaton we can associate the class with the item and save the class name in the cache.

Return objects instead of XML strings from IsoBib, GbBib, etc. Then we could get the class name from the objects and save it together with the serialized objects in XML.

Any thoughts?

ronaldtse commented 6 years ago

@andrew2net we do not want to store the "class name" because those are subject to change.

The prefix (or author information) will already tell us what class we need to instantiate, right?

andrew2net commented 6 years ago

@ronaldtse Not quite. Caching lives in the Relaton gem. Other gems such as IsoBib and GbBib register themselves with Relaton and provide a prefix and a method to get an item. The prefix allows Relaton to recognize the related reference. So we can get an item in XML format and store it in the cache. But when we read the item back from the cache, we don't know which class to use to create the item's object from the XML. If we hard-code the prefix-to-class association in Relaton, then we can't register new gems in Relaton without changing its code. So a gem that registers with Relaton should provide a class name.
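
The registration argument above can be illustrated with a small sketch. It is not Relaton's actual registry: the `Registry` class, its `register` and `processor_for` methods, and the `IsoItem` stand-in are hypothetical names invented here. The only thing taken from the thread is the prefix pattern. The point is that the registering gem supplies the way to rebuild its own items, so the cache needs no hard-coded prefix-to-class table.

```ruby
# Hypothetical registry: each gem registers a prefix pattern together with
# a callable that rebuilds an item from cached XML.
class Registry
  Entry = Struct.new(:short, :prefix, :from_xml)

  def initialize
    @entries = []
  end

  def register(short, prefix:, from_xml:)
    @entries << Entry.new(short, prefix, from_xml)
  end

  # Find the first registered processor whose prefix pattern matches the
  # reference; returns nil when no gem claims it.
  def processor_for(ref)
    @entries.find { |e| e.prefix.match?(ref) }
  end
end

# Stand-in for a real bibliographic item class (assumption for this sketch).
IsoItem = Struct.new(:xml)

registry = Registry.new
registry.register(:isobib,
                  prefix: %r{^(ISO)[ /]|^IEV($| )|^IEC 60050},
                  from_xml: ->(xml) { IsoItem.new(xml) })

# Rebuild an item from cached XML without Relaton knowing the class itself.
item = registry.processor_for("ISO 9001:2015").from_xml.call("<bibitem/>")
puts item.class
```

A new gem can be added by calling `register` again, with no change to the registry code.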

ronaldtse commented 6 years ago

@andrew2net this must be a misunderstanding. There should not be a class name inside this cache. This is not a binary cache, it is a bibliographic entry that will be stored in XML format.

In order to know which class to instantiate from the XML, the "type" of the object is stored in the XML itself. For example, if the entry is:

<bibdata type="uri:calconnect.org:documents:standard">     <==== the type that tells you
  <title language="en" format="plain">Guidelines to thwart calendar abuse for calendaring and mail system operators</title>
  <docidentifier>CD 18XX</docidentifier>
  <contributor>
    <role type="author"/>
    <organization>
      <name>CalConnect</name>
    </organization>
  </contributor>
  <contributor>
    <role type="publisher"/>
    <organization>
      <name>CalConnect</name>
    </organization>
  </contributor>
  <language>en</language>
  <script>Latn</script>
  <status format="plain">working-draft</status>
  <copyright>
    <from>2018</from>
    <owner>
      <organization>
        <name>CalConnect</name>
      </organization>
    </owner>
  </copyright>
  <editorialgroup>
    <technical-committee>CALSPAM</technical-committee>
  </editorialgroup>
</bibdata>

We can infer the type from the "type" attribute.
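
As a rough sketch of that inference, the following reads the `type` attribute from the root element and uses it to pick a builder. The `BUILDERS` table and `build_from_xml` are hypothetical names for this example, and the builder just returns a small array instead of a real bibliographic object. REXML is used only because it ships with standard Ruby installs.

```ruby
require "rexml/document"

# Hypothetical mapping from the root element's type attribute to a builder.
# A real implementation would instantiate a bibliographic item class here.
BUILDERS = {
  "uri:calconnect.org:documents:standard" =>
    ->(doc) { [:calconnect, doc.root.elements["docidentifier"].text] },
}.freeze

# Parse the XML, read the type attribute, and dispatch to the matching builder.
def build_from_xml(xml)
  doc = REXML::Document.new(xml)
  type = doc.root.attributes["type"]
  builder = BUILDERS.fetch(type) do
    raise ArgumentError, "no builder registered for type #{type.inspect}"
  end
  builder.call(doc)
end

xml = <<~XML
  <bibdata type="uri:calconnect.org:documents:standard">
    <docidentifier>CD 18XX</docidentifier>
  </bibdata>
XML

p build_from_xml(xml)
```

An entry whose type has no registered builder raises an error instead of being silently mapped to the wrong class.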

cc: @opoudjis (see how the type incorporates the namespace)

opoudjis commented 6 years ago

Eh.... the type currently does not incorporate a namespace; it is just a token like "standard". And I don't think it should have a namespace.

The class is being inferred from the document identifier prefix by default, but that will not generalise in all cases (particularly GB). The author or publisher contributor could be used as well, but that leads to a complex and fallible rules engine.

Hate to say this, but the cleanest way to address this is to add a new top-level attribute to all bibdata retrieved from relaton, such as "source", naming the class it was derived through. @ronaldtse, is that OK?

andrew2net commented 6 years ago

@ronaldtse we also have to store the fetched date in the cache, so I use YAML files:

---
fetched: 2018-10-01
bib: |-
  <bibitem type="standard" id="GB/T20223">
    <title format="text/plain" language="zh" script="Hans">棉短绒</title>
    <title format="text/plain" language="en" script="Latn">Cotton linter</title>
...
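
Reading such a file back could look like the sketch below. The sample entry is a shortened stand-in for the one above (the closing `</bibitem>` tag is added here only to make the sample well-formed). Since the bare `fetched` value is parsed as a `Date`, that class has to be explicitly permitted when loading safely.

```ruby
require "date"
require "yaml"

# Shortened stand-in for a cached entry; the closing tag is an assumption
# added so the sample is well-formed.
entry = <<~YAML
  ---
  fetched: 2018-10-01
  bib: |-
    <bibitem type="standard" id="GB/T20223">
      <title format="text/plain" language="en" script="Latn">Cotton linter</title>
    </bibitem>
YAML

# The unquoted date is deserialized as a Date, so it must be permitted.
data = YAML.safe_load(entry, permitted_classes: [Date])

puts data["fetched"].class
puts data["bib"].lines.first
```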

It's not a problem to add a class_name attribute.

@opoudjis I think the easiest way would be to add a to_xml method to the processor interface. What do you think?

ronaldtse commented 6 years ago

I strongly urge the usage of some type or namespace. For example, <bibdata type="uri:calconnect.org:documents:standard"> totally makes sense and is not an xmlns (which everyone hates).

We can store the fetched date in the <bibitem> as well, because it is valid (in citations we often have "last accessed" date too).

And we want the files to be usable as standalone files as well.

andrew2net commented 6 years ago

So we need two new attributes in bibdata (class_name and fetched), don't we?

ronaldtse commented 6 years ago

I think fetched should be an element, and class-name should really be 'type' for the “uri:...”
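
A minimal sketch of the "fetched as an element" idea follows. The element name comes from the discussion. Everything else is an assumption made for illustration: where the element sits in the schema, the helper names `stamp_fetched` and `stale?`, and the 60-day default threshold.

```ruby
require "date"
require "rexml/document"

# Add or update a <fetched> child on the root element, so the date travels
# with the entry and the file stays usable on its own.
def stamp_fetched(xml, date)
  doc = REXML::Document.new(xml)
  fetched = doc.root.elements["fetched"] || doc.root.add_element("fetched")
  fetched.text = date.iso8601
  doc.to_s
end

# An entry counts as stale if it has no <fetched> date or if that date is
# more than max_age_days old (the 60-day default is a hypothetical value).
def stale?(xml, today:, max_age_days: 60)
  node = REXML::Document.new(xml).root.elements["fetched"]
  return true if node.nil?

  (today - Date.iso8601(node.text)).to_i > max_age_days
end

stamped = stamp_fetched('<bibitem type="standard" id="GB/T20223"/>',
                        Date.new(2018, 10, 1))
puts stamped
```

With the date inside the entry, no separate index file is needed to decide whether a cached entry should be fetched again.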

andrew2net commented 6 years ago

Done. @opoudjis I added a from_xml method to the processor and made some fixes in IsoBibItem and the *bib gems. So we need to republish the Relaton, IsoBibItem, IsoBib, GbBib, and IETFBib gems.

relaton / relaton-iso

Replace `pstore` with Relaton XML storage #36