gklyne / milarq

Automatically exported from code.google.com/p/milarq

Replacing blank dates with 9999 causes subsequent result errors #5

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
When preprocessing of a date value yields blank or otherwise invalid data, 
it would be better to exclude the date from the output data than to 
convert it to "9999": the incorrect but apparently valid 9999 value can 
cause unexpected results later.

Unfortunately, because of the way the code is constructed, it is hard to see 
how best to do this without also excluding some valid dates; e.g. some of the 
antiquarian photos have just a single value, for claros:not_before, and a 
blank value for claros:not_after:

{{{
            <crm:E61.Time_Primitive>
              <claros:not_before rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">1930s</claros:not_before>
              <claros:not_after rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear"></claros:not_after>
            </crm:E61.Time_Primitive>
}}}

This is fundamentally a data error (i.e. the blank value for claros:not_after), 
but we need to handle these with some degree of grace.

In this case, I think the data should be treated as if the claros:not_after 
statement is not present in the input data.
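
A minimal sketch of that behaviour, in Python. The function names `clean_year` and `date_bounds` are hypothetical, not taken from the milarq code, and reading "1930s" as the year 1930 is an assumption about how the preprocessing treats such values. The point is only that a bound which does not parse is left out instead of being replaced by 9999:

```python
import re

# Assumption: a usable date value starts with a four-digit year, so "1930s" -> 1930.
_YEAR = re.compile(r"\s*(\d{4})")

def clean_year(raw):
    """Return the year as an int, or None if the value is blank or unparseable."""
    if raw is None:
        return None
    m = _YEAR.match(raw)
    return int(m.group(1)) if m else None

def date_bounds(not_before, not_after):
    """Collect only the bounds that parsed; a bad bound is omitted, never set to 9999."""
    bounds = {}
    for name, raw in (("not_before", not_before), ("not_after", not_after)):
        year = clean_year(raw)
        if year is not None:
            bounds[name] = year
    return bounds

# The record from this issue: not_before is "1930s", not_after is blank.
print(date_bounds("1930s", ""))
```

With the values from the example record, only `not_before` survives, as if the `claros:not_after` statement had never been in the input.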

This in turn raises the question of what to do with the generated indexes, which 
currently assume not_before and not_after always occur in pairs.  (This is not 
necessarily a requirement on the input data).

For the resulting application, the question becomes what a query fragment 
like the one below should return if the data is not present:

{{{
                ?s claros:subject-not-before ("%(keyword)s" ?beg ?end)
                FILTER ( ?beg >= %(beg)s ) .
                FILTER ( ?end <= %(end)s ) .
}}}

I think the answer is that it should be the same as:
{{{
                ?lit pf:textMatch ('LEKYTHOS') .
                ?s claros:hasLiteral ?lit .
                ?s crm:P108I.was_produced_by
                  [ crm:P4.has_time-span
                    [ crm:P82.at_some_time_within
                      [ claros:not_before ?beg ;
                        claros:not_after ?end ;
                      ]
                    ]
                  ] .
              FILTER ( ?beg >= %(beg)s ) .
              FILTER ( ?end <= %(end)s ) .
}}}

which would be that no result is returned for entries that don't define both 
start and end dates.

But there's another case to consider: what should a query fragment like this 
return if the ?end data is not present:
{{{
                ?s claros:subject-not-before ("%(keyword)s" ?beg ?end)
                FILTER ( ?beg >= %(beg)s ) .
}}}

Following the same line, I think the answer is that it should be the same as:
{{{
                ?lit pf:textMatch ('LEKYTHOS') .
                ?s claros:hasLiteral ?lit .
                ?s crm:P108I.was_produced_by
                  [ crm:P4.has_time-span
                    [ crm:P82.at_some_time_within
                      [ claros:not_before ?beg ;
                      ]
                    ]
                  ] .
              FILTER ( ?beg >= %(beg)s ) .
}}}

i.e. the absence of the ?end value should not affect the result.
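
The two cases can be sketched as a small pure-Python simulation. This is not the milarq index code: `match_index` and the row layout are hypothetical, with `None` standing for an unbound variable. It mirrors the SPARQL rule that a FILTER comparing an unbound variable raises an error and so rejects the solution, while a variable that no FILTER mentions may stay unbound:

```python
def match_index(rows, keyword, beg=None, end=None):
    """Return subjects matching keyword and whichever date filters are given.

    rows: iterable of (keyword, subject, not_before, not_after);
          None in a date column means the variable is unbound.
    beg/end: filter limits; None means the query has no such FILTER.
    """
    results = []
    for kw, subject, not_before, not_after in rows:
        if kw != keyword:
            continue
        # FILTER ( ?beg >= beg ): an unbound ?beg makes the filter fail.
        if beg is not None and (not_before is None or not_before < beg):
            continue
        # FILTER ( ?end <= end ): an unbound ?end makes the filter fail.
        if end is not None and (not_after is None or not_after > end):
            continue
        results.append(subject)
    return results

rows = [
    ("LEKYTHOS", "s1", 1930, None),   # blank not_after, as in this issue
    ("LEKYTHOS", "s2", 1930, 1940),
]
both_filters = match_index(rows, "LEKYTHOS", beg=1900, end=2000)
beg_only = match_index(rows, "LEKYTHOS", beg=1900)
```

Here `both_filters` contains only `s2`, because `s1` has no end date to test, while `beg_only` contains both subjects.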

....

Full example record is:
{{{
<crm:E22.Man-Made_Object rdf:about="http://www.beazley.ox.ac.uk/record/254909D6-5A1D-4919-80D5-2B607F2DBD08">
  <rdfs:label>Photograph 478</rdfs:label>
  <crm:P102.has_title>
    <crm:E35.Title>
      <rdf:value>Photograph 478</rdf:value>
    </crm:E35.Title>
  </crm:P102.has_title>
  <crm:P2.has_type>
    <crm:E55.Type>
      <rdf:value>Photograph</rdf:value>
      <crm:P127.has_broader_term rdf:resource="http://purl.org/NET/Claros/vocab#ObjectType" />
    </crm:E55.Type>
  </crm:P2.has_type>
  <crm:P48.has_preferred_identifier>
    <crm:E42.Identifier>
      <rdf:value>Photograph 478</rdf:value>
    </crm:E42.Identifier>
  </crm:P48.has_preferred_identifier>
  <crm:P108I.was_produced_by>
    <crm:E12.Production>
      <rdfs:label>Production of Photograph 478</rdfs:label>
      <crm:P126.employed>
        <crm:E57.Material>
          <rdfs:label></rdfs:label>
        </crm:E57.Material>
      </crm:P126.employed>
      <crm:P4.has_time-span>
        <crm:E52.Time-Span>
          <crm:P82.at_some_time_within>
            <crm:E61.Time_Primitive>
              <claros:not_before rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">1930s</claros:not_before>
              <claros:not_after rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear"></claros:not_after>
            </crm:E61.Time_Primitive>
          </crm:P82.at_some_time_within>
        </crm:E52.Time-Span>
      </crm:P4.has_time-span>
    </crm:E12.Production>
  </crm:P108I.was_produced_by>
  <crm:P14I.was_classified_by>
    <crm:E17.Type_Assignment>
      <crm:P42.assigned>
        <crm:E55.Type>
          <rdfs:label>DUNBABIN ARCHIVE</rdfs:label>
        </crm:E55.Type>
      </crm:P42.assigned>
      <crm:P42.assigned>
        <crm:E55.Type>
          <rdfs:label>A</rdfs:label>
        </crm:E55.Type>
      </crm:P42.assigned>
    </crm:E17.Type_Assignment>
  </crm:P14I.was_classified_by>
  <crm:P53.has_former_or_current_location>
    <crm:E53.Place>
      <rdfs:label>GREECE, ATTICA, RHAMNOUS</rdfs:label>
      <crm:P87.is_identified_by>
        <crm:E48.Place_Name>
          <rdf:value>GREECE, ATTICA, RHAMNOUS</rdf:value>
        </crm:E48.Place_Name>
      </crm:P87.is_identified_by>
    </crm:E53.Place>
  </crm:P53.has_former_or_current_location>
  <crm:P53.has_former_or_current_location>
    <crm:E53.Place>
      <rdfs:label>RHAMNOUS, SANCTUARY OF NEMESIS</rdfs:label>
      <crm:P87.is_identified_by>
        <crm:E48.Place_Name>
          <rdf:value>RHAMNOUS, SANCTUARY OF NEMESIS</rdf:value>
        </crm:E48.Place_Name>
      </crm:P87.is_identified_by>
    </crm:E53.Place>
  </crm:P53.has_former_or_current_location>
  <crm:P138I.has_representation>
    <crm:E38.Image rdf:about="http://www.beazley.ox.ac.uk/Photography/SPIFF/newx/478.D/cc001001.jpe">
      <rdfs:label>Image of Photograph 478</rdfs:label>
    </crm:E38.Image>
  </crm:P138I.has_representation>
  <crm:P138I.has_representation>
    <crm:E38.Image rdf:about="http://www.beazley.ox.ac.uk/Photography/SPIFF/newx/478.C/cc001001.jpe">
      <rdfs:label>Image of Photograph 478</rdfs:label>
    </crm:E38.Image>
  </crm:P138I.has_representation>
  <crm:P138I.has_representation>
    <crm:E38.Image rdf:about="http://www.beazley.ox.ac.uk/Photography/SPIFF/newx/478.B/cc001001.jpe">
      <rdfs:label>Image of Photograph 478</rdfs:label>
    </crm:E38.Image>
  </crm:P138I.has_representation>
  <crm:P138I.has_representation>
    <crm:E38.Image rdf:about="http://www.beazley.ox.ac.uk/Photography/SPIFF/newx/478.a/cc001001.jpe">
      <rdfs:label>Image of Photograph 478</rdfs:label>
    </crm:E38.Image>
  </crm:P138I.has_representation>
  <crm:P138I.has_representation>
    <crm:E38.Image rdf:about="http://www.beazley.ox.ac.uk/Photography/SPIFF/newx/478/cc001001.jpe">
      <rdfs:label>Image of Photograph 478</rdfs:label>
    </crm:E38.Image>
  </crm:P138I.has_representation>
  <crm:P138I.has_representation>
    <crm:E38.Image rdf:about="http://www.beazley.ox.ac.uk/Photography/SPIFF/newx/478.a/ac001001.jpe">
      <rdfs:label>Image of Photograph 478</rdfs:label>
      <crm:P2.has_type rdf:resource="&claros;Thumbnail" />
    </crm:E38.Image>
  </crm:P138I.has_representation>
  <crm:P70I.is_documented_in rdf:resource="http://www.beazley.ox.ac.uk/record/254909D6-5A1D-4919-80D5-2B607F2DBD08" />
</crm:E22.Man-Made_Object>
}}}

Original issue reported on code.google.com by gk-goo...@ninebynine.org on 13 Sep 2010 at 8:59

GoogleCodeExporter commented 9 years ago
Proposal. Allow the index file to have blank fields for values; such a field 
means that the corresponding variable remains unbound. A typical filter on 
that variable will then fail, which is appropriate. 

Needs update to genericIndex to act on blank fields. Needs update to 
generate_sorted_indexes to create blank fields, specifically for dates. May 
need some form of templating so that it knows which CSV fields are to be 
treated in which way. Note that empty strings would have to be represented in 
the index as "" and that genericIndex will have to understand such strings if 
it doesn't already.
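
A rough sketch of the blank-versus-"" distinction, in Python. This is not the genericIndex or generate_sorted_indexes code: the function names and the comma-separated layout are assumptions for illustration. A hand-rolled parser is used because Python's standard `csv` reader returns an empty string for both a blank field and a quoted `""`:

```python
import re

# A field is either a quoted string (with "" as an escaped quote) or bare text.
_FIELD = re.compile(r'"((?:[^"]|"")*)"|([^,"]*)')

def format_index_line(fields):
    """None -> blank field (unbound); '' -> "" ; other strings quoted only if needed."""
    out = []
    for f in fields:
        if f is None:
            out.append("")
        elif f == "" or "," in f or '"' in f:
            out.append('"' + f.replace('"', '""') + '"')
        else:
            out.append(f)
    return ",".join(out)

def parse_index_line(line):
    """Inverse of format_index_line: blank field -> None, "" -> empty string."""
    line = line.rstrip("\r\n")
    fields, pos = [], 0
    while True:
        m = _FIELD.match(line, pos)
        if m.group(1) is not None:          # quoted field, possibly empty
            fields.append(m.group(1).replace('""', '"'))
        elif m.group(2) == "":              # blank field: variable stays unbound
            fields.append(None)
        else:
            fields.append(m.group(2))
        pos = m.end()
        if pos >= len(line):
            return fields
        if line[pos] != ",":
            raise ValueError("malformed index line at column %d" % pos)
        pos += 1

line = format_index_line(["LEKYTHOS", "1930", None])
parsed = parse_index_line(line)
```

For a row whose not_after is missing, `line` ends in a trailing comma and `parsed` has `None` in the last position, which the index code could then leave unbound.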

Original comment by ehog.he...@googlemail.com on 17 Sep 2010 at 10:00

GoogleCodeExporter commented 9 years ago

Original comment by gk-goo...@ninebynine.org on 30 Sep 2010 at 3:00