holtzermann17 / planetmath-docs

Documentation for the PlanetMath website and organization to be read in combination with the Planetary issue tracker.
https://github.com/KWARC/planetary/issues?labels=&milestone=&page=1&state=open

PlanetMath metadata #40

Closed (dginev closed this issue 11 years ago)

dginev commented 11 years ago

I am reindexing PlanetMath.org with NNexus today, trying to get all issues sorted out. However, the metadata still seems shaky and I am not sure how best to index it. Here is a good example that summarizes most of my concerns:

http://planetmath.org/isoscelestriangle

We see the following metadata:

<div class="ltx_rdf" property="dct:identifier" content="IsoscelesTriangle"/>
<div class="ltx_rdf" property="dct:created" datatype="xsd:date" content="2013-03-21 12:22:02"/>
<div class="ltx_rdf" property="dct:modified" datatype="xsd:date" content="2013-03-21 12:22:02"/>
<div class="ltx_rdf" resource="pmuser:drini" property="pm:owner"/>
<div class="ltx_rdf" resource="pmuser:drini" property="pm:modifier"/>
<div class="ltx_rdf" property="dct:title" content="isosceles triangle"/>
<div class="ltx_rdf" property="dct:hasVersion" content="16"/>
<div class="ltx_rdf" property="pm:privacy" datatype="xsd:integer" content="1"/>
<div class="ltx_rdf" resource="pmuser:drini" property="dct:creator"/>
<div class="ltx_rdf" property="dct:type" content="Definition"/>
<div class="ltx_rdf" resource="msc:51-00" property="dct:subject"/>
<div class="ltx_rdf" resource="msc:49J20" property="dct:subject"/>
<div class="ltx_rdf" resource="msc:49J30" property="dct:subject"/>
<div class="ltx_rdf" resource="msc:49-01" property="dct:subject"/>
<div class="ltx_rdf" about="pmconcept:IsoscelesTriangle" property="pm:synonym" content="isosceles"/>
<div class="ltx_rdf" resource="pmarticle:Triangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:RightTriangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:EquilateralTriangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:EquivalentConditionsForTriangles" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:EquiangularTriangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:RegularTriangle" property="pm:related"/>
<div class="ltx_rdf" property="pm:defines" content="pmconcept:base angle"/>
<div class="ltx_rdf" property="pm:defines" content="pmconcept:vertex angle"/>

The article has dct:title _isosceles triangle_ and provides the synonym _isosceles_ for it via pm:synonym. However, it also pm:defines two other concepts, namely _base angle_ and _vertex angle_.
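To make the shape of this metadata concrete, here is a minimal sketch of how the `ltx_rdf` divs could be read into a property-to-values map. It uses only Python's standard `html.parser`; the class and function names (`RdfMetaParser`, `parse_metadata`) are made up for illustration and are not part of NNexus or LaTeXML, and the sample is a shortened copy of the listing above.

```python
from html.parser import HTMLParser


class RdfMetaParser(HTMLParser):
    """Collect (property, value) pairs from <div class="ltx_rdf" .../> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags are also routed here by HTMLParser's default handling.
        a = dict(attrs)
        if tag != "div" or a.get("class") != "ltx_rdf":
            return
        # Literal values sit in `content`; links to other things sit in `resource`.
        value = a.get("content", a.get("resource"))
        if a.get("property") and value is not None:
            self.pairs.append((a["property"], value))


def parse_metadata(html):
    """Return {property: [values, ...]} in document order (hypothetical helper)."""
    parser = RdfMetaParser()
    parser.feed(html)
    parser.close()
    meta = {}
    for prop, value in parser.pairs:
        meta.setdefault(prop, []).append(value)
    return meta


# Shortened copy of the isosceles triangle metadata shown above.
SAMPLE = """
<div class="ltx_rdf" property="dct:title" content="isosceles triangle"/>
<div class="ltx_rdf" resource="msc:51-00" property="dct:subject"/>
<div class="ltx_rdf" about="pmconcept:IsoscelesTriangle" property="pm:synonym" content="isosceles"/>
<div class="ltx_rdf" property="pm:defines" content="pmconcept:base angle"/>
<div class="ltx_rdf" property="pm:defines" content="pmconcept:vertex angle"/>
"""

meta = parse_metadata(SAMPLE)
print(meta["dct:title"])   # ['isosceles triangle']
print(meta["pm:defines"])  # ['pmconcept:base angle', 'pmconcept:vertex angle']
```

Keeping every property as a list means repeated properties such as dct:subject or pm:defines need no special casing.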

My first thought was to index only articles that have pm:defines, but that clearly falls short for this article, which also defines its dct:title. Then again, there are articles that really don't define anything, such as this one, which I don't want to index (as expected, they don't have any pm:defines). But maybe I have to live with some junk in the index...

Is this sane:

That would cover the triangle article, and I will have to live with the junk from the other article, which is in any case too specific to ever get linked against. We should have a metadata-curation initiative for the PM articles at some point.

dginev commented 11 years ago

Oh, there are many examples of articles that define concepts but don't use pm:defines, as they seemingly assume their dct:title will be indexed as a concept name. Here is one example.

holtzermann17 commented 11 years ago

@dginev - treating the title (dct:title) as the defined term is indeed the legacy way of thinking about things. pm:defines is for extra definitions that aren't equivalent to the title, whereas pm:synonym is for extra terms that are equivalent to the title.

In short: Your plan in bullet points above does seem like the right thing to do.
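The bullet points themselves are not preserved in this thread, so the following is only one reading of the semantics described in this comment: the title is a concept, every pm:synonym is another name for it, and every pm:defines entry is an additional concept. The function name `concepts_to_index` and the dictionary layout are assumptions made for this sketch, not NNexus code.

```python
def concepts_to_index(meta):
    """Return the concept names one article contributes, without duplicates.

    `meta` maps a property name to a list of string values (assumed layout).
    """
    prefix = "pmconcept:"
    names = []
    # Order: title first, then its synonyms, then the extra definitions.
    for prop in ("dct:title", "pm:synonym", "pm:defines"):
        for value in meta.get(prop, []):
            # pm:defines values in the sample carry a "pmconcept:" prefix.
            if value.startswith(prefix):
                value = value[len(prefix):]
            name = value.strip()
            if name and name not in names:
                names.append(name)
    return names


isosceles = {
    "dct:title": ["isosceles triangle"],
    "pm:synonym": ["isosceles"],
    "pm:defines": ["pmconcept:base angle", "pmconcept:vertex angle"],
}
print(concepts_to_index(isosceles))
# ['isosceles triangle', 'isosceles', 'base angle', 'vertex angle']
```

An article with a title but no pm:defines still yields one concept under this reading, which is the "junk" case accepted earlier in the thread.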

dginev commented 11 years ago

I am working on that... another addition is that I will skip any definitions without an MSC class specified, as they cause more problems. Or should I instead link them to an arbitrary top-level class, e.g. 00-XX for general?

(There are a lot of broken metadata fields on the PM site right now, by the way; I am currently making my indexer robust against them. I suspect I should then reinforce LaTeXML as well so that it does not produce garbage metadata.)

dginev commented 11 years ago

Some synonyms use TeX's math syntax to try to specify partial or entire math expressions as synonyms. I have updated my indexer to ignore such entries for now; in the long run we should convert them to MathML via LaTeXML so that they are ready to be indexed as MathML. But that's a late summer task.
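A filter of this kind could look roughly like the sketch below. The thread does not say how the indexer recognises TeX math, so the character test here is a guessed heuristic (dollar sign, backslash, caret, or underscore anywhere in the term), and `is_plain_text_term` is a hypothetical name.

```python
import re

# Heuristic assumption: any of $ \ ^ _ suggests TeX math markup in the term.
TEX_MATH_CHARS = re.compile(r"[$\\^_]")


def is_plain_text_term(term):
    """True if the term contains none of the characters typical of TeX math."""
    return TEX_MATH_CHARS.search(term) is None


synonyms = ["isosceles", "$n$-gon", r"\mathbb{R}", "x^2"]
indexable = [s for s in synonyms if is_plain_text_term(s)]
print(indexable)  # ['isosceles']
```

The heuristic errs on the side of skipping: a plain-text term that happens to contain an underscore would be dropped as well.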

holtzermann17 commented 11 years ago

Rather than skipping definitions without MSC, how about creating a category called XX-XX and assigning them to that?

dginev commented 11 years ago

It needs to be a real category, otherwise it would definitely confuse the disambiguation mechanism... But I agree that it would be nice not to lose concepts because of missing classification.

dginev commented 11 years ago

Then again, the disambiguation can just have a custom rule that checks for XX-XX, which unambiguously marks "no category", so... OK, I accept your suggestion.
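As a rough illustration of the agreed fallback, the sketch below strips the `msc:` prefix seen in the dct:subject values and substitutes the placeholder class when nothing is left. The names `NO_CATEGORY`, `msc_classes`, and `is_unclassified` are invented for this example; how NNexus actually stores and disambiguates categories is not shown in this thread.

```python
NO_CATEGORY = "XX-XX"  # placeholder class for articles without any MSC code


def msc_classes(subjects):
    """Map dct:subject values like 'msc:51-00' to codes, with an XX-XX fallback."""
    prefix = "msc:"
    classes = [s[len(prefix):] for s in subjects if s.startswith(prefix)]
    return classes or [NO_CATEGORY]


def is_unclassified(classes):
    """The custom rule: XX-XX means 'no category', not a real MSC class."""
    return classes == [NO_CATEGORY]


print(msc_classes(["msc:51-00", "msc:49J20"]))  # ['51-00', '49J20']
print(msc_classes([]))                          # ['XX-XX']
```

With this shape, a concept is never lost for lack of classification, and the disambiguation step can special-case `is_unclassified` instead of treating XX-XX as a class to compare against.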

dginev commented 11 years ago

Btw, I am currently indexing PlanetMath.org, so let me know if it bogs down the server too much - if so I will space out my requests.
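Spacing out requests can be as simple as sleeping between fetches. The sketch below is a generic illustration, not the NNexus crawler: `crawl_politely` is a hypothetical name, the `fetch` callable is supplied by the caller, and the one-second default delay is an arbitrary assumption.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, one pause before each later one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Dry run with a stand-in fetch and a recording sleep, so nothing hits the network.
pauses = []
pages = crawl_politely(
    ["http://planetmath.org/isoscelestriangle", "http://planetmath.org/triangle"],
    fetch=lambda url: "fetched " + url,
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(pauses)  # [2.0]
```

Passing `sleep` as a parameter keeps the function testable without waiting; a real run would use the default `time.sleep`.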

holtzermann17 commented 11 years ago

I noticed it was a little slower than usual but not SO bad. It reminded me to ask Constantin about some JavaScript fixes (https://github.com/KWARC/planetary/issues/356).

kohlhase commented 11 years ago

On 18.4.13 21:00, Deyan Ginev wrote:

(there is a lot of broken metadata fields on the PM site right now btw, currently making my indexing robust to guard against them. I suspect I should then reinforce LaTeXML as well to not produce garbage metadata)

Is there any thought of running a bot over PM that fixes metadata? If I remember correctly, Wikipedia does something like this. Correcting the source would have the advantage that the author (or a maintainer) can fix things if the bot gets it wrong.

Michael


dginev commented 11 years ago

OK, I think the scheme I now support has some sanity. Things that need to be addressed in the future:

But for now I am happy: I am almost done indexing PlanetMath and am closing the ticket.