Closed dginev closed 11 years ago
Oh, there are many examples of articles that define concepts but don't use `pm:defines`, as they seemingly assume their `pm:title` will be indexed as a concept name. Here is one example.
@dginev - `pm:title` as defined term is indeed the legacy way of thinking about things. `pm:defines` is for extra definitions that aren't equivalent to the title, whereas `pm:synonym` is for extra terms that are equivalent to the title.
In short: Your plan in bullet points above does seem like the right thing to do.
I am working on that... another addition is that I will skip any definitions without an MSC class specified, as they cause more problems. Or should I instead link them to an arbitrary top-level class, e.g. 00-XX for general?
(there are a lot of broken metadata fields on the PM site right now btw; I am currently making my indexing robust against them. I suspect I should then reinforce LaTeXML as well, so that it does not produce garbage metadata)
Some synonyms use TeX's math syntax to try to specify partial or entire math expressions as synonyms. I have updated my indexer to ignore such entries for now; in the long run we should convert them to MathML via LaTeXML and have them ready to be indexed as MathML. But that's a late summer task.
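As a rough illustration of the "ignore such entries" step, here is a minimal sketch of a filter that drops synonyms containing TeX math. The function names and the detection heuristic (math delimiters or any backslash control sequence) are assumptions made for this example, not the actual NNexus indexer code.

```python
import re

# Hypothetical heuristic: a term "looks like TeX math" if it contains a
# math delimiter ($, \( or \[) or any backslash control sequence.
_TEX_MATH = re.compile(r"\$|\\\(|\\\[|\\[A-Za-z]+")


def is_plain_text_term(term: str) -> bool:
    """Return True if the term shows no sign of TeX math markup."""
    return _TEX_MATH.search(term) is None


def indexable_synonyms(synonyms):
    """Keep only non-empty synonyms that can be indexed as plain text."""
    return [s.strip() for s in synonyms if s.strip() and is_plain_text_term(s)]
```

The skipped entries are simply left out of the index; converting them to MathML would be a separate step.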
Rather than skipping definitions without MSC, how about creating a category called `XX-XX` and assigning them to that?
It needs to be a real category, otherwise it would definitely confuse the disambiguation mechanism... But I agree that it would be nice not to lose concepts because of missing classification.
Then again the disambiguation can just have a custom rule that inspects the XX-XX and it is unambiguously marking up "no category", so... OK, I accept your suggestion.
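The agreed approach could look roughly like the sketch below: concepts without an MSC class get the `XX-XX` placeholder, and the category comparison has a custom rule that treats `XX-XX` as "no category". The scoring function is a toy stand-in invented for this example (it counts shared two-character MSC prefixes); it is not the real disambiguation mechanism.

```python
NO_CATEGORY = "XX-XX"  # placeholder class suggested in this thread


def normalize_classes(msc_classes):
    """Return the cleaned MSC classes, or the placeholder if none are given."""
    cleaned = [c.strip() for c in (msc_classes or []) if c and c.strip()]
    return cleaned or [NO_CATEGORY]


def is_uncategorized(msc_classes):
    """True if the concept only carries the 'no category' placeholder."""
    return list(msc_classes) == [NO_CATEGORY]


def category_score(candidate_classes, context_classes):
    """Toy score (an assumption): number of shared top-level MSC prefixes."""
    if is_uncategorized(candidate_classes):
        # Custom rule: "no category" never counts as a category match,
        # so two XX-XX entries are not mistaken for being related.
        return 0
    candidate_prefixes = {c[:2] for c in candidate_classes}
    context_prefixes = {c[:2] for c in context_classes if c != NO_CATEGORY}
    return len(candidate_prefixes & context_prefixes)
```

Without the special case, two unrelated uncategorized concepts would share the prefix `XX` and look like a category match, which is the confusion raised above.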
Btw, I am currently indexing PlanetMath.org, so let me know if it bogs down the server too much - if so I will space out my requests.
I noticed it was a little slower than usual but not SO bad. It reminded me to ask Constantin about some JavaScript fixes (https://github.com/KWARC/planetary/issues/356).
On 18.4.13 21:00, Deyan Ginev wrote:
> (there is a lot of broken metadata fields on the PM site right now btw, currently making my indexing robust to guard against them. I suspect I should then reinforce LaTeXML as well to not produce garbage metadata)
Is there any thought of running a bot over PM that fixes metadata? If I remember correctly, Wikipedia does something like this. Correcting the source would have the advantage that the author (or a maintainer) can fix things if the bot gets it wrong.
Michael
Prof. Dr. Michael Kohlhase, Office: Research 1, Room 168 Professor of Computer Science Campus Ring 1, Jacobs University Bremen D-28759 Bremen, Germany tel/fax: +49 421 200-3140/-493140 skype: m.kohlhase
Ok, I think the scheme I have supported now has some sanity. Things that need to be addressed in the future:
But for now I am happy, I am almost done indexing PlanetMath and am closing the ticket.
I am reindexing PlanetMath.org with NNexus today, trying to get all issues sorted out. However, the metadata still seems shaky and I am not sure how to index it best. Here is a good example that summarizes most of my concerns:
http://planetmath.org/isoscelestriangle
We see the following metadata:
The article has `pm:title` _isosceles triangle_, and provides a synonym _isosceles_ to that extent. However, it also `pm:defines` two other concepts, namely _base angle_ and _vertex angle_.
I was firstly thinking to only index articles that have `pm:defines` but clearly, that would omit this article, which also defines its `pm:title`. Then again, there are articles that really don't define anything, such as this one which I don't want to index (as expected they don't have any `pm:defines`). But maybe I have to live with some junk in the index...
Is this sane:
- `pm:title` is a concept name
- [...] `pm:title` concept
- `pm:defines` get indexed as separate concepts with no synonyms.

That would cover the triangle article and I will have to live with the junk from the other article; it is in any case too specific to ever get linked against. We should have a metadata-curation initiative for the PM articles at some point.
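To make the proposed scheme concrete, here is a minimal sketch of how one article's metadata could be turned into index entries. The dictionary keys, the `Concept` class and the function name are hypothetical, and the handling of synonyms (attached to the title concept) is an assumption based on the description of `pm:synonym` elsewhere in this thread, since that part of the plan is incomplete here.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    synonyms: list = field(default_factory=list)
    source: str = ""


def concepts_from_metadata(meta):
    """Build index entries from one article's metadata.

    `meta` is a plain dict with hypothetical keys 'title', 'synonym',
    'defines' and 'url' standing in for the pm:* metadata fields.
    """
    url = meta.get("url", "")
    concepts = []
    title = (meta.get("title") or "").strip()
    if title:
        # The title is a concept name; synonyms are assumed to attach to it.
        concepts.append(Concept(title, list(meta.get("synonym", [])), url))
    for term in meta.get("defines", []):
        # Each pm:defines entry becomes a separate concept with no synonyms.
        concepts.append(Concept(term, [], url))
    return concepts


# The isosceles triangle article, using the values quoted in this thread.
example = {
    "url": "http://planetmath.org/isoscelestriangle",
    "title": "isosceles triangle",
    "synonym": ["isosceles"],
    "defines": ["base angle", "vertex angle"],
}
entries = concepts_from_metadata(example)
```

Under this sketch an article with a title but no `pm:defines` still yields one concept, which is the "junk" case accepted above.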