spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License
768 stars, 129 forks

Support <ref group="note"> #85

Open fractalien opened 6 years ago

fractalien commented 6 years ago

In the London article (en), there is a <ref group="note">...</ref> section, and the parser keeps both its opening tag and its content.

dmh-cs commented 6 years ago

My understanding is that for this to work without breaking anything, we'll have to drop the XML after parsing the citations. But that seems complicated, since that is done inside the section-processing step.

spencermountain commented 6 years ago

yeah, that's correct. we're getting burned by order-of-operations stuff all-over the place. @fractalien i can't find it, am I missing something?

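const wtf = require('wtf_wikipedia');

wtf.fetch('London', 'en', function(err, doc) {
  // look for a leftover <ref ...> opening tag in the plain-text output
  console.log(doc.plaintext().match(/<ref/));
});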

cheers

spencermountain commented 5 years ago

closing until this can be reproduced

niebert commented 5 years ago

A workaround for this could be to replace the REF XML tags with a kind of text token, such as ___REF_GROUP_note____. The parser will treat the token as ordinary text, so it survives even in the plain-text output.

<ref group="note">

Even in the final plain text, the citations can then be replaced with "[1]" without any need to alter kill_xml(). Of course, this is a hack compared with generating an AST node for the data that we want to preserve. On the other hand, tokens of this type will not conflict with any other parsing step. We could replace the underscore with another character to wrap the ref data we want to preserve, as long as it does not create a conflict with existing syntax in the wiki source.
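A minimal sketch of this token idea, for illustration only. The helper names `protectGroupRefs` and `restoreGroupRefs` are hypothetical and not part of wtf_wikipedia, and the token carries an extra index (an assumption beyond the format suggested above) so that each note can be looked up again later. Group names containing underscores or spaces are not handled.

```javascript
// Hypothetical helpers; not part of wtf_wikipedia's API.

// Replace each <ref group="...">...</ref> with a plain-text token
// before the wikitext reaches the parser, and remember the note body.
function protectGroupRefs(wiki) {
  const notes = [];
  const re = /<ref\s+group\s*=\s*"?([^">\s]+)"?\s*>([\s\S]*?)<\/ref>/gi;
  const text = wiki.replace(re, (_, group, body) => {
    notes.push({ group, body: body.trim() });
    // assumed token format: ___REF_GROUP_<group>_<index>___
    return `___REF_GROUP_${group}_${notes.length - 1}___`;
  });
  return { text, notes };
}

// After the parser has produced plain text, swap tokens for "[n]" markers.
function restoreGroupRefs(plain) {
  return plain.replace(
    /___REF_GROUP_[^_\s]+_(\d+)___/g,
    (_, i) => `[${Number(i) + 1}]`
  );
}

const { text, notes } = protectGroupRefs(
  'London<ref group="note">See City of London.</ref> is a city.'
);
console.log(text);
console.log(restoreGroupRefs(text));
console.log(notes);
```

Because the token contains no angle brackets, a step like kill_xml() should leave it alone, which is the point of the workaround.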

That is, if Spencer is OK with implementing such a workaround and preserving the current order of parsing and killing XML. cheers

spencermountain commented 5 years ago

ah, sorry I misunderstood this issue. I didn't know wikipedia had a special thing for references-as-notes.

yeah, niebert's strategy of doing this a priori on the string would work. You could also pre-match them, and store offsets somehow. I have been shy about doing either, as we're throwing around and changing wikitext all over the place.
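A rough sketch of the pre-match-and-store-offsets alternative, under the assumption that it runs on the raw wikitext before any other step touches it. `findGroupRefs` is a hypothetical name, not a wtf_wikipedia function.

```javascript
// Hypothetical helper; not part of wtf_wikipedia's API.
// Record where each grouped ref sits in the raw wikitext.
function findGroupRefs(wiki) {
  const re = /<ref\s+group\s*=\s*"?([^">\s]+)"?\s*>([\s\S]*?)<\/ref>/gi;
  return Array.from(wiki.matchAll(re), (m) => ({
    group: m[1],
    body: m[2],
    start: m.index, // offset of "<ref" in the original string
    end: m.index + m[0].length, // offset just past "</ref>"
  }));
}

const sample = 'London<ref group="note">See City of London.</ref> is a city.';
console.log(findGroupRefs(sample));
```

The offsets are only valid against the exact string they were computed from, so any later step that rewrites the wikitext invalidates them. That is the concern raised above.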

There's no problem parsing these notes, storing them in doc.references, and rendering them somehow. Happy to do it.

fractalien commented 5 years ago

I'm sorry for the late reply – only now got to work on my project again. The tag is still there, but it's cleanly absent from the sentences' text now. Thanks for whichever other measure fixed it!