sweble / sweble-wikitext

The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaWiki.
http://sweble.org/sites/swc-devel/develop-latest/tooling/sweble/sweble-wikitext

Lack of documentation #68

Closed michmille closed 6 years ago

michmille commented 7 years ago

Good evening! I've gotten hold of your project but haven't managed to use it yet, since it lacks some documentation. What is the Nexus class for, for example? If full documentation would be too tedious or take too long to write, then a short description of each class would already help, or a list of the design patterns used.

hannesd commented 7 years ago

The use of the Nexus class (I assume you mean this class: org.sweble.wikitext.articlecruncher.Nexus) is illustrated in the swc-example-dumpcruncher code. Sadly I do not have the time to improve the documentation but we are always happy to accept pull requests.

Back when I came up with this code I thought that processing huge dumps using multiple threads was a good idea. Now I think that splitting up the input data and simply starting multiple single-threaded processes (possibly on multiple machines) is a far better way to process large amounts of data. So unless multiple processes are simply not an option for you, I recommend no longer using the swc-article-cruncher module.
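The split-and-run-single-threaded approach described above can be sketched as follows. This is a minimal, hypothetical illustration (the chunking helper and class names are not part of sweble): partition the pages into one chunk per worker, where in a real setup each chunk would be written to its own dump file and handed to a separate single-threaded JVM process, e.g. via `ProcessBuilder`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split the input into chunks, one per
 *  single-threaded worker (in reality, one JVM process per chunk). */
public class ChunkedCruncher {

    /** Distribute items round-robin into one chunk per worker. */
    static <T> List<List<T>> split(List<T> items, int workers) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            chunks.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            chunks.get(i % workers).add(items.get(i));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> pages = List.of("Page A", "Page B", "Page C", "Page D", "Page E");
        // Each chunk would become its own dump file, processed by a
        // separate single-threaded process (possibly on another machine).
        List<List<String>> chunks = split(pages, 2);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("worker " + i + " -> " + chunks.get(i));
        }
    }
}
```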

michmille commented 7 years ago

Yes, that class. Thank you for the tips. Can you help me with a question?

Am I correct in believing that when running swc-example-dumpcruncher, the point at which a page is first taken out of the dump file and loaded into a variable (PageType or otherwise) is org.sweble.wikitext.example.RevisionProcessor's process(Job job) method?

hannesd commented 7 years ago

I'm not quite sure what you mean by "first taken out of the dump file and loaded into a variable". PageType (and other similar classes) are automatically generated from the MediaWiki schema definition by JAXB. Instances of this class are therefore created by the JAXB framework. Since Wikipedia dumps are huge and JAXB can only read a whole XML file at a time, I had to apply some AspectJ magic, which can be found in one of the .aj aspects in the swc-dumpreader project. This makes the code quite hard to read...
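To illustrate the streaming idea in general terms: the JDK's StAX API reads XML event by event instead of binding the whole document at once. This is only a generic sketch of that streaming pattern (it is not sweble's AspectJ-instrumented JAXB mechanism, and the class and element names here are made up for the example):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Generic StAX streaming example: count <page> elements without
 *  loading the whole document into memory. Illustrative only; not
 *  the mechanism sweble actually uses. */
public class StreamingDumpExample {

    static int countPages(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int pages = 0;
        // Pull events one at a time; memory use stays constant
        // no matter how large the input is.
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "page".equals(reader.getLocalName())) {
                pages++;
            }
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        String dump = "<mediawiki><page/><page/><page/></mediawiki>";
        System.out.println(countPages(dump)); // prints 3
    }
}
```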

The instances created by JAXB are handed to DumpReader.processPage(), which is overridden in DumpReaderJobGenerator. There the read revisions are wrapped in RevisionJob objects, which you then receive in the RevisionProcessor.process() method. The example takes the wikitext string from the RevisionJob object and parses it.
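The wrap-then-process flow described above can be sketched with simplified stand-in classes. The names mirror the real RevisionJob/RevisionProcessor classes, but the bodies here are illustrative placeholders, not the actual sweble implementation:

```java
/** Simplified stand-in for the pipeline: revisions read from the dump
 *  are wrapped in job objects, and a processor unwraps the wikitext
 *  and handles it. Illustrative only; not the real sweble classes. */
public class PipelineSketch {

    /** Stand-in for the RevisionJob wrapper created by the job generator. */
    static class RevisionJob {
        final String wikitext;
        RevisionJob(String wikitext) { this.wikitext = wikitext; }
    }

    /** Stand-in for the example's RevisionProcessor: the real
     *  process() would hand the wikitext to the sweble parser
     *  and return the resulting AST. */
    static class RevisionProcessor {
        String process(RevisionJob job) {
            // Real code would parse job.wikitext and return an AST here.
            return "parsed: " + job.wikitext;
        }
    }

    public static void main(String[] args) {
        RevisionJob job = new RevisionJob("'''Hello''' world");
        System.out.println(new RevisionProcessor().process(job));
    }
}
```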

michmille commented 7 years ago

I meant the point at which PageType is first populated by the extracted pages, which you just answered.

I believe RevisionProcessor.process() returns the AST of a given page; would you know at which point the XWML of that page is generated?

hannesd commented 7 years ago

RevisionProcessor.process() does not generate XWML. It parses wikitext and returns the resulting AST. If you want XWML you have to add the sweble-wom3-swc-adapter dependency to your project and call org.sweble.wom3.swcadapter.AstToWomConverter.convert(...) to get XWML.

michmille commented 7 years ago

On swc-engine, class org.sweble.wikitext.engine.WtEngineImpl, can you help me understand how expansion works? According to Design and Implementation of the Sweble Wikitext Parser [1], figure 6 (page 6), it appears that Sweble uses a cache to store pre-processed templates. Where is this cache located in the code? Are all templates stored there, and if not, which ones are? If a page is being processed and the templates it contains haven't been processed yet, is the result after expansion the unexpanded page?

[1] http://dirkriehle.com/uploads/2011/07/diwp.pdf

hannesd commented 6 years ago

Sorry, I've missed your question. Is it still relevant or should I close it?