More commits to statistics.py

lahwaacz / wiki-scripts

Framework for writing bots, maintenance scripts or performing data analysis on wikis powered by MediaWiki

http://lahwaacz.github.io/wiki-scripts/

GNU General Public License v3.0

More commits to statistics.py #8

Closed kynikos closed 9 years ago

kynikos commented 9 years ago

Here are more improvements to statistics.py. I've taken a look at your statistics branch and I like it, except for 1231470e2cfb31333e132ff30e16d7232605df92, which to me seems to complicate the code and the wiki page for no clear advantage. As you've written in the commit message, I think we'd better wait until mwparserfromhell supports tables, but of course I'm available to discuss it further :)

Now, I had to resolve some conflicts in b548e059ea9a1c52172fb16f5b12bc091221b612, so the commit has been attributed to me even though you are the actual author. If you want to save time by just merging all these commits without resolving the conflicts again, please remember to recommit b548e059ea9a1c52172fb16f5b12bc091221b612 (and maybe e10c61c589fca384f1c816000fd509e39a04dfda if you want) with your credentials!

kynikos commented 9 years ago

Update: I've re-merged my commits with --no-ff for consistency with the other merges, so you may want to recommit 658929334436c9f70fb494db9ca26feb4fe59586 as well.

lahwaacz commented 9 years ago

I think that identifying the entry points (tables, introductory paragraphs etc.) by ID is the only reliable way, especially if we want to add more tables to the page. Some tables/sections might have multiple introductory paragraphs and ID-based matching allows adding manual text anywhere in the section without the need to update the script.

The code could be further cleaned up by introducing a get_node_by_id(wikicode, id, tag=None) helper function, which would return the node in wikicode with the given id (optionally restricting the match to the given tag).
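For illustration only, here is a minimal sketch of what such a helper might look like. It is not taken from the statistics branch; it assumes that wikicode is an mwparserfromhell Wikicode object (for example the result of mwparserfromhell.parse(text)) and that the entry points are HTML tags carrying an id attribute. Returning the first match, and None when nothing matches, are assumptions as well.

```python
def get_node_by_id(wikicode, id, tag=None):
    """Return the first tag node in `wikicode` whose id attribute equals `id`.

    Sketch only: `wikicode` is assumed to be an mwparserfromhell Wikicode
    object. If `tag` is given, only tags with that name (e.g. "span" or
    "div") are considered. Returns None when no node matches.
    """
    for node in wikicode.ifilter_tags(recursive=True):
        # Tags without an id attribute cannot be entry points.
        if not node.has("id"):
            continue
        # Optionally restrict the match to one tag name (case-insensitive).
        if tag is not None and str(node.tag).strip().lower() != tag.lower():
            continue
        # Attribute values are Wikicode objects, so compare them as strings.
        if str(node.get("id").value).strip() == id:
            return node
    return None
```

With a parsed page, a call such as get_node_by_id(wikicode, "intro", tag="span") would then pick out the span marked as an entry point; the id "intro" is a made-up example, not an id used on the actual wiki page.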

kynikos commented 9 years ago

You say that "identifying the entry points [...] is the only reliable way", but the sentence seems a bit incomplete to me: I think it's missing the problem that entry points are meant to solve, which IMHO, at least at the moment, doesn't exist :) I think the whole point of using a wikitext parser like mwparserfromhell is to be able to isolate sections of the source text without having to add explicit entry points. If mwparserfromhell isn't enough for our goals, then we do need entry points, but in that case there's an IMHO much cleaner solution: HTML comments, like the ones I'm using in e.g. Table_of_contents. They are much easier to retrieve from the source text, without requiring a special parser that has to find HTML tags, select the spans, find their attributes, pick the right id etc., all done with expensive regular expressions... HTML comments can simply be found as normal substrings, with the additional advantage that they don't modify the rendered HTML page.
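As a rough sketch of the HTML-comment approach described above (not the actual Table_of_contents code), the text between two comment markers can be replaced with plain substring searches. The function name and the marker strings below are hypothetical and chosen only for this example.

```python
def replace_between_markers(text, start_marker, end_marker, new_content):
    """Replace whatever lies between two HTML comment markers in `text`.

    The markers themselves are kept, so the script can find them again on
    the next run. Raises ValueError if either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    content_start = start + len(start_marker)
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, content_start)
    if end == -1:
        raise ValueError("end marker not found")
    return text[:content_start] + new_content + text[end:]


# Hypothetical markers; any unique comment strings would work.
START = "<!-- STATS TABLE START -->"
END = "<!-- STATS TABLE END -->"

page = "Intro text.\n" + START + "\nold table\n" + END + "\nManual notes."
updated = replace_between_markers(page, START, END, "\nnew table\n")
```

Text outside the markers, such as manually written paragraphs, is left untouched, and the comments do not appear in the rendered page.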

lahwaacz commented 9 years ago

OK, I'll leave 4ecd286f0c8a5cf663a3c0629984f09838af47a1 in the statistics branch in case it is needed in the future...

I've also merged your branch manually, which GitHub didn't notice, so I'm closing this.