Open desb42 opened 5 years ago
Thanks for the commit. I took a quick look at it just now, but GitHub clobbered the diff: https://github.com/desb42/xowa/commit/af19cf83209ba765fa6be0157e23118856a8ac70
The main part seems to be
    // Multiple prefixes may abut each other for nested lists.
    while (cur_pos < src_len) {
        byte b = src[cur_pos];
        if (b == Byte_ascii.Star || b == Byte_ascii.Hash || b == Byte_ascii.Semic || b == Byte_ascii.Colon) {
            cur_pos++;
        }
        else
            break;
    }
Let me look at it a little more later
It's slightly more than that. There is a big section in Xoh_html_wtr.java delimited by
// -------------------------------
where most of the work is done
Ah, missed that. It looks like you ported all the code in https://github.com/wikimedia/mediawiki/blob/master/includes/parser/BlockLevelPass.php#L190
Which is pretty cool. That's what I was planning to do, and will ultimately be the direction of all XOWA parser code (abandon the custom DOM structure and replicate what MediaWiki does, only in Java)
[Sorry, premature comment]
I actually tried reproducing a lot of the code. The above part is here already: https://github.com/gnosygnu/xowa/blob/master/gplx.xowa.mediawiki/src/gplx/xowa/mediawiki/includes/parsers/XomwBlockLevelPass.java#L195 . There are a bunch of similar parallel code blocks in gplx.xowa.mediawiki. I just haven't integrated them yet into the main XOWA project
This is something I'd like to do, but I'm still a little wary about changing too much at the moment. Let me think about doing some incremental replacements and seeing if I can co-opt some parts.
Thanks.
As part of #417 I mentioned that the \
\- \
- logic did not seem right
I have been looking at mediawiki\includes\parser\BlockLevelPass.php to see how it's done, and have a suggested change in my branch lists_new
My idea was to make the tokeniser just count the number of list prefix characters (hence no limit on how deeply lists can nest) and let the HTML generator do the work
Since XOWA has already tokenised the various HTML elements, the complexity in the PHP code (with all the regexes) is unnecessary.
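To make the idea concrete, here is a minimal, self-contained Java sketch of the split described above: a scanner that only counts prefix characters, and a generator that compares each line's prefix with the lists currently open, roughly in the spirit of BlockLevelPass.php. It is not the code from the lists_new branch or from Xoh_html_wtr.java. The class name ListPrefixSketch and the methods countPrefix, sync and render are hypothetical, and plain char literals stand in for the Byte_ascii constants.

```java
import java.nio.charset.StandardCharsets;

public class ListPrefixSketch {
    // Tokeniser side: count consecutive list-prefix bytes (* # ; :) from pos.
    // There is deliberately no upper limit on the count.
    public static int countPrefix(byte[] src, int pos) {
        int cur = pos;
        while (cur < src.length) {
            byte b = src[cur];
            if (b == '*' || b == '#' || b == ';' || b == ':') cur++;
            else break;
        }
        return cur - pos;
    }

    static String listTag(char c) { return c == '*' ? "ul" : c == '#' ? "ol" : "dl"; }
    static String itemTag(char c) { return c == ';' ? "dt" : c == ':' ? "dd" : "li"; }

    // Generator side: 'open' holds one prefix char per currently open list level.
    static void sync(StringBuilder out, StringBuilder open, String prefix) {
        // Length of the shared prefix; ';' and ':' both map to <dl>, so they match.
        int common = 0;
        while (common < open.length() && common < prefix.length()
                && listTag(open.charAt(common)).equals(listTag(prefix.charAt(common)))) {
            common++;
        }
        // Close every level deeper than the shared part.
        while (open.length() > common) {
            char c = open.charAt(open.length() - 1);
            out.append("</").append(itemTag(c)).append("></").append(listTag(c)).append(">");
            open.setLength(open.length() - 1);
        }
        // Same depth as the shared part: end the old item and start the next one.
        if (prefix.length() == common && common > 0) {
            char old = open.charAt(common - 1);
            char cur = prefix.charAt(common - 1);
            out.append("</").append(itemTag(old)).append("><").append(itemTag(cur)).append(">");
            open.setCharAt(common - 1, cur);
        }
        // Open any new, deeper levels.
        for (int i = common; i < prefix.length(); i++) {
            char c = prefix.charAt(i);
            out.append("<").append(listTag(c)).append("><").append(itemTag(c)).append(">");
            open.append(c);
        }
    }

    public static String render(String[] lines) {
        StringBuilder out = new StringBuilder();
        StringBuilder open = new StringBuilder();
        for (String line : lines) {
            byte[] src = line.getBytes(StandardCharsets.UTF_8);
            int len = countPrefix(src, 0);
            // Prefix bytes are ASCII, so the byte count equals the char count here.
            String prefix = line.substring(0, len);
            sync(out, open, prefix);
            out.append(line.substring(len).trim());
            if (len == 0) out.append('\n'); // non-list line: plain text
        }
        sync(out, open, ""); // close whatever is still open at end of input
        return out.toString();
    }
}
```

With this sketch, render(new String[]{"* a", "** b", "* c"}) produces <ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>, and the lines ";term" and ":def" produce <dl><dt>term</dt><dd>def</dd></dl>. Nesting depth is bounded only by the length of the prefix, which is the point of letting the tokeniser count and nothing more.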