sweble / sweble-wikitext

The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaWiki.
http://sweble.org/sites/swc-devel/develop-latest/tooling/sweble/sweble-wikitext
70 stars 27 forks source link

Don't throw exepction when an alias is registered #73

Closed tgalery closed 6 years ago

tgalery commented 6 years ago

Connects to #72 , the main problem is that some parsing functions expect specific aliases to be registered within the wikiconfig, e.g. the name space parsing function expects ns. Now, in some cases these aliases are not registered because a different alias was registered to the same name. This PR allows for multiple alias registration, which in turn allows for the parsing functions to be configured properly. From the interpreter (previously broken)

scala> import org.sweble.wikitext.engine.utils.LanguageConfigGenerator
import org.sweble.wikitext.engine.utils.LanguageConfigGenerator

scala> val config = LanguageConfigGenerator.generateWikiConfig("ja")
The name `sub' was already registered by the alias `sub' but we are also registering it for  `img_sub'.
The name `名前空間:' was already registered by the alias `namespace' but we are also registering it for  `ns'.
The name `noreplace' was already registered by the alias `shortdesc_noreplace' but we are also registering it for  `defaultsort_noreplace'.
The name `noerror' was already registered by the alias `defaultsort_noerror' but we are also registering it for  `displaytitle_noerror'.
The name `noreplace' was already registered by the alias `defaultsort_noreplace' but we are also registering it for  `displaytitle_noreplace'.
config: org.sweble.wikitext.engine.config.WikiConfig = org.sweble.wikitext.engine.config.WikiConfigImpl@523072fe
hannesd commented 6 years ago

Sweble Engine uses the SLF4J framework. See WtEngineImpl for examle. Please use logging instead of System.out/err.

tgalery commented 6 years ago

@hannesd thanks, I've switched to the logging framework. I'm trying to get some pointers with regards to the changes that need to be done in order to implement the idea above. In this map we would have a one-to-one relationship between an aliasId and an alias, so that would be ok to stay as is. The variable that would need changing would be this one, probably to a HashMap<String, Set<I18nAliasImpl>>, right ?

hannesd commented 6 years ago

So here are my thoughts: An alias in Sweble (and of course I only tried to mimic MediaWiki's behavior) is an ID associated with a list of names (the aliases). A parser function or magic word can "listen" to any of the aliases associated with the ID of such an alias definition. That's why every parser function in Sweble has to declare an ID which is the ID of an alias definition.

That's also the purpose of the two maps. One (aliasesById) holds the alias IDs for lookup when a parser function tries to find its alias definition. The other map (nameToAliasMap) maps the alias names to the alias (ID). So the aliasesById map is really only used during setup when parser functions try to find their alias definition. nameToAliasMap is used during parsing when a page switch or parser function is encountered. First the name is resolved to an alias ID and then Sweble looks for the parser function or page switch, etc. which "listens" for this alias definition.

And here's what is not clear to me: If two different alias IDs are associated with the same alias name, then what is supposed to happen when the parser stumbles onto that alias name? It resolved to two different alias IDs and those alias IDs might be connected with different parser functions, page switches, etc. Which parser function should we call?

The pull request actually resolves this problem by overwriting any previous uses of the alias. But that, to me, somehow implies that the configuration is faulty (which is why the code throws an error).

One could associate an alias name with multiple alias IDs but then I don't know how to resolved those multiple IDs to only one parser function to call.

tgalery commented 6 years ago

@hannesd @mawiesne I think there's a simpler solution. Looking at at the magic words config that generates the wikiconfig https://ja.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=magicwords&format=xml, the bug can be described as follows:

1 - alias 名前空間 is registered for the NAMESPACE id 2 - alias 名前空間 appears again for the ns id (things are slightly more complicated because for the second is appears followed by a :, but there's some suffix modifying code that adds the colon under certain circumstances) 3 - Since we throw an exception here when detecting the type in conflict in (2) and the 名前空間 magic word is the first alias of the ns id, the whole id and associated aliases are not added to the maps. 4 - When adding parsing functions, we explicitly expect one capable of parsing one for ns. Since there's no one that can be found in the maps, an exception is thrown.

A better way is not to thrown an exception and just skip the ambiguous identifier. In this case 名前空間 would be associated with the NAMESPACE id, but not the ns id, but we would still have a working config and all the other non ambiguous aliases would be registered to the right ids.

Would that solution be ok (I've changed the name of the PR to reflect it). I will add some basic log statements to facilitate the debug together with the minor code changes. Would that be ok ?

hannesd commented 6 years ago

Thanks for sharing a link to the ja wikiconfig. Now things look a bit different. After all, there is no conflict in the config. One alias has the ':' suffix while the other does not have that suffix => no collision. The reason why a collision happens in Sweble code is because the code in LanguageConfigGenerator removes the colon under certain conditions. Unfortunately I have not written the LanguageConfigGenerator code and I have no idea why that prefix/postfix mangling is necessary. So now my feeling is that WikiConfig is actually doing the right thing. The problem lies with LanguageConfigGenerator...

tgalery commented 6 years ago

From the scala console, I get this

elcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102).
Type in expressions for evaluation. Or try :help.

scala> import org.sweble.wikitext.engine.utils.LanguageConfigGenerator
import org.sweble.wikitext.engine.utils.LanguageConfigGenerator

scala> val config = LanguageConfigGenerator.generateWikiConfig("ja")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
config: org.sweble.wikitext.engine.config.WikiConfig = org.sweble.wikitext.engine.config.WikiConfigImpl@6c5d19d6

which is ok but maybe means that some extra language configuration is needed.

hannesd commented 6 years ago

The Engine assumes that whoever uses it will configure logging properly. So that's expected behavior.

tgalery commented 6 years ago

I agree with the position, but I just think that throwing an exception at that particular branch is a bit unnecessary and maybe too harsh for the purpose of coming up with the best config for parsing a non English wiki.

hannesd commented 6 years ago

I don't mind replacing the exception with a warning log message. But this has to be done everywhere in the config code, otherwise it would just be a hack to get rid of this one very special case. And of course it will only suppress the exception. The loaded config will still be incorrect since the registration for the 'ns:' parser function is then missing. Did I understand you correctly in this regard?

tgalery commented 6 years ago

@hannesd if I understand you correctly, I do agree that there's the need for adding better logging and revisiting the way exceptions are thrown in the language configurator, but that would be probably be better tackled as a separate issue / PR . With regards to the parser function, this PR does fix the issue. The logic here is this: since 名前空間 is an alias associated with the NAMESPACE id, next time the configurator sees 名前空間 associated with the ns id, it will simply log the warning (instead of throwing an exception) and continue. This means that every other alias associated to ns will be properly configured as well as the parsing function that expects ns to be properly set. Does this make sense ?

hannesd commented 6 years ago

I guess it will happen this way but it's just luck, right? Even if the remaining aliases will be registered correctly, the one colliding alias still does not get registered when it really should. You're just lucky that you apparently do not need that alias in your specific use case, in general, however, the behavior is still broken.

tgalery commented 6 years ago

I'm not sure I understand. Bare in mind that the coliding alias does get registered, for the NAMESPACE id. AFAIK the only thing that the parser function expects is having a ns id, with some alias registered to it, which this PR allows to happen for the edge cases of two ids having the same alias (luck or not). So to my mind, it's an improvement of how things are currently. However, if there's a better way to tackle the problem, I'm happy to give that a go provided I have some direction. As far as use case goes, all we need is the ability to instantiate non-english wikiconfig that allows for better parsing of non english wikipedias by dkpro. The issue that mirror this one there is this https://github.com/dkpro/dkpro-jwpl/issues/159 .

hannesd commented 6 years ago

Well I'd prefer if the problem could be solved at the root cause (the prefix postfix magic in the config loader) but I'm traveling again and I think it's ok for now to improve the situation even if it's not really a fix.

hannesd commented 6 years ago

I'll merge it when I'm back.

tgalery commented 6 years ago

@hannesd many thanks for this as it will unblock a potential dkpro release. I totally understand where you are coming from and if you want I can create a new issue and document what is the problem with ambiguous alias that we have been discussing here. Let me know what's the best way to proceed.

mawiesne commented 6 years ago

@tgalery From my perspective, the changes are a first cure, given the title of this issue here now. @hannesd Thanks for your feedback. I think this PR can be merged safely as it does not break things and gives more control to JWPL on top of Sweble.

tgalery commented 6 years ago

@hannesd thanks for approving the PR, is there any chance a release could be pushed to maven or somewhere, so we can update the sweble version in dkpro ?

hannesd commented 6 years ago

Version 3.1.8 should now be available on maven central

tgalery commented 6 years ago

hi @hannesd , there's something funny going on with the release. It seems that the language configurator is broken again.

I've been testing this with my local version of the branch I submitted to this PR and I get this.

val config = LanguageConfigGenerator.generateWikiConfig("ja")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
config: org.sweble.wikitext.engine.config.WikiConfig = org.sweble.wikitext.engine.config.WikiConfigImpl@561a2b89

Now once I update to version 3.1.8, I get this

val config = LanguageConfigGenerator.generateWikiConfig("ja")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
java.lang.IllegalArgumentException: No alias registered for parser function `convert'.
  at org.sweble.wikitext.engine.config.WikiConfigImpl.addParserFunction(WikiConfigImpl.java:451)
  at org.sweble.wikitext.engine.config.WikiConfigImpl.addParserFunctionGroup(WikiConfigImpl.java:432)
  at org.sweble.wikitext.engine.utils.DefaultConfig.addParserFunctions(DefaultConfig.java:584)
  at org.sweble.wikitext.engine.utils.LanguageConfigGenerator.generateWikiConfig(LanguageConfigGenerator.java:119)
  at org.sweble.wikitext.engine.utils.LanguageConfigGenerator.generateWikiConfig(LanguageConfigGenerator.java:83)
  at org.sweble.wikitext.engine.utils.LanguageConfigGenerator.generateWikiConfig(LanguageConfigGenerator.java:68)
  ... 42 elided

Any other change during the release process ?

tgalery commented 6 years ago

Maybe in this commit https://github.com/sweble/sweble-wikitext/commit/6756185531c0f6820a74f85db4b08221f5ec9159 ?

mawiesne commented 6 years ago

@tgalery I think we'd have to check the code fragment in JWPL that does not catch/handle this IllegalArgumentException properly? I will try to reproduce it and if reproduced open an issue for this over at JWPL.

But: I might also be wrong due to Jetlag effects... (: