tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.
Other
90 stars 23 forks source link

Remove `Wtp.file_aliases` and simplify the `Wtp.__init__()` function #255

Closed xxyzz closed 4 months ago

xxyzz commented 4 months ago

wiktextract code could use str.startswith() and Wtp.namespace_prefixes() to replace the Wtp.file_aliases_re regex.

and remove some type annotations that type checker could find

I though the French namespace JSON file doesn't have the local "Fichier" string yesterday, but it has this value.

kristian-clausal commented 4 months ago

Because the File: and Image: tokens can contain whitespace (I tried out File : for example), a simple startswith isn't enough, exactly. A simple regex might be cheaper than a chain of .strips() and partitionings.

I also just committed a small bugfix to the file_alias code a couple of minutes ago before this, please check it out just in case.

kristian-clausal commented 4 months ago

I also don't understand your obsession with removing annotations when they act as perfectly good documentation.

xxyzz commented 4 months ago

I though the white spaces are removed in the regex patterns in the wiktextract.clean_py file? Then we could compare the link with namespace prefixes without the ":" character.

Usually the types could be found by type checking tools shouldn't be added, unless IDE tools can't show the type then we should add the type. You could add them again... but why not use the IDE features.

kristian-clausal commented 4 months ago

I'm not using an IDE.

kristian-clausal commented 4 months ago

clean_value() should be able to receive stuff from elsewhere other than clean_node (which would generate code from a parse tree that would have cleaner syntax).

xxyzz commented 4 months ago

The regex at here removes the spaces: https://github.com/tatuylonen/wiktextract/blob/47f74c7e93d0cc20ffd93c729d2794172362ca24/src/wiktextract/clean.py#L1506-L1511

but not the second link regex: https://github.com/tatuylonen/wiktextract/blob/47f74c7e93d0cc20ffd93c729d2794172362ca24/src/wiktextract/clean.py#L1512-L1518

performance probably won't be a concern for this simple pattern and short text. I'll use the namespace data to crate a regex pattern at the clean.py file since the pattern is only used there.

kristian-clausal commented 4 months ago

repl_link_bars, which is the one beneath that, is the one that actually uses the regex; AFAICT, if you have |, repl_link_bars is the one that handles the link (and the whitespace shouldn't be removed).

I tried to make a test for clean_value() on something with [[ File :...]], but it didn't work...

kristian-clausal commented 4 months ago

I figured out why my test didn't work, I didn't have any | in my test! lol

Image links without vbars should be also handled as images. Both repl_link and repl_link_bars might need to check for File...

kristian-clausal commented 4 months ago

No, what am I talking about. repl_link should have passed, but the link I was testing was eaten by repl1.

kristian-clausal commented 4 months ago

Does this handle File :? There can be whitespace.

xxyzz commented 4 months ago

I also updated the tests in https://github.com/tatuylonen/wiktextract/pull/542 and they passed locally, but the test in that pr fails because it need commit in this pr.