Closed xxyzz closed 4 months ago
Because the File:
and Image:
tokens can contain whitespace (I tried out File :
for example), a simple startswith isn't enough, exactly. A simple regex might be cheaper than a chain of .strips() and partitionings.
I also just committed a small bugfix to the file_alias code a couple of minutes ago before this, please check it out just in case.
I also don't understand your obsession with removing annotations when they act as perfectly good documentation.
I though the white spaces are removed in the regex patterns in the wiktextract.clean_py
file? Then we could compare the link with namespace prefixes without the ":" character.
Usually the types could be found by type checking tools shouldn't be added, unless IDE tools can't show the type then we should add the type. You could add them again... but why not use the IDE features.
I'm not using an IDE.
clean_value() should be able to receive stuff from elsewhere other than clean_node (which would generate code from a parse tree that would have cleaner syntax).
The regex at here removes the spaces: https://github.com/tatuylonen/wiktextract/blob/47f74c7e93d0cc20ffd93c729d2794172362ca24/src/wiktextract/clean.py#L1506-L1511
but not the second link regex: https://github.com/tatuylonen/wiktextract/blob/47f74c7e93d0cc20ffd93c729d2794172362ca24/src/wiktextract/clean.py#L1512-L1518
performance probably won't be a concern for this simple pattern and short text. I'll use the namespace data to crate a regex pattern at the clean.py
file since the pattern is only used there.
repl_link_bars, which is the one beneath that, is the one that actually uses the regex; AFAICT, if you have |
, repl_link_bars is the one that handles the link (and the whitespace shouldn't be removed).
I tried to make a test for clean_value() on something with [[ File :...]]
, but it didn't work...
I figured out why my test didn't work, I didn't have any |
in my test! lol
Image links without vbars should be also handled as images. Both repl_link and repl_link_bars might need to check for File
...
No, what am I talking about. repl_link
should have passed, but the link I was testing was eaten by repl1.
Does this handle File :
? There can be whitespace.
I also updated the tests in https://github.com/tatuylonen/wiktextract/pull/542 and they passed locally, but the test in that pr fails because it need commit in this pr.
wiktextract code could use
str.startswith()
andWtp.namespace_prefixes()
to replace theWtp.file_aliases_re
regex.and remove some type annotations that type checker could find
I though the French namespace JSON file doesn't have the local "Fichier" string yesterday, but it has this value.