tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Wiki dump downloads temporarily corrupted WAS: Too deep recursion with Template:syn-saurus #894

Open kristian-clausal opened 6 days ago

kristian-clausal commented 6 days ago

Parsing the page '今生' on the newest commit gives the error below; this is using a .db from the newest enwiktionary dumpfile.

今生/Chinese/noun: ERROR: too deep recursion during template expansion at ['今生', 'syn-saurus', '#invoke', '#invoke', 'Lua:saurus:saurus()', 'frame:expandTemplate()', 'col3', '#invoke', '#invoke', 'Lua:columns:display()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()', 'm', '#invoke', '#invoke', 'Lua:links/templates:l_term_t()', 'frame:preprocess()',

EDIT:

Oh thank god it's not our fault:

XML dumps are paused: https://lists.wikimedia.org/hyperkitty/list/xmldatadumps-l@lists.wikimedia.org/thread/BXWJDPO5QI2QMBCY7HO36ELDCRO6HRM4/

The notice says the 20241020 dump files "may have underlying data quality issues"; the 20241001 files are good.

Originally posted by @xxyzz in https://github.com/tatuylonen/wiktextract/issues/894#issuecomment-2453852199

kristian-clausal commented 6 days ago

There has been activity in Module:links at the start of this month.

kristian-clausal commented 5 days ago

Yeah, afaict this is just a bug in a common module that slipped through into the current dumpfile, so I've reverted the dumpfile used in the Kaikki.org regeneration to the last one from 20241001.

Leaving this thread open so that when I see it in six months I can go "oh shit I need to change the dumpfile back to latest".
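For anyone pinning a dump the same way, here is a minimal sketch of building the download URL for a fixed dump date and skipping dates known to be bad. The helper names (`dump_url`, `pick_dump`, `KNOWN_BAD_DUMPS`) are hypothetical and are not part of wiktextract or the Kaikki.org regeneration scripts; the URL layout is the standard one on dumps.wikimedia.org.

```python
from datetime import datetime

# Hypothetical blocklist; 20241020 is the date flagged in the xmldatadumps-l notice.
KNOWN_BAD_DUMPS = {"20241020"}


def dump_url(wiki: str, date: str) -> str:
    """Build the pages-articles URL for a given wiki and YYYYMMDD dump date."""
    datetime.strptime(date, "%Y%m%d")  # raises ValueError on a malformed date
    return (
        f"https://dumps.wikimedia.org/{wiki}/{date}/"
        f"{wiki}-{date}-pages-articles.xml.bz2"
    )


def pick_dump(candidates, bad=KNOWN_BAD_DUMPS):
    """Return the newest candidate date that is not on the blocklist, or None."""
    good = sorted(d for d in candidates if d not in bad)
    return good[-1] if good else None


date = pick_dump(["20241001", "20241020"])
print(dump_url("enwiktionary", date))
```

Removing a date from the blocklist (or dropping the pin) is then the one-line change to make once a good dump is available again.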

xxyzz commented 5 days ago

Should it be fixed in the next dump file (20241101)?

kristian-clausal commented 5 days ago

Well, hopefully. It's possible this isn't a bug on the Wiktionary side but a bug on our side that only shows up with this dumpfile. Still, it doesn't cost anything to wait and see in this minor case.

xxyzz commented 3 days ago

The changes to some of the Lua modules used were made before Oct 20, not after, so I think it's likely a bug in our code. Here is what I found: the code gets stuck in a loop when creating links for the last term on the page Thesaurus:今生, [[这]][[一]][[輩子]]. The bug could be in the Lua code that handles links in the "col3" template parameters.
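To make a loop like this easier to spot, the expansion stack from the error message can be collapsed into a prefix plus a repeating cycle. The following is a rough sketch only: `find_trailing_cycle` is a hypothetical helper, not something wiktextract provides, and the sample stack is the one from the error above (which is truncated in the original report).

```python
def find_trailing_cycle(stack, min_repeats=2):
    """Find the repeating cycle at the end of an expansion stack.

    Returns (prefix, cycle, repeats), or None if the tail does not repeat
    at least `min_repeats` times. The cycle may come out rotated relative
    to how a human would read it, because the trace can start or end
    mid-cycle.
    """
    n = len(stack)
    best = None  # (covered_length, period, start)
    for period in range(1, n // min_repeats + 1):
        # Walk backwards while each frame equals the frame one period earlier.
        i = n - 1
        while i - period >= 0 and stack[i] == stack[i - period]:
            i -= 1
        start = i + 1 - period
        covered = n - start
        # Prefer the longest periodic tail; on ties the smaller period wins.
        if covered >= min_repeats * period and (best is None or covered > best[0]):
            best = (covered, period, start)
    if best is None:
        return None
    covered, period, start = best
    return stack[:start], stack[start:start + period], covered // period


# The stack from the error message in this issue.
stack = [
    "今生", "syn-saurus", "#invoke", "#invoke", "Lua:saurus:saurus()",
    "frame:expandTemplate()", "col3", "#invoke", "#invoke",
    "Lua:columns:display()", "frame:preprocess()",
] + ["m", "#invoke", "#invoke", "Lua:links/templates:l_term_t()",
     "frame:preprocess()"] * 8

prefix, cycle, repeats = find_trailing_cycle(stack)
print(repeats, cycle)
```

On this trace the cycle is the five frames around `Lua:links/templates:l_term_t()`, which points at the `m` template being expanded from inside its own link-handling code.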

kristian-clausal commented 3 days ago

There's a difference between the dump files (one dump file works, the other one doesn't), so there's something that's changed. It can be a bug on our side, but we can wait until next week for a new dumpfile and test if it works, and if it doesn't then we can try to figure things out on our side.

xxyzz commented 16 hours ago

XML dumps are paused: https://lists.wikimedia.org/hyperkitty/list/xmldatadumps-l@lists.wikimedia.org/thread/BXWJDPO5QI2QMBCY7HO36ELDCRO6HRM4/

The notice says the 20241020 dump files "may have underlying data quality issues"; the 20241001 files are good.

kristian-clausal commented 16 hours ago

Well, that explains it then! We'll just have to wait for a good dump.