Open hut8 opened 1 year ago
Thanks for opening this! I don't have time to investigate this myself, but would gladly merge in a PR if you put one together with a fix.
I think this should be easy to fix with a more sophisticated splitting regex: rather than splitting on `),(`, we can split on `\'\),\(([1-9][0-9]*,[0-9]+,\')`.
While this isn't absolutely foolproof (i.e. someone could make an adversarial redirect that would screw us), it's quite unlikely that this will happen.
`),(` is already unlikely; now `'),(1,0,'` is much more unlikely, as it requires an extra `'` before and `'` after.
I'm rewriting the pipeline in snakemake to make it easier to debug and run in parallel.
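To make the splitting idea above concrete, here is a minimal Python sketch. It is not the pipeline's actual code: `ROW_BOUNDARY`, `split_rows`, and the sample rows are hypothetical, and the pattern is a lookaround variant of the regex proposed above, chosen so that the closing quote and the leading `id,namespace,'` stay attached to their rows instead of being consumed by the split.

```python
import re
from typing import List

# Hypothetical names, for illustration only.
# Split on "),(" only when it is preceded by a quote and followed by
# something that looks like the start of a row: <id>,<namespace>,'
ROW_BOUNDARY = re.compile(r"(?<=')\),\((?=[1-9][0-9]*,[0-9]+,')")


def split_rows(values: str) -> List[str]:
    """Split the VALUES part of an INSERT statement into row strings."""
    inner = values.strip()
    # Drop the outermost parentheses so every row has the same shape.
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    return ROW_BOUNDARY.split(inner)


# Made-up sample: the second title contains "),(" inside the quoted string.
sample = "(10,0,'A','',''),(12,0,'Foo),(bar','',''),(13,0,'Baz','','')"

naive_rows = sample[1:-1].split("),(")  # breaks the second row in two
regex_rows = split_rows(sample)         # keeps the second row intact
```

On this sample the naive split yields four pieces, while the regex split yields three. As noted above, a title that itself contains a quote followed by `),(1,0,'` would still fool it.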
Thanks for the help @corneliusroemer - I would gladly merge in the snakemake PR if you're willing to open it.
@hut8 - Thanks for getting the conversation started!
I'm trying to import the dump from enwiki-20221001.
This ends up creating this line (which has the wrong title, and also has only two columns instead of three) in pages.txt.gz:
Here's some context for surrounding lines:
I will do some more research on this shortly.