jwngr / sdow

Six Degrees of Wikipedia
https://www.sixdegreesofwikipedia.com
MIT License
1.76k stars 91 forks source link

"Trimming pages file" is parsing records wrong when the title contains parenthesis #100

Open hut8 opened 1 year ago

hut8 commented 1 year ago

I'm trying to import the dump from enwiki-20221001

+----------+--------------------+
| page_id  | page_title         |
+----------+--------------------+
| 71701640 | 104-2,3,(6),(7),11 |
+----------+--------------------+

This ends up creating this line (which has the wrong title, and also has only 2 columns instead of three) in pages.txt.gz:

71701640   104-2,3,(6

Here's some context for surrounding lines:

71701608        Alberta_Sovereignty_Act 0
71701611        Homeland_Defence_Act    0
71701613        Berlin_Nobody   0
71701617        Miss_Grand_Nepal_2022   0
71701639        Pgm2_c  1
71701640   104-2,3,(6
71701649        2022_Binh_Duong_karaoke_bar_fire        0
71701668        Chapel_of_the_Christ,_San_Pablo_del_Monte       0
71701673        Ximena_Aguilera 0
71701676        Wedding_dress_of_Katharine_Worsley      0
71701682        2022–23_Central_Michigan_Chippewas_men\'s_basketball_team       0

I will do some more research on this shortly.

jwngr commented 1 year ago

Thanks for opening this! I don't have time to investigate this myself, but would gladly marge in a PR if you put one together with a fix.

corneliusroemer commented 9 months ago

I think this should be easy to fix with a more sophisticated splitting regex, rather than splitting on ),(, we can split on \'\),\(([1-9][0-9]*,[0-9]+,\')

While this isn't absolutely foolproof (i.e. someone could make an adversarial redirect that would screw us), it's quite unlikely that this will happen.

),( is already unlikely, now '),(1,0,' is much more unlkely, as it requires an extra:

I'm rewriting the pipeline in snakemake to make it easier to debug and run in parallel.

jwngr commented 9 months ago

Thanks for the help @corneliusroemer - I would gladly merge in the snakemake PR if you're willing to open it.

@hut8 - Thanks for getting the conversation started!