mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0

Upgrade to latest en-wiki export #91

Open mikemccand opened 3 years ago

mikemccand commented 3 years ago

The enwiki files we use for luceneutil benchmarks, including the nightly benchmarks, are very, very old by now, almost a decade: /l/data/enwiki-20110115-lines-1k-fixed.bin.

Also, how these files were created is not exactly clear.

Let's 1) upgrade to the latest enwiki export, and 2) document the tools used to create the binary block form of this line documents source file.

msokolov commented 3 years ago

A "stretch goal" would be to automate refreshing the wiki docs with some scripting - I recently added an Ant build.xml, or this could use Python. That would be better than a list of instructions in a README, although a README would be great too!

mikemccand commented 3 years ago

+1, that'd be wonderful!

I'm starting by creating a Python tool that invokes the numerous steps needed to create our line doc files ... after that it hopefully won't be so hard to fully automate :)

I have a start at the tool, and curiously it hit this exception last night:

```
11000351 articles extracted
first file is 65445166698 bytes
Traceback (most recent call last):
  File "/l/util.nightly/src/python/createEnglishWikipediaLineDocsFile.py", line 89, in <module>
    main()
  File "/l/util.nightly/src/python/createEnglishWikipediaLineDocsFile.py", line 69, in main
    WikipediaExtractor.process_data(f_in, splitter)
  File "/l/util.nightly/src/python/WikipediaExtractor.py", line 577, in process_data
    WikiDocument(output, id, title, ''.join(page))
  File "/l/util.nightly/src/python/WikipediaExtractor.py", line 136, in WikiDocument
    out.write(line.encode('utf-8'))
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 236: surrogates not allowed
```

Which I think means the XML export contained text that cannot be encoded as valid UTF-8 (an unpaired surrogate). I added a little more instrumentation, so I should know more later today -- it takes hours to run now! I gotta figure out how to make these tools concurrent :)
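
For reference, a lone surrogate code point is enough to reproduce this error in Python 3; a minimal standalone sketch, independent of the extractor code:

```python
# Code points U+D800-U+DFFF are reserved for UTF-16 surrogate pairs, so an
# unpaired one cannot be encoded as UTF-8; this raises UnicodeEncodeError:
# "'utf-8' codec can't encode character '\ud801' ...: surrogates not allowed"
lone_surrogate = chr(0xD801)
lone_surrogate.encode('utf-8')
```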

msokolov commented 3 years ago

I'm not sure, but I think that error means Python Unicode support is still terrible! Maybe not - I think there is some error handling available? Yeah, https://docs.python.org/3/library/stdtypes.html#bytes.decode takes an errors="replace" arg that should help.
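
The same errors argument also exists on str.encode, which is the call that actually failed here. A quick sketch of both directions, with made-up sample data:

```python
# invalid UTF-8 bytes (a UTF-8-style encoding of the surrogate U+D801)
data = b'abc \xed\xa0\x81 xyz'
print(data.decode('utf-8', errors='replace'))  # bad bytes become U+FFFD

# a lone surrogate in a str, as in the exception above
text = 'an XML surrogate (\ud801).'
print(text.encode('utf-8', errors='replace'))  # the surrogate becomes b'?'
```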

mikemccand commented 3 years ago

Yeah, I switched to errors='replace' when catching this exception, and printed the offending string character by character. I suspect this means Python is using a UTF-16 representation in memory and there was an invalid surrogate pair? Not sure, but yeah, it is disappointing!
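
A minimal sketch of that instrumentation, assuming the failing write site in WikipediaExtractor.py (this handler is illustrative, not the actual patch):

```python
try:
    out.write(line.encode('utf-8'))
except UnicodeEncodeError:
    # dump the offending string one code point at a time, as "index: hex"
    for i, ch in enumerate(line):
        print('%d: %x' % (i, ord(ch)))
    # keep going, replacing the unencodable code point with '?'
    out.write(line.encode('utf-8', errors='replace'))
```

The run produced: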

```
[mike@beast3 data]$ python3 -u /l/util.nightly/src/python/createEnglishWikipediaLineDocsFile.py /tmp/out.txt
Using temp dir "/b/tmp"
first file is 65445166698 bytes
0: 49
1: 20
2: 74
3: 72
4: 69
...
234: 20
235: 28
236: d801
237: 29
238: 2e
...
436: 70
437: 3f
8645.0 sec to run WikipediaExtractor
```

mikemccand commented 3 years ago

Well, this is fun -- I printed the above text fragment, which indeed contains a single unpaired high surrogate, and this is the text:

I tried to parse the 20050909 Wikipedia dump (pages_current.xml) with Expat (through Python 2.4), and it failed with exception "reference to invalid character number" on the Chinese for Bohrium, which was written with an XML surrogate (? -- \d801). What should I do? Where is the failure in standards compliance, in Expat or the Wikipedia dump (or is it possible that I messed up something?)? What parser do you use/recommend for parsing the dump?

It's somewhat hilarious to me that this poor soul, ~14 years ago, was struggling to parse a Wikipedia dump, hit this same unpaired surrogate, and left a comment about the struggle -- a comment that survives in Wikipedia through to today's export, which is still, after 14 years, not properly encoded!!

mikemccand commented 3 years ago

I wrote this silly little tool to print the actual text fragment:

```python
s = """
...the lines above...
"""

# decode each "index: hex" line back into a character
l = []
for line in s.splitlines()[1:]:
    tup = line.split(':')
    l.append(chr(int(tup[1].strip(), 16)))

s = ''.join(l)
# encode to ascii with errors='replace' so the lone surrogate prints as '?'
print(s.encode('ascii', errors='replace'))
```

mikemccand commented 3 years ago

OK it looks like this page is leading to the exception above -- search for &#xD801;.

@msokolov located the likely issue -- the WikipediaExtractor tool we use is not a true XML parser; it uses regexps to make a "best effort" attempt to extract plain text from the exported XML, and it likely double-unescapes character references, which would turn the harmless escaped &amp;#xD801; in the export into a real unpaired surrogate.
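
To illustrate the suspected failure mode, here is a hypothetical sketch; the function and regexp are invented for illustration, not WikipediaExtractor's actual code:

```python
import re

def naive_unescape(text):
    # a best-effort "unescape everything" pass with no surrogate check:
    # it unescapes &amp; and numeric character references together
    text = text.replace('&amp;', '&')
    return re.sub(r'&#x([0-9A-Fa-f]+);',
                  lambda m: chr(int(m.group(1), 16)), text)

# The wikitext literally contains "&#xD801;", so the XML export escapes the
# ampersand: "&amp;#xD801;". One correct XML unescape yields the harmless
# 8-character string "&#xD801;", but the double unescape above produces a
# real unpaired surrogate:
s = naive_unescape('an XML surrogate (&amp;#xD801;).')
print(ascii(s))  # 'an XML surrogate (\ud801).'
```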

mikemccand commented 3 years ago

I think I have a script working, but I remain mystified because the new binary 1K line docs file is ~14 GB, while the old one (from 9 years ago!) was ~24 GB. Surely English Wikipedia has grown substantially in the past 9 years, so there must be some bug in the script ...