thunlp / ERNIE

Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"
MIT License

Missing numbers using wikiextractor #46

Closed mpagli closed 4 years ago

mpagli commented 4 years ago

Hey,

Thanks for the nice work. I just wanted to point out an open wikiextractor issue, in case you are not aware of it: https://github.com/attardi/wikiextractor/issues/189

Some numbers are missing from the output. Here is an example:

Andorra is the <a href="European%20microstates">sixth-smallest nation in Europe</a>, having an area of and a population of approximately .

Instead of:

Andorra is the <a href="European%20microstates">sixth-smallest nation in Europe</a>, having an area of 468 square kilometers (181 sq mi) and a population of approximately 77,006
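
A rough way to check whether an extracted corpus shows this symptom is to look for quantity phrases that are followed directly by "and" or by punctuation. The sketch below is only a heuristic under that assumption; `GAP_PATTERN` and `find_suspect_gaps` are hypothetical names, not part of wikiextractor or ERNIE, and the phrase list is illustrative rather than complete.

```python
import re

# Hypothetical heuristic: a quantity phrase followed directly by "and" or by
# punctuation suggests that a number was dropped during extraction.
GAP_PATTERN = re.compile(
    r"\b(?:area of|population of|approximately|about)\s+(?:and\b|[.,;])",
    re.IGNORECASE,
)


def find_suspect_gaps(text):
    """Return the fragments of `text` that look like dropped numbers."""
    return [m.group(0) for m in GAP_PATTERN.finditer(text)]


# The two sentences from the example above (link markup omitted).
broken = "having an area of and a population of approximately ."
intact = ("having an area of 468 square kilometers (181 sq mi) "
          "and a population of approximately 77,006")

print(find_suspect_gaps(broken))
print(find_suspect_gaps(intact))
```

On the two example sentences, the broken one is flagged twice and the intact one not at all. Real text will produce some false positives and misses, so this is meant for spot-checking a sample, not for filtering a corpus.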

Are the published results based on a wiki corpus with missing numbers or is it a recent bug?

zzy14 commented 4 years ago

I think it is a recent bug. The published results are based on the 2018 version.