gucorpling / amalgum

English web corpus with 4M tokens and several annotation types

Figures #2

Closed amir-zeldes closed 4 years ago

amir-zeldes commented 5 years ago

The WikiExtractor used by the voyage scraper currently destroys figures and their captions.

lgessler commented 4 years ago

Closing, since we don't intend to maintain the scraping code.