google / corpuscrawler

Crawler for linguistic corpora

[gd] Extend Scottish Gaelic corpus #12

Closed brawer closed 6 years ago

brawer commented 7 years ago

For Scottish Gaelic, https://dasg.ac.uk/text/ now contains plaintext files, which makes the site easier to crawl than before. Some material is multilingual, but it’s already language-tagged with a custom scheme using tags such as <eng> and <gai>. For example, https://dasg.ac.uk/text/68.txt has English sections that are marked up like this; a trivial regexp substitution should be able to remove the English sections:

Dh’fhosgail e i; is léugh e:
<eng>The Queen, who is lying very ill, urges your immediate attendance.
(Signed) Eveleyn Marlborough.<gai>
“Ma thig am Prionnsa,” thuirt e, ...
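
A rough sketch of that substitution in Python, for illustration only. It assumes an <eng> tag opens an English span that runs until the next <gai> tag (or to the end of the text if none follows); the full tag set used on the site may be larger, and the name `strip_english` is made up here, not part of the crawler.

```python
import re

# Assumption: <eng> starts an English span that ends at the next <gai>,
# or at the end of the text when no <gai> follows.
ENGLISH_SPAN = re.compile(r"<eng>.*?(?:<gai>|\Z)", re.DOTALL)
# Any language tags left over once the English spans are gone.
LANG_TAG = re.compile(r"<(?:eng|gai)>")


def strip_english(text):
    """Remove <eng>...<gai> spans, then drop any leftover language tags."""
    without_english = ENGLISH_SPAN.sub("", text)
    return LANG_TAG.sub("", without_english)


sample = (
    "Dh’fhosgail e i; is léugh e:\n"
    "<eng>The Queen, who is lying very ill, urges your immediate attendance.\n"
    "(Signed) Eveleyn Marlborough.<gai>\n"
    "“Ma thig am Prionnsa,” thuirt e, ...\n"
)
print(strip_english(sample))
```

On the sample this leaves the two Gaelic lines with an empty line between them, so a real crawler would probably also want to collapse blank lines afterwards.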

/cc @jimregan

jimregan commented 7 years ago

I'll take a look; if Scots Gaelic texts are anything like Irish ones, there's bound to be bits of inline English -- if that's marked up in the same way, it might be better to retain it.
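
One way to keep that option open would be to split the text into language-labelled segments first and decide later what to drop. A minimal sketch, assuming only the <eng> and <gai> tags exist, that a tag switches the language of everything after it, and that untagged text at the start is Gaelic; `split_by_language` is a hypothetical helper, not something in the repository.

```python
import re

# Assumption: only these two tags occur; each one switches the language
# of the text that follows it.
TAG = re.compile(r"<(eng|gai)>")


def split_by_language(text, default="gai"):
    """Return a list of (language, chunk) pairs in document order."""
    segments = []
    lang = default  # assumed language of text before the first tag
    pos = 0
    for match in TAG.finditer(text):
        chunk = text[pos:match.start()]
        if chunk:
            segments.append((lang, chunk))
        lang = match.group(1)
        pos = match.end()
    tail = text[pos:]
    if tail:
        segments.append((lang, tail))
    return segments


# Keep everything but drop the tags, or filter on the language label instead.
segments = split_by_language("Gaelic text <eng>inline English<gai> more Gaelic")
untagged = "".join(chunk for _, chunk in segments)
gaelic_only = "".join(chunk for lang, chunk in segments if lang == "gai")
```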

jimregan commented 7 years ago

I've started a scraper for this in #16

brawer commented 6 years ago

Thank you! Uploaded token counts, linked from README file.