brawer closed this 6 years ago
I'll take a look; if Scots Gaelic texts are anything like Irish ones, there's bound to be bits of inline English -- if that's marked up in the same way, it might be better to retain it.
I've started a scraper for this in #16
Thank you! I've uploaded token counts and linked them from the README file.
For Scottish Gaelic, https://dasg.ac.uk/text/ now contains plaintext files, which makes it easier to crawl than before. Some material is multilingual, but it's already language-tagged with a custom tagging scheme using tags such as `<eng>` and `<gai>`. For example, https://dasg.ac.uk/text/68.txt has English sections that are marked up like this; a trivial regexp substitution should be able to remove the English sections.

/cc @jimregan
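A minimal sketch of that substitution in Python, under assumptions not confirmed in this thread: that English sections are wrapped as `<eng>...</eng>` with a matching closing tag, and that the remaining `<gai>` tags can simply be dropped. The function name `strip_english` and the sample string are made up for illustration; the real files may need a different pattern.

```python
import re

# Assumption: English spans look like <eng>...</eng> and may cross line breaks.
ENG_SPAN = re.compile(r"<eng>.*?</eng>", re.DOTALL)
# Assumption: the <gai> wrappers carry no content of their own and can be dropped.
GAI_TAGS = re.compile(r"</?gai>")


def strip_english(text: str) -> str:
    """Remove tagged English sections, then unwrap the Gaelic tags."""
    without_english = ENG_SPAN.sub("", text)
    return GAI_TAGS.sub("", without_english)


# Hypothetical sample, not taken from the corpus.
sample = "<gai>Tha mi sgìth.</gai> <eng>I am tired.</eng> <gai>Oidhche mhath.</gai>"
print(strip_english(sample))
```

The non-greedy `.*?` keeps each match to a single tagged section, so Gaelic text between two English sections survives. If the inline English should be retained rather than removed, only the `ENG_SPAN` step would need to change.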