voyanttools / trombone


Feature request: recursive import of a corpus #30

Open Conal-Tuohy opened 1 year ago

Conal-Tuohy commented 1 year ago

At present Voyant allows you to import a corpus in an XML format from a single URL or from a list of URLs.

I'd like to suggest adding the ability to ingest from a "linked list" of URLs, where the user provides a single URL and the remaining URLs are retrieved recursively. That is, the resource which Voyant retrieves from the first URL itself contains a link to the second "page" of text, which contains a link to a third page, and so on, until the final resource contains no further link.

The user would need to be able to provide one additional XPath parameter when importing the corpus (called e.g. Next or similar), to identify the element or attribute in the XML data which contains the link to the next page. For example, in a corpus of TEI elements contained in a teiCorpus wrapper element, the teiCorpus element can bear a next attribute whose semantics are defined in this way, so the default XPath expression for a TEI import could be //*[local-name()='teiCorpus']/@next.
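To make the idea concrete, here is a rough Python sketch of the retrieval loop. It is not Voyant or Trombone code: the names iter_pages, tei_next_link and fetch_url are made up for illustration, and because the standard library's ElementTree cannot evaluate the attribute-selecting XPath above, the lookup is hard-coded as an equivalent stand-in. Resolving relative links against the current page's URL, stopping on a repeated URL, and the max_pages cap are also assumptions on my part.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Tuple
from urllib.parse import urljoin
from urllib.request import urlopen


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def tei_next_link(root: ET.Element) -> Optional[str]:
    """Hypothetical stand-in for //*[local-name()='teiCorpus']/@next."""
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip anything that is not an ordinary element
        if local_name(element.tag) == "teiCorpus" and element.get("next"):
            return element.get("next")
    return None


def fetch_url(url: str) -> bytes:
    """Default fetcher: retrieve the raw bytes of a resource."""
    with urlopen(url) as response:
        return response.read()


def iter_pages(
    start_url: str,
    fetch: Callable[[str], bytes] = fetch_url,
    find_next: Callable[[ET.Element], Optional[str]] = tei_next_link,
    max_pages: int = 1000,
) -> Iterator[Tuple[str, ET.Element]]:
    """Yield (url, parsed root) for each page, following 'next' links."""
    seen = set()
    url: Optional[str] = start_url
    # Stop when there is no further link, a URL repeats, or the cap is hit.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        root = ET.fromstring(fetch(url))
        yield url, root
        link = find_next(root)
        # Assumption: relative links resolve against the current page's URL.
        url = urljoin(url, link) if link else None
```

Passing the link finder in as a parameter mirrors the proposed Next XPath parameter: a different format only needs a different find_next, while the loop stays the same.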

This kind of approach would also work for other XML formats such as Atom, which has link elements for this purpose, e.g. <link rel="next" href="http://example.org/index.atom?page=2"/>
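For the Atom case, the next-page lookup might look something like the sketch below. Again this is only an illustration: atom_next_link is a hypothetical helper, not part of Voyant or Trombone, and it assumes the paging link is a direct child of the feed's root element in the standard Atom namespace.

```python
import xml.etree.ElementTree as ET
from typing import Optional

ATOM_NS = "http://www.w3.org/2005/Atom"


def atom_next_link(feed_xml: str) -> Optional[str]:
    """Return the href of the feed-level <link rel="next">, or None."""
    root = ET.fromstring(feed_xml)
    # Only direct children of the root are checked, so links belonging
    # to individual entries are ignored.
    for link in root.findall(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "next":
            return link.get("href")
    return None
```

A feed with no rel="next" link yields None, which corresponds to the final page of the corpus.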

Conal-Tuohy commented 1 year ago

Maybe this issue belongs on the Trombone repo? Apologies if so.