sgsinclair / trombone

GNU General Public License v3.0

DToC: Re-work DtocIndex to support multi-file uploads #8

Closed ajmacdonald closed 3 years ago

ajmacdonald commented 5 years ago

Currently, DtocIndex can only find the index if a single XML file was uploaded. It needs to be re-worked to support the case where multiple XML files are uploaded.

https://github.com/sgsinclair/trombone/blob/262431223f202abc525e54bc5ca21a2cf63af69f/src/main/java/org/voyanttools/trombone/tool/corpus/DtocIndex.java#L99-L103

sgsinclair commented 5 years ago

What's the use case for having multiple documents? I thought DToC was oriented around a single volume.

ilovan commented 5 years ago

Hi @sgsinclair. The CWRC-integrated version of DToC is likely to be used by people who want to group different sets of documents into a variety of corpora. For example, a project that collected all the journalistic contributions of Francophone women journalists in Canada may want to create one corpus gathering all the contributions of a single journalist, and another including all the WWI-related contributions. The same document could be included in both corpora (if it was written by that journalist and deals with WWI), hence the need to support multi-document corpora as much as possible.

sgsinclair commented 5 years ago

Thanks. If there's a need to combine multiple documents together then I think the user should do that, using TEICorpus, for instance. We could theoretically wrap documents for the user, but there are a lot of things that can go wrong, with namespaces, processing instructions, etc.

If it still seems important we can do it (I think), but the option would likely only appear in the DToC interface and any documentation should have blinking lights warning about the perils of automated document wrapping.

SusanBrown commented 5 years ago

I thought it was initially designed to work with multiple documents--my memory may be faulty but your comment is a surprise to me, Stefan, since I have a pretty clear memory of the Voyant version working to combine, for instance, multiple Shakespeare plays from different files into a single DToC edition.

I'm not sure what TEICorpus is. When you say the DToC interface do you mean the CWRC interface?

ilovan commented 5 years ago

For context, teiCorpus is an alternative root element: a TEI file can have a teiCorpus root containing multiple TEI children. However, it is in the process of being deprecated; see the discussion at http://tei-l.970651.n3.nabble.com/Nesting-TEI-and-deprecation-of-teiCorpus-td4032022.html
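To make the wrapping idea concrete, here is a minimal sketch of combining several TEI documents under one teiCorpus root with the standard Java DOM API. It is only an illustration: the class name `TeiCorpusWrapper` and its `wrap` method are hypothetical, not part of Trombone, and say nothing about how DtocIndex works. A schema-valid teiCorpus would also need its own teiHeader, which is omitted here.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not a Trombone class.
final class TeiCorpusWrapper {
    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    // Parses each XML string and appends its root element to a new teiCorpus root.
    public static Document wrap(List<String> teiDocs) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep the TEI namespace intact
            DocumentBuilder builder = factory.newDocumentBuilder();

            Document corpus = builder.newDocument();
            Element root = corpus.createElementNS(TEI_NS, "teiCorpus");
            corpus.appendChild(root);

            for (String xml : teiDocs) {
                Document doc = builder.parse(new InputSource(new StringReader(xml)));
                // Only the root element is copied: processing instructions and
                // comments that sit outside it are lost, one of the perils
                // of automated wrapping mentioned earlier in the thread.
                Node imported = corpus.importNode(doc.getDocumentElement(), true);
                root.appendChild(imported);
            }
            return corpus;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not wrap documents", e);
        }
    }
}
```

Calling `TeiCorpusWrapper.wrap` with a list of XML strings returns a single DOM document whose root is teiCorpus and whose children are the original TEI root elements, in input order.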

ajmacdonald commented 3 years ago

https://github.com/sgsinclair/trombone/commit/4c3602f2ac8e2e29298f86bdf76bf4652a7729f4
https://github.com/sgsinclair/trombone/commit/d718a4c4e08ee34902ee3c79bdd426da9a0db8b8
https://github.com/sgsinclair/trombone/commit/766cf853da8c6758ddeb51340d5d8f79b1fc4793