Open punkish opened 2 years ago
As you can see above, the id is different and the innerText is different, but all other attributes are exactly the same, referring to the same images. What is the implication of having different id for the figureCitations? With respect to the parent treatment, should these be considered as two different figureCitations pointing to the exact same figure with all the attributes identical?
The id is the XML element ID of the figureCitation element proper, whereas the httpUri, captionStartId, captionStart, captionText, etc. are copied over from the caption of the cited figure (to make each individual figure citation actionable in HTML, for instance).
Hope that clarifies how things are intended to work ... multiple figureCitations with the same id attribute sounds curious to me, though, as that kind of thing should not happen ... can you provide me with an example?
Unfortunately not yet… I don't completely understand the following statement:
The id is the XML element ID of the figureCitation element proper
Does that mean it is the id of the figureCitation? If not, what is the correct id to use to identify a unique figureCitation? For example, in the above cited treatment, since the same captionStart="Figures 6–13" appears 14 times, what exactly is the unique figureCitation in this case? This would help me as I would like to identify and store each figureCitation as an atomic element in my db.
Indeed, id is the ID of the figureCitation, and thus should uniquely identify a figure citation within a treatment (and parent article, for that matter).
captionStart is mainly for display purposes (the "View […]" link text), as is captionText, which becomes the tooltip/title (basically the hover text) of the link.
captionStartId is what logically links a figureCitation to the cited caption, referencing the startId attribute of the latter.
That said, if you want each cited figure exactly once per treatment, captionStartId is what to deduplicate by ... in fact, that's how the TreatmentBank HTML pages do it when generating the "Figures" preview area.
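A minimal sketch of the deduplication described above, assuming the treatment XML is parsed with Python's standard xml.etree.ElementTree. The sample document is a trimmed-down, hypothetical treatment (shortened ids, most attributes omitted), and keeping the first citation per captionStartId is an illustrative choice, not necessarily what the TreatmentBank pages do.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down treatment: ids are shortened and most
# attributes are omitted for readability.
SAMPLE = """<treatment>
  <paragraph>Vertex (Figs <figureCitation id="C1F7" captionStartId="F2"
    captionStart="Figures 2–5"
    httpUri="https://binary.pensoft.net/fig/574423">2-5</figureCitation>)
  longer than pronotum (Fig. <figureCitation id="B096" captionStartId="F2"
    captionStart="Figures 2–5"
    httpUri="https://binary.pensoft.net/fig/574423">3</figureCitation>)
  </paragraph>
</treatment>"""


def cited_figures(treatment_xml: str) -> dict:
    """Return one attribute dict per distinct captionStartId.

    The first citation encountered for a given captionStartId wins;
    later citations of the same caption are skipped.
    """
    figures = {}
    for cit in ET.fromstring(treatment_xml).iter("figureCitation"):
        key = cit.get("captionStartId")
        if key is not None and key not in figures:
            figures[key] = dict(cit.attrib)
    return figures


figures = cited_figures(SAMPLE)
print(len(figures), list(figures))  # 1 ['F2']
```

Both citations survive in the XML; only the derived list of cited figures collapses to one entry per caption.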
OK, so to summarize, every figureCitation (with a different id) is a unique figureCitation, but its uniqueness doesn't stem from the httpUri that it points to (perhaps it stems from the innerText), as different figureCitations could point to the exact same figure and captionText, etc.
Am I correct in my understanding?
Your understanding is by and large correct, yes.
Only that the uniqueness of each individual figure citation stems from its being its own entity, its own annotation in the overall treatment text, which wouldn't be the same if you omitted either one of the figure citations. Technically, in IMFs the ID actually stems from the position of the figure citation in the page, the page ID, and the UUID of the parent article, and that applies to all annotations / XML elements.
hi @gsautter, I am revisiting this issue because I am convinced I have to rethink the design of my figureCitations table… your input would be very helpful.
Currently, I create and populate a table using the XML attributes. I ignore the inner text of the XML tags except for when I store the fulltext of the treatment. Given the example below from treatment https://tb.plazi.org/GgServer/xml/000040332F2853C295734E7BD4190F05 (I have reproduced only two figureCitations from this treatment but there are more identical ones)
<figureCitation
id="C1F77C1A81523CAB218B4F57FAE1C18B"
captionStart="Figures 2–5"
captionStartId="F2"
captionText="Figures 2 - 5. Saigona baiseensis Zheng & Chen sp. nov. 2 male, holotype, dorsal view 3 male, head and thorax, dorsal view 4 male, head, frons and clypeus, lateral view 5 male, head and pronotum, lateral view. Scale bars: 2 mm (2 - 5)."
figureDoi="10.3897/zookeys.1054.67004.figures2-5"
httpUri="https://binary.pensoft.net/fig/574423"
pageId="0"
pageNumber="185">2-5</figureCitation>
) longer than pronotum and mesonotum combined (1.45:1). Vertex (Fig.
<figureCitation
id="B09623422FE1A7E92015BCFC1B22562E"
captionStart="Figures 2–5"
captionStartId="F2"
captionText="Figures 2 - 5. Saigona baiseensis Zheng & Chen sp. nov. 2 male, holotype, dorsal view 3 male, head and thorax, dorsal view 4 male, head, frons and clypeus, lateral view 5 male, head and pronotum, lateral view. Scale bars: 2 mm (2 - 5)."
figureDoi="10.3897/zookeys.1054.67004.figures2-5"
httpUri="https://binary.pensoft.net/fig/574423"
pageId="0"
pageNumber="185">3</figureCitation>
Given the XML above, my figureCitations table ends up with duplicate records where the only difference is the figureCitationId. I am thinking that I should perhaps also store the innerText in order to make these records really distinct. Otherwise I don't really see how this information could be useful for any analysis. What do you think? Or, could you add the innerText also as an attribute of the figureCitation tag?
Further, if I create another table with distinct httpUris, I will have a one-to-many relationship between httpUri and figureCitationId. In a case like the one above, I have one httpUri and several related figureCitations with different ids but everything else identical. Could I also have related figureCitations where some attribute, for example captionText, is different? How do you suggest I handle the data in such cases?
Thanks
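A small sketch of how the innerText could be captured alongside the attributes, assuming Python's standard xml.etree.ElementTree. The sample fragment is hypothetical (shortened ids, reduced attributes), and the whitespace normalization is an illustrative choice. itertext() is used so that text inside nested elements such as number is included.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with shortened ids; the second citation has a
# nested <number> element, as some TreatmentBank records do.
SAMPLE = """<paragraph>Vertex (Figs
<figureCitation id="C1F7" captionStartId="F2"
  httpUri="https://binary.pensoft.net/fig/574423">2-5</figureCitation>)
and (
<figureCitation id="1330" captionStartId="F1"
  httpUri="http://dx.doi.org/10.5281/zenodo.35441">
Fig.
<number value="1.0">1</number>
</figureCitation>)</paragraph>"""


def citation_rows(xml_text: str) -> list:
    """Return (id, httpUri, innerText) tuples, one per figureCitation."""
    rows = []
    for cit in ET.fromstring(xml_text).iter("figureCitation"):
        # itertext() walks the element and its children but excludes the
        # element's own tail, so only the cited label is collected.
        inner = " ".join("".join(cit.itertext()).split())
        rows.append((cit.get("id"), cit.get("httpUri"), inner))
    return rows


rows = citation_rows(SAMPLE)
for row in rows:
    print(row)
```

With the inner text stored, two citations of the same figure differ in more than their id, which is the distinction asked about above.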
The id of an annotation identifies the annotation proper ... if you only want the cited figures, not the individual citations, I'd go for the httpUri attribute (the figureDoi might not always be present, as some source documents, especially in TaxPub, might come with the former, but without the latter).
Could I also have related figureCitations where some attribute, for example, captionText, is different. How do you suggest I handle the data in such cases?
No ... the captionText always is the same for the citation of the same figure, as is the httpUri, as well as the captionStart and captionStartId (the latter is the "foreign key" joining figure citations to figures, specifically the captions associated with them).
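One way to model this one-to-many relationship is sketched below with Python's built-in sqlite3 module. The table and column names are hypothetical, and scoping captionStartId to the treatment is an assumption made for illustration; the ids and httpUri come from the Saigona baiseensis example earlier in the thread, with the caption text shortened.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
con.executescript("""
CREATE TABLE figures (
    treatmentId    TEXT NOT NULL,
    captionStartId TEXT NOT NULL,
    captionStart   TEXT,
    captionText    TEXT,
    httpUri        TEXT,
    PRIMARY KEY (treatmentId, captionStartId)
);
CREATE TABLE figureCitations (
    figureCitationId TEXT PRIMARY KEY,
    treatmentId      TEXT NOT NULL,
    captionStartId   TEXT NOT NULL,
    innerText        TEXT,
    FOREIGN KEY (treatmentId, captionStartId)
        REFERENCES figures (treatmentId, captionStartId)
);
""")

TREATMENT = "000040332F2853C295734E7BD4190F05"

# One figure row; the caption text is shortened here.
con.execute(
    "INSERT INTO figures VALUES (?, ?, ?, ?, ?)",
    (TREATMENT, "F2", "Figures 2–5", "Figures 2 - 5. Saigona baiseensis",
     "https://binary.pensoft.net/fig/574423"),
)
# Two citations of that figure, distinguished by id and inner text.
con.executemany(
    "INSERT INTO figureCitations VALUES (?, ?, ?, ?)",
    [
        ("C1F77C1A81523CAB218B4F57FAE1C18B", TREATMENT, "F2", "2-5"),
        ("B09623422FE1A7E92015BCFC1B22562E", TREATMENT, "F2", "3"),
    ],
)

rows = con.execute("""
    SELECT f.httpUri, COUNT(c.figureCitationId)
    FROM figures AS f
    JOIN figureCitations AS c
      ON c.treatmentId = f.treatmentId
     AND c.captionStartId = f.captionStartId
    GROUP BY f.treatmentId, f.captionStartId
""").fetchall()
print(rows)
```

The attributes copied from the caption live once in figures, so the citation table only carries what is specific to each citation.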
Consider the following https://tb.plazi.org/GgServer/xml/03A25264CA15FFFAEF37FB524211FE1F where the httpUri is the same but the captionText is different (as are the captionStartId and captionStart):
<figureCitation id="1330FFF7CA16FFE5EC2AFECF434CFE8F" box="[520,585,290,317]" captionStart="Figure 1" captionStartId="15.[751,820,1352,1377]" captionTargetBox="[159,708,1067,1550]" captionTargetId="figure@15.[156,716,1049,1550]" captionTargetPageId="15" captionText="Figure 1. Male cerci in dorsal view (left) and paraprocts in ventral view with dotted outline of cerci (right) of Umma gumma sp. nov. and U. longistigma." httpUri="http://dx.doi.org/10.5281/zenodo.35441" pageId="15" pageNumber="460">
Fig.
<number id="3DC3F155CA16FFE5EC1EFECF434CFE8F" box="[572,585,290,317]" pageId="15" pageNumber="460" value="1.0">1</number>
</figureCitation>
and
<figureCitation id="1330FFF7CA15FFE6EFF4FAE64331FA94" box="[470,564,1291,1318]" captionStart="Photo 1" captionStartId="16.[159,225,1096,1120]" captionTargetBox="[159,1035,459,1055]" captionTargetId="figure@16.[159,1038,459,1058]" captionTargetPageId="16" captionText="Photo 1. Umma gumma, male; Moyabi, Gabon. Photo: JK (24 - ix- 2013)" httpUri="http://dx.doi.org/10.5281/zenodo.35441" pageId="12" pageNumber="457">
(Type Photo
<number id="3DC3F155CA15FFE6EC06FAE64331FA94" box="[548,564,1291,1318]" pageId="12" pageNumber="457" value="1.0">1</number>
</figureCitation>
which brings me to this specific treatment which seems to be identical to https://tb.plazi.org/GgServer/html/03A25264CA08FFFCEEC8FEA94509FE3A except for that Umma Gumma bit
I will keep the figureCitations table but build another derived table with httpUri as the primary key. Which brings me to the httpUri itself. Mostly it points to a figure on Zenodo. In some cases it points to a figure on Pensoft. But sometimes it is a DOI. That is problematic for an application like Ocellus, as the DOI can't be used in an <img> tag. Since a DOI is really a pointer to the original record, it would be helpful if httpUri didn't use it but was always, consistently pointing to the image. Perhaps the DOI could be stored as a supplemental column.
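A rough sketch of how DOI-style httpUri values could be separated from direct image links on the consumer side, using Python's standard urllib.parse. The function names are hypothetical, and treating only the doi.org and dx.doi.org hosts as DOI resolvers is an assumption; the thread itself only shows dx.doi.org.

```python
from urllib.parse import urlparse

# Assumption: DOI-style values always go through one of these resolver hosts.
DOI_HOSTS = {"doi.org", "dx.doi.org"}


def is_doi_uri(http_uri: str) -> bool:
    """True if the URI points at a DOI resolver rather than an image."""
    host = (urlparse(http_uri).hostname or "").lower()
    return host in DOI_HOSTS


def split_http_uri(http_uri: str) -> tuple:
    """Return (image_uri, doi); exactly one of the two is None."""
    if is_doi_uri(http_uri):
        return None, urlparse(http_uri).path.lstrip("/")
    return http_uri, None


print(split_http_uri("http://dx.doi.org/10.5281/zenodo.35441"))
print(split_http_uri("https://binary.pensoft.net/fig/574423"))
```

The DOI part can then go into a supplemental column, while rows without an image URI are flagged as unusable in an <img> tag.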
Well, this simply looks like an error to me ... what I described earlier is how the data should be ... this duplicate case here is simply wrong. Looking at the upload date (2015-12-13), it's a very early IMF, and also a very early Zenodo export (the five-digit deposition number 35441 says it all).
I suggest you simply go with what I described above, as such errors will be sorted out eventually.
The Zenodo httpUris are from our IMFs, which we upload ourselves, whereas the Pensoft httpUris come in with the TaxPub, and we never even touch the figure. The httpUris that are set to DOIs shouldn't be that way ... another case of a pending cleanup.
If the httpUri doesn't work for you as a primary key due to occasional collisions, maybe concatenate the document UUID and the captionStartId ... or simply handle the problem and report the collision.
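Both options can be sketched in a few lines of Python. The record layout (dicts with documentId, captionStartId, and httpUri keys) and the function names are hypothetical, and "DOC-UUID" stands in for a document UUID that the thread does not give; the captionStartId and httpUri values are taken from the Umma gumma example above.

```python
from collections import defaultdict


def figure_key(citation: dict) -> str:
    """Composite key: document UUID plus captionStartId."""
    return f'{citation["documentId"]}:{citation["captionStartId"]}'


def find_httpuri_collisions(citations: list) -> dict:
    """Map each httpUri shared by more than one caption to its figure keys."""
    seen = defaultdict(set)
    for cit in citations:
        seen[cit["httpUri"]].add(figure_key(cit))
    return {uri: sorted(keys) for uri, keys in seen.items() if len(keys) > 1}


# "DOC-UUID" is a placeholder, not a real identifier.
citations = [
    {"documentId": "DOC-UUID", "captionStartId": "15.[751,820,1352,1377]",
     "httpUri": "http://dx.doi.org/10.5281/zenodo.35441"},
    {"documentId": "DOC-UUID", "captionStartId": "16.[159,225,1096,1120]",
     "httpUri": "http://dx.doi.org/10.5281/zenodo.35441"},
]

collisions = find_httpuri_collisions(citations)
for uri, keys in collisions.items():
    print(f"collision on {uri}: {keys}")
```

Keying figures by the composite value keeps the two captions apart, while the collision report gives something concrete to send back for cleanup.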
Yet better, the IMF the treatment with the collision comes from isn't even on our server ... most likely Donat simply uploaded the XML ... very early days of working with GGI.
Consequently, the Zenodo uploads were most likely done manually as well, also manually adding the attributes ... in such cases, all bets are off regarding uniqueness of identifying attributes, or DOIs in an httpUri attribute ... sorry, that's simply legacy data we have to deal with ...
Just did a count: I have 94 records with httpUri pointing to DOIs.
Well, I bet none of those were uploaded after mid 2016 ... this confirms my suspicion:
The above said, we have no figures that were uploaded by the TB server through the regular channels that would have a DOI in their HTTP URI, and in extrapolation, no such figure citations, either. Conversely, the ones giving you grief are all from other sources, coming in as XML, in which case I cannot really guarantee consistency of the attributes, uniqueness, etc. ... doing our best to get this right for the Pensoft imports (also done by the TB server, if in a different group of components), but outside that, we just have to deal with the occasional inconsistency or collision ... I've come to simply consider this a fact of our lives as large scale literature data aggregators.
I continue to struggle with understanding how figures are tagged in TB. Consider the following record. The "Figures 6 - 13" appear 14 times in this XML. However, when I look at them, some of them appear to be identical and yet have two different ids. For example, the first two instances are (indented for clarity)
As you can see above, the id is different and the innerText is different, but all other attributes are exactly the same, referring to the same images. What is the implication of having different id for the figureCitations? With respect to the parent treatment, should these be considered as two different figureCitations pointing to the exact same figure with all the attributes identical?
From a db perspective, I currently have a treatment with a zero-many relationship to figureCitations. An additional wrinkle (that we kinda resolved earlier) is that the same figureCitation id can appear more than once in a treatment, and in that case, we disambiguate them with httpUri_<n> where n counts up from 0. To resolve that, I added a unique constraint of id plus this <n>, but now I may have to rethink the design.
So, to summarize, as things stand, we have a treatment that can have many figureCitations with the same id but pointing to different httpUris and can also have many figureCitations with different id but pointing to the same httpUri. I would like to understand what this really means from a data-integrity point of view.
@gsautter @myrmoteras @mguidoti @tcatapano