plazi / treatmentBank

Repository devoted to house keeping of treatmentBank

duplicate figureCitations but with different ids #36

Open punkish opened 2 years ago

punkish commented 2 years ago

I continue to struggle with understanding how figures are tagged in TB. Consider the following record . The "Figures 6 - 13" appear 14 times in this XML. However, when I look at them, some of them appear to be identical and yet have two different ids. For example, the first two instances are (indented for clarity)

<figureCitation 
    id="7D36D3D708931916C92DD6CC90AEA971" 
    captionStart="Figures 6–13" 
    captionStartId="F3" 
    captionText="Figures 6 - 13. Saigona baiseensis Zheng & Chen, sp. nov. 6 forewing 7 hindwing 8 genitalia, lateral view 9 pygofer and gonostyli, ventral view 10 pygofer and anal tube, dorsal view 11 aedeagus, lateral view 12 aedeagus, ventral view 13 aedeagus, dorsal view. Scale bars: 2 mm (6 - 10), 0.5 mm (11 - 13)." 
    figureDoi="10.3897/zookeys.1054.67004.figures6-13" 
    httpUri="https://binary.pensoft.net/fig/574424" 
    pageId="0" 
    pageNumber="185">, 6-13</figureCitation>

<figureCitation 
    id="AA129F4139163FE09785719DEFBA3687" 
    captionStart="Figures 6–13" 
    captionStartId="F3" 
    captionText="Figures 6 - 13. Saigona baiseensis Zheng & Chen, sp. nov. 6 forewing 7 hindwing 8 genitalia, lateral view 9 pygofer and gonostyli, ventral view 10 pygofer and anal tube, dorsal view 11 aedeagus, lateral view 12 aedeagus, ventral view 13 aedeagus, dorsal view. Scale bars: 2 mm (6 - 10), 0.5 mm (11 - 13)." 
    figureDoi="10.3897/zookeys.1054.67004.figures6-13" 
    httpUri="https://binary.pensoft.net/fig/574424" 
    pageId="0" 
    pageNumber="185">12</figureCitation>

as you can see above, the id is different and the innerText is different, but all other attributes are exactly the same, referring to the same images. What is the implication of having different id for the figureCitations? With respect to the parent treatment, should these be considered as two different figureCitations pointing to the exact same figure with all the attributes identical?

From a db perspective, I currently have a treatment with a zero-many relationship to figureCitations. An additional wrinkle (that we kinda resolved earlier) is that the same figureCitation id can appear more than once in a treatment, and in that case, we disambiguate them with httpUri_<n> where n counts up from 0. To resolve that, I added a unique constraint of id plus this <n>, but now I may have to rethink the design.
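The constraint described above could be sketched roughly as follows. This is only an illustration, assuming SQLite and one row per `httpUri_<n>` of a citation; the table and column names (`figureCitations`, `figureNum`, and so on) are hypothetical and not taken from the actual database.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical figureCitations table."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE figureCitations (
            treatmentId      TEXT NOT NULL,
            figureCitationId TEXT NOT NULL,     -- the XML id attribute
            figureNum        INTEGER NOT NULL,  -- the <n> in httpUri_<n>
            httpUri          TEXT,
            UNIQUE (figureCitationId, figureNum)
        )
    """)
    return db


def add_citation(db, treatment_id, citation_id, http_uris):
    """Store one row per httpUri of a citation, numbering them from 0."""
    rows = [(treatment_id, citation_id, n, uri) for n, uri in enumerate(http_uris)]
    db.executemany("INSERT INTO figureCitations VALUES (?, ?, ?, ?)", rows)
```

With this shape, a second insert of the same `(figureCitationId, figureNum)` pair raises `sqlite3.IntegrityError`, while nothing prevents two different ids from sharing one `httpUri`.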

So, to summarize, as things stand, we have a treatment that can have many figureCitations with the same id but pointing to different httpUris and can also have many figureCitations with different id but pointing to the same httpUri. I would like to understand what this really means from a data-integrity point of view.

@gsautter @myrmoteras @mguidoti @tcatapano

gsautter commented 2 years ago

as you can see above, the id is different and the innerText is different, but all other attributes are exactly the same, referring to the same images. What is the implication of having different id for the figureCitations? With respect to the parent treatment, should these be considered as two different figureCitations pointing to the exact same figure with all the attributes identical?

The id is the XML element ID of the figureCitation element proper, whereas the httpUri, captionStartId, captionStart, captionText, etc. are copied over from the caption of the cited figure (to make each individual figure citation actionable in HTML, for instance).

Hope that clarifies how things are intended to work ... multiple figureCitations with the same id attribute sounds curious to me, though, as that kind of thing should not happen ... can you provide me with an example?

punkish commented 2 years ago

as you can see above, the id is different and the innerText is different, but all other attributes are exactly the same, referring to the same images. What is the implication of having different id for the figureCitations? With respect to the parent treatment, should these be considered as two different figureCitations pointing to the exact same figure with all the attributes identical?

The id is the XML element ID of the figureCitation element proper, whereas the httpUri, captionStartId, captionStart, captionText, etc. are copied over from the caption of the cited figure (to make each individual figure citation actionable in HTML, for instance).

Hope that clarifies how things are intended to work ...

Unfortunately not yet… I don't completely understand the following statement

The id is the XML element ID of the figureCitation element proper

does that mean it is the id of the figureCitation? if not, what is the correct id to use to identify a unique figureCitation? For example, in the above cited treatment, since the same captionStart="Figures 6–13" appears 14 times, what exactly is the unique figureCitation in this case? This would help me as I would like to identify and store each figureCitation as an atomic element in my db.

gsautter commented 2 years ago

does that mean it is the id of the figureCitation? if not, what is the correct id to use to identify a unique figureCitation? For example, in the above cited treatment, since the same captionStart="Figures 6–13" appears 14 times, what exactly is the unique figureCitation in this case? This would help me as I would like to identify and store each figureCitation as an atomic element in my db.

Indeed, id is the ID of the figureCitation, and thus should uniquely identify a figure citation within a treatment (and parent article, for that matter). captionStart is mainly for display purposes (the "View " link, e.g. "View Figures 6-13"), as is captionText, which becomes the tooltip/title (basically the hover text) of the link. captionStartId is what logically links a figureCitation to the cited caption, referencing the startId attribute of the latter.

That said, if you want each cited figure exactly once per treatment, captionStartId is what to deduplicate by ... in fact, that's how the TreatmentBank HTML pages do it when generating the "Figures" preview area.
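Deduplicating by captionStartId could look roughly like the sketch below. It is not the TreatmentBank implementation, only an illustration using Python's standard `xml.etree.ElementTree`; the function name `cited_figures` and the sample ids and URIs are made up, and only the attribute names come from the XML quoted in this thread.

```python
import xml.etree.ElementTree as ET


def cited_figures(treatment_xml: str) -> list[dict]:
    """Return one record per distinct captionStartId, in document order."""
    root = ET.fromstring(treatment_xml)
    figures: dict[str, dict] = {}
    for citation in root.iter("figureCitation"):
        key = citation.get("captionStartId")
        if key is None or key in figures:
            continue  # no caption link, or this figure was already seen
        figures[key] = {
            "captionStartId": key,
            "captionStart": citation.get("captionStart"),
            "httpUri": citation.get("httpUri"),
        }
    return list(figures.values())


# Shortened, made-up ids and URIs; only the attribute names come from the thread.
SAMPLE = """<treatment>
  <figureCitation id="A1" captionStartId="F3" captionStart="Figures 6-13"
      httpUri="https://example.org/fig/2">, 6-13</figureCitation>
  <figureCitation id="A2" captionStartId="F3" captionStart="Figures 6-13"
      httpUri="https://example.org/fig/2">12</figureCitation>
  <figureCitation id="A3" captionStartId="F2" captionStart="Figures 2-5"
      httpUri="https://example.org/fig/1">3</figureCitation>
</treatment>"""

for figure in cited_figures(SAMPLE):
    print(figure["captionStartId"], figure["httpUri"])
```

The three citations in the sample collapse to two figures, F3 and F2, because the first two share a captionStartId.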

punkish commented 2 years ago

ok, so to summarize,

every figureCitation (with a different id) is a unique figureCitation, but its uniqueness doesn't stem from the httpUri that it points to (perhaps it stems from the innerText) as different figureCitations could point to the exact same figure and captionText, etc.

am I correct in my understanding?

gsautter commented 2 years ago

Your understanding is by and large correct, yes.

Only that the uniqueness of each individual figure citation stems from its being its own entity, its own annotation in the overall treatment text, which wouldn't be the same if you omitted either one of the figure citations. Technically, in IMFs the ID actually stems from the position of the figure citation in the page, the page ID, and the UUID of the parent article, and that applies to all annotations / XML elements.

punkish commented 2 years ago

hi @gsautter, I am revisiting this issue because I am convinced I have to rethink the design of my figureCitations table… your input would be very helpful.

Currently, I create and populate a table using the XML attributes. I ignore the inner text of the XML tags except for when I store the fulltext of the treatment. Given the example below from treatment https://tb.plazi.org/GgServer/xml/000040332F2853C295734E7BD4190F05 (I have reproduced only two figureCitations from this treatment but there are more identical ones)

<figureCitation 
    id="C1F77C1A81523CAB218B4F57FAE1C18B" 
    captionStart="Figures 2–5" 
    captionStartId="F2" 
    captionText="Figures 2 - 5. Saigona baiseensis Zheng & Chen sp. nov. 2 male, holotype, dorsal view 3 male, head and thorax, dorsal view 4 male, head, frons and clypeus, lateral view 5 male, head and pronotum, lateral view. Scale bars: 2 mm (2 - 5)." 
    figureDoi="10.3897/zookeys.1054.67004.figures2-5" 
    httpUri="https://binary.pensoft.net/fig/574423" 
    pageId="0" 
    pageNumber="185">2-5</figureCitation>
) longer than pronotum and mesonotum combined (1.45:1). Vertex (Fig.
<figureCitation 
    id="B09623422FE1A7E92015BCFC1B22562E" 
    captionStart="Figures 2–5" 
    captionStartId="F2" 
    captionText="Figures 2 - 5. Saigona baiseensis Zheng & Chen sp. nov. 2 male, holotype, dorsal view 3 male, head and thorax, dorsal view 4 male, head, frons and clypeus, lateral view 5 male, head and pronotum, lateral view. Scale bars: 2 mm (2 - 5)." 
    figureDoi="10.3897/zookeys.1054.67004.figures2-5" 
    httpUri="https://binary.pensoft.net/fig/574423" 
    pageId="0" 
    pageNumber="185">3</figureCitation>

Given the XML above, my figureCitations table ends up with duplicate records where the only difference is the figureCitationId. I am thinking that I should perhaps also store the innerText in order to make these records really distinct. Otherwise I don't really see how this information could be useful for any analysis. What do you think? Or, could you add the innerText also as an attribute of the figureCitation tag?
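Capturing the innerText on the consuming side, without any change to the XML, might look like the sketch below. It assumes Python's standard `xml.etree.ElementTree`; the helper names `inner_text` and `citation_rows` and the sample ids are hypothetical.

```python
import xml.etree.ElementTree as ET


def inner_text(elem: ET.Element) -> str:
    """Concatenate all text inside an element, collapsing runs of whitespace."""
    return " ".join("".join(elem.itertext()).split())


def citation_rows(treatment_xml: str) -> list[tuple]:
    """Return (id, httpUri, innerText) for every figureCitation element."""
    root = ET.fromstring(treatment_xml)
    return [
        (fc.get("id"), fc.get("httpUri"), inner_text(fc))
        for fc in root.iter("figureCitation")
    ]


# Made-up ids and URI; the nesting of <number> mirrors XML seen in TB records.
SAMPLE = """<paragraph>Vertex (Figs
  <figureCitation id="C1" httpUri="https://example.org/fig/1">2-5</figureCitation>
  ) longer than pronotum (
  <figureCitation id="C2" httpUri="https://example.org/fig/1">
    Fig.
    <number value="1.0">1</number>
  </figureCitation>)
</paragraph>"""

for row in citation_rows(SAMPLE):
    print(row)
```

`itertext()` also picks up text from nested elements such as `<number>`, so a citation whose text is split across child tags still yields a single string.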

Further, if I create another table with distinct httpUris, I will have a one-to-many relationship between httpUri and figureCitationId. In a case like above, I have one httpUri and several related figureCitations with different Ids but otherwise everything else being identical. Could I also have related figureCitations where some attribute, for example, captionText, is different. How do you suggest I handle the data in such cases?

Thanks

gsautter commented 2 years ago

The id of an annotation identifies the annotation proper ... if you only want the cited figures, not the individual citations, I'd go for the httpUri attribute (the figureDoi might not always be present, as some source documents, especially in TaxPub, might come with the former, but without the latter).

gsautter commented 2 years ago

Could I also have related figureCitations where some attribute, for example, captionText, is different. How do you suggest I handle the data in such cases?

No ... the captionText always is the same for the citation of the same figure, as is the httpUri, as well as the captionStart and captionStartId (the latter is the "foreign key" joining figure citations to figures, specifically the captions associated with them).

punkish commented 2 years ago

consider the following https://tb.plazi.org/GgServer/xml/03A25264CA15FFFAEF37FB524211FE1F where the httpUri is the same but the captionText is different (as are the captionStartId and captionStart)

<figureCitation id="1330FFF7CA16FFE5EC2AFECF434CFE8F" box="[520,585,290,317]" captionStart="Figure 1" captionStartId="15.[751,820,1352,1377]" captionTargetBox="[159,708,1067,1550]" captionTargetId="figure@15.[156,716,1049,1550]" captionTargetPageId="15" captionText="Figure 1. Male cerci in dorsal view (left) and paraprocts in ventral view with dotted outline of cerci (right) of Umma gumma sp. nov. and U. longistigma." httpUri="http://dx.doi.org/10.5281/zenodo.35441" pageId="15" pageNumber="460">
Fig.
<number id="3DC3F155CA16FFE5EC1EFECF434CFE8F" box="[572,585,290,317]" pageId="15" pageNumber="460" value="1.0">1</number>
</figureCitation>

and

<figureCitation id="1330FFF7CA15FFE6EFF4FAE64331FA94" box="[470,564,1291,1318]" captionStart="Photo 1" captionStartId="16.[159,225,1096,1120]" captionTargetBox="[159,1035,459,1055]" captionTargetId="figure@16.[159,1038,459,1058]" captionTargetPageId="16" captionText="Photo 1. Umma gumma, male; Moyabi, Gabon. Photo: JK (24 - ix- 2013)" httpUri="http://dx.doi.org/10.5281/zenodo.35441" pageId="12" pageNumber="457">
(Type Photo
<number id="3DC3F155CA15FFE6EC06FAE64331FA94" box="[548,564,1291,1318]" pageId="12" pageNumber="457" value="1.0">1</number>
</figureCitation>

which brings me to this specific treatment which seems to be identical to https://tb.plazi.org/GgServer/html/03A25264CA08FFFCEEC8FEA94509FE3A except for that Umma Gumma bit

punkish commented 2 years ago

The id of an annotation identifies the annotation proper ... if you only want the cited figures, not the individual citations, I'd go for the httpUri attribute (the figureDoi might not always be present, as some source documents, especially in TaxPub, might come with the former, but without the latter).

I will keep the figureCitations table but build another derived table with httpUri as the primary key. Which brings me to the httpUri itself. Mostly it points to a figure on Zenodo. In some cases it points to a figure on Pensoft. But, sometimes it is a DOI. That is problematic for an application like Ocellus as the DOI can't be used in an <img> tag. Since a DOI is really a pointer to the original record, it would be helpful if httpUri didn't use it but was always, consistently pointing to the image. Perhaps the DOI could be stored as a supplemental column.
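A derived table keyed on httpUri could be built along these lines. This is a sketch under assumptions, not the actual schema: it uses SQLite, and the table and column names (`figures`, `numCitations`, `figureCitationId`) as well as the demo rows are hypothetical.

```python
import sqlite3


def build_figures_table(db: sqlite3.Connection) -> None:
    """Derive one row per distinct httpUri from figureCitations."""
    db.execute("""
        CREATE TABLE figures (
            httpUri      TEXT PRIMARY KEY,
            captionText  TEXT,
            numCitations INTEGER NOT NULL
        )
    """)
    # MIN(captionText) picks one caption per httpUri; if captions disagree
    # for the same httpUri, this silently hides the disagreement.
    db.execute("""
        INSERT INTO figures (httpUri, captionText, numCitations)
        SELECT httpUri, MIN(captionText), COUNT(*)
        FROM figureCitations
        WHERE httpUri IS NOT NULL
        GROUP BY httpUri
    """)


def demo_db() -> sqlite3.Connection:
    """Small in-memory figureCitations table with made-up rows."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE figureCitations (
            figureCitationId TEXT PRIMARY KEY,
            httpUri          TEXT,
            captionText      TEXT
        )
    """)
    db.executemany(
        "INSERT INTO figureCitations VALUES (?, ?, ?)",
        [
            ("C1", "https://example.org/fig/1", "Figures 2 - 5."),
            ("C2", "https://example.org/fig/1", "Figures 2 - 5."),
            ("C3", "https://example.org/fig/2", "Figures 6 - 13."),
        ],
    )
    return db
```

Grouping by httpUri keeps the one-to-many link back to figureCitations intact, since each citation row still carries the httpUri it points to.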

gsautter commented 2 years ago

consider the following https://tb.plazi.org/GgServer/xml/03A25264CA15FFFAEF37FB524211FE1F where the httpUri is the same but the captionText is different (as are the captionStartId and captionStart)

Well, this simply looks like an error to me ... what I described earlier is how the data should be ... this duplicate case here is simply wrong. Looking at the upload date (2015-12-13), it's a very early IMF, and also a very early Zenodo export (the five digit deposition number 35441 says it all).

I suggest you simply go with what I described above, as such errors will be sorted out eventually.

gsautter commented 2 years ago

I will keep the figureCitations table but build another derived table with httpUri as the primary key. Which brings me to the httpUri itself. Mostly it points to a figure on Zenodo. In some cases it points to a figure on Pensoft. But, sometimes it is a DOI. That is problematic for an application like Ocellus as the DOI can't be used in an <img> tag. Since a DOI is really a pointer to the original record, it would be helpful if httpUri didn't use it but was always, consistently pointing to the image. Perhaps the DOI could be stored as a supplemental column.

The Zenodo httpUris are from our IMFs, which we upload ourselves, whereas the Pensoft httpUris come in with the TaxPub, and we never even touch the figure. The httpUris that are set to DOIs shouldn't be that way ... another case of a pending cleanup.

If the httpUri doesn't work for you as a primary key due to occasional collisions, maybe concatenate the document UUID and the captionStartId ... or simply handle the problem and report the collision.
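Both suggestions, the concatenated key and the collision report, can be sketched in a few lines. The names `figure_key` and `find_collisions`, the `documentUuid` field, the `:` separator, and the `DOC-1` value are assumptions for illustration; the httpUri and captionStartId values are the ones from the colliding treatment quoted earlier in this thread.

```python
from collections import defaultdict


def figure_key(document_uuid: str, caption_start_id: str) -> str:
    """Build a figure key from the document UUID and the captionStartId."""
    return f"{document_uuid}:{caption_start_id}"


def find_collisions(citations: list[dict]) -> dict[str, list[str]]:
    """Return each httpUri that is shared by more than one distinct figure key."""
    keys_by_uri = defaultdict(set)
    for c in citations:
        key = figure_key(c["documentUuid"], c["captionStartId"])
        keys_by_uri[c["httpUri"]].add(key)
    return {uri: sorted(keys) for uri, keys in keys_by_uri.items() if len(keys) > 1}


# "DOC-1" is a placeholder document UUID, not a real one.
CITATIONS = [
    {"documentUuid": "DOC-1", "captionStartId": "15.[751,820,1352,1377]",
     "httpUri": "http://dx.doi.org/10.5281/zenodo.35441"},
    {"documentUuid": "DOC-1", "captionStartId": "16.[159,225,1096,1120]",
     "httpUri": "http://dx.doi.org/10.5281/zenodo.35441"},
]

for uri, keys in find_collisions(CITATIONS).items():
    print(uri, "is shared by", keys)
```

Citations of the same figure produce the same key and are therefore not reported; only an httpUri reached from two different captions shows up as a collision.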

gsautter commented 2 years ago

Yet better, the IMF the treatment with the collision comes from isn't even on our server ... most likely Donat simply uploaded the XML ... very early days of working with GGI. Consequently, the Zenodo uploads were most likely done manually as well, also manually adding the attributes ... in such cases, all bets are off regarding uniqueness of identifying attributes, or DOIs in an httpUri attribute ... sorry, that's simply legacy data we have to deal with ...

punkish commented 2 years ago

The httpUris that are set to DOIs shouldn't be that way ... another case of a pending cleanup.

just did a count, I have 94 records with httpUri pointing to DOIs
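A count like this could be reproduced with a small check on the host of each httpUri. The sketch assumes that DOI-valued httpUris always appear as resolver URLs on `doi.org` or `dx.doi.org`, as in the example earlier in this thread; the function names are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: DOIs in httpUri appear as resolver URLs on these hosts.
DOI_HOSTS = {"doi.org", "dx.doi.org"}


def is_doi_uri(http_uri: str) -> bool:
    """True if the URI points at a DOI resolver rather than at an image."""
    host = (urlparse(http_uri).hostname or "").lower()
    return host in DOI_HOSTS


def count_doi_uris(uris) -> int:
    """Count how many of the given httpUris are DOI resolver URLs."""
    return sum(1 for uri in uris if is_doi_uri(uri))


URIS = [
    "http://dx.doi.org/10.5281/zenodo.35441",
    "https://binary.pensoft.net/fig/574424",
]
print(count_doi_uris(URIS))
```

A bare DOI without a resolver host (for example a string starting with `10.`) would not be caught by this check and would need a separate rule.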

gsautter commented 2 years ago

just did a count, I have 94 records with httpUri pointing to DOIs

Well, I bet none of those were uploaded after mid 2016 ... this confirms my suspicion: [image]

gsautter commented 2 years ago

The above said, we have no figures that were uploaded by the TB server through the regular channels that would have a DOI in their HTTP URI, and in extrapolation, no such figure citations, either. Conversely, the ones giving you grief are all from other sources, coming in as XML, in which case I cannot really guarantee consistency of the attributes, uniqueness, etc. ... doing our best to get this right for the Pensoft imports (also done by the TB server, if in a different group of components), but outside that, we just have to deal with the occasional inconsistency or collision ... I've come to simply consider this a fact of our lives as large scale literature data aggregators.