plazi / treatmentBank

Repository devoted to house keeping of treatmentBank

possibly malformed figureCitation markup #17

Open punkish opened 2 years ago

punkish commented 2 years ago

@gsautter @myrmoteras

from treatment '0610DE634A625801C11B0C8AD642A55A..xml'

<paragraph id="5AB328B10D98652FB32CAA37477A8FF0" createTime="1425473558192" createUser="donat" createVersion="1" pageId="1" pageNumber="170" type="mainText" updateTime="1425473558192" updateUser="donat" updateVersion="1">
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" createTime="1425478130016" createUser="donat" createVersion="3" httpUri-0="http://dx.doi.org/10.5281/zenodo.15820" httpUri-1="http://dx.doi.org/10.5281/zenodo.15819" httpUri-2="http://dx.doi.org/10.5281/zenodo.15821" httpUri-3="http://dx.doi.org/10.5281/zenodo.15823" updateTime="1425479072016" updateUser="donat" updateVersion="6"><updateHistory>
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" createTime="1425478130016" createUser="donat" createVersion="3" updateTime="1425478130016" updateUser="donat" updateVersion="3"/>
</updateHistory>Figures 94-97</figureCitation>
</paragraph>

the first figureCitation tag contains another figureCitation tag with the same ID (with updateHistory). How is something like this to be parsed? I am looking for tags with a certain name, and my parser has no way of knowing that the internal figureCitation tag is not really a real tag but instead is just an update time stamp. Perhaps it would be better to use a different tag for updateHistory, perhaps figureCitationUpdateHistory? (also, in my worldview, no two tags would have the same UUID)

myrmoteras commented 2 years ago

I can't help. the XML here https://tb.plazi.org/GgServer/xml/0610DE634A625801C11B0C8AD642A55A does not show this issue:

<paragraph id="5AB328B10D98652FB32CAA37477A8FF0" pageId="1" pageNumber="170" type="mainText">
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" httpUri-0="http://dx.doi.org/10.5281/zenodo.15820" httpUri-1="http://dx.doi.org/10.5281/zenodo.15819" httpUri-2="http://dx.doi.org/10.5281/zenodo.15821" httpUri-3="http://dx.doi.org/10.5281/zenodo.15823">Figures 94-97</figureCitation>
</paragraph>

punkish commented 2 years ago

I can't help. the XML here https://tb.plazi.org/GgServer/xml/0610DE634A625801C11B0C8AD642A55A does not show this issue:

that is because my XML is coming from the historized xml zip archive

punkish commented 2 years ago

I can't help. the XML here https://tb.plazi.org/GgServer/xml/0610DE634A625801C11B0C8AD642A55A does not show this issue:

<paragraph id="5AB328B10D98652FB32CAA37477A8FF0" pageId="1" pageNumber="170" type="mainText">
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" httpUri-0="http://dx.doi.org/10.5281/zenodo.15820" httpUri-1="http://dx.doi.org/10.5281/zenodo.15819" httpUri-2="http://dx.doi.org/10.5281/zenodo.15821" httpUri-3="http://dx.doi.org/10.5281/zenodo.15823">Figures 94-97</figureCitation>
</paragraph>

also, is it ok that the figureCitation has four httpUris but no corresponding captionText?

gsautter commented 2 years ago

also, is it ok that the figureCitation has four httpUris but no corresponding captionText?

Well, the article was processed and uploaded in 2015 as XML, and back then the attributes weren't as sophisticated yet ... if @myrmoteras still has the IMF somewhere, we might be able to upgrade the attributes to how we currently mark them.

gsautter commented 2 years ago

the first figureCitation tag contains another figureCitation tag with the same ID (with updateHistory). How is something like this to be parsed? I am looking for tags with a certain name, and my parser has no way of knowing that the internal figureCitation tag is not really a real tag but instead is just an update time stamp. Perhaps it would be better to use a different tag for updateHistory, perhaps figureCitationUpdateHistory? (also, in my worldview, no two tags would have the same UUID)

If all you want is the current text, simply ignore the updateHistory element and everything inside it ... the updateHistory element holds previous versions of the same annotation start tag (hence the same ID) to facilitate tracking attribute changes.
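
A minimal sketch of the advice above, using Python's standard `xml.etree.ElementTree`. The sample string is a shortened copy of the markup quoted in this issue, and the helper names `current_elements` and `current_text` are made up for illustration; this is not the parser used by anyone in this thread. One detail worth noting: in the quoted markup the visible text ("Figures 94-97") comes after the closing `updateHistory` tag, so ElementTree stores it as the tail of the history element, and it has to be kept when the history itself is skipped.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the historized markup quoted in this issue.
SAMPLE = """<paragraph id="5AB328B10D98652FB32CAA37477A8FF0">
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" updateVersion="6"><updateHistory>
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" updateVersion="3"/>
</updateHistory>Figures 94-97</figureCitation>
</paragraph>"""


def current_elements(root, tag, skip="updateHistory"):
    """Yield elements named `tag` in document order, never entering `skip` subtrees."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag == skip:
            continue  # drop the history element and everything inside it
        if node.tag == tag:
            yield node
        stack.extend(reversed(list(node)))  # reversed so pop() keeps document order


def current_text(elem, skip="updateHistory"):
    """Text of `elem` without history content, keeping the text that follows it."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            parts.append(current_text(child, skip))
        parts.append(child.tail or "")  # tail text belongs to the parent element
    return "".join(parts)


root = ET.fromstring(SAMPLE)
citations = list(current_elements(root, "figureCitation"))
```

With the sample above, `citations` holds only the outer element (the one with `updateVersion="6"`), and `current_text(citations[0]).strip()` gives `Figures 94-97`. A plain `root.iter("figureCitation")` would return both elements, which is the behaviour described in the original report.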

punkish commented 2 years ago

If all you want is the current text, simply ignore the updateHistory element and everything inside it ... the updateHistory element holds previous versions of the same annotation start tag (hence the same ID) to facilitate tracking attribute changes.

I've figured out a way around the problem, but it is not as simple as "ignore the updateHistory element". The reason is that my XML parser collects all the tags for a supplied string. So when I give it "figureCitation", it collects all the <figureCitation…> tags no matter where they are. To figure out whether an element in the collection is inside or outside a certain other element, I have to implement another check, which is costlier in terms of performance and complexity. Anyway, as I said, I have implemented a workaround.

The other, potentially bigger, issue is that having two tags with the same UUID just seems wrong. UUID stands for universally unique identifier, yet here the ID is not unique even within the same file, let alone universally.

Anyway, it is up to you, but my suggestion would be to implement a different mechanism for tracking update history.

Thanks.

myrmoteras commented 2 years ago

this is a one-off; we might process it again. I assume we have a template for the Bulletin. https://zenodo.org/record/15815#.YcniqmjML8A

Originally, this has been a demo, where I added the figures manually. https://zenodo.org/record/15815#.YcniqmjML8A

@punkish how often have you seen a nested figure citation? a unique case or not?

Reprocessing is probably a much more time-consuming process than writing some code, which you seem to have done already

punkish commented 2 years ago

I caught about half a dozen (or maybe eight) errors, but I haven't kept track of which ones they were. Also, I don't know whether they came from the same doc or from multiple docs

myrmoteras commented 2 years ago

half a dozen in comparison to 700,000? that hardly seems worth the effort beyond looking into disambiguating tags?!

punkish commented 2 years ago

whether or not it is important to care about is your call; I am not the expert here. My concern is parsing the docs correctly, inserting the data in the db, and pointing out errors in the source data when I encounter them. Also, note that I killed my process when I encountered the error… it was around the 40% mark (245K files out of 600K). I am now rerunning the process with the workaround built in, so I no longer see when this occurs

myrmoteras commented 2 years ago

Why not have a process that is not killed by one single data point that is not conformant? Process all, create a list of errors, then find out which are common errors and fix them? we do not have the resources to fix all single input issues.

This is a process we successfully apply for the taxpub output we create https://github.com/plazi/ggxml2taxpub-treatments.
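
The "process all, create a list of errors" approach described above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the function names `process_all` and `require_figure_text`, the three tiny sample documents, and the check itself. It is not taken from the ggxml2taxpub-treatments code.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def process_all(documents, handle):
    """Run `handle` on every parsed document; record failures and keep going."""
    errors = []  # (name, error class, message) tuples for later review
    done = 0
    for name, xml_text in documents:
        try:
            handle(ET.fromstring(xml_text))
            done += 1
        except Exception as exc:  # deliberately broad: one bad file must not stop the run
            errors.append((name, type(exc).__name__, str(exc)))
    return done, errors


def require_figure_text(root):
    """Hypothetical check: every figureCitation must carry some text."""
    for fc in root.iter("figureCitation"):
        if not "".join(fc.itertext()).strip():
            raise ValueError("empty figureCitation " + fc.get("id", "?"))


# Made-up inputs: one valid, one failing the check, one not well-formed.
docs = [
    ("ok.xml", '<p><figureCitation id="A">Figures 94-97</figureCitation></p>'),
    ("empty.xml", '<p><figureCitation id="B"/></p>'),
    ("broken.xml", "<p><figureCitation></p>"),
]
done, errors = process_all(docs, require_figure_text)
by_kind = Counter(kind for _, kind, _ in errors)  # which error classes are common
```

Here one document is processed and two are recorded in `errors` (a `ValueError` and a `ParseError`), so the run finishes and the tally in `by_kind` shows which kinds of problem are frequent enough to be worth fixing at the source.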

gsautter commented 2 years ago

how often have you seen a nested figure citation? a unique case or not?

In the historized XML, this happens everywhere a figure citation changes after it is added, so basically in every document that has a figure on Zenodo ... the updateHistory element exists specifically to bundle the previous versions of the start tag and facilitate tracking attribute changes. As an audit log, the historized XML might be bulky to process, but is the only way of displaying the whole version history of a treatment at a glance and inside a single XML.

punkish commented 2 years ago

Why not have a process that is not killed by one single data point that is not conformant? Process all, create a list of errors, then find out which are common errors and fix them? we do not have the resources to fix all single input issues.

the short answer is that I am not as good a programmer as you probably think I am.

The longer answer is that if I know in advance how the data are then I can write a program to work around it even if the data are wrong. But if I don't know, I can't magically write a program to be resilient against all possible errors. I had absolutely no way of knowing that a tag would have another identical tag inside it with the same UUID to boot. Now that I know, I have written a way around it.

gsautter commented 2 years ago

@punkish if you process the XML like the tree it conceptually is (basically a DOM tree, just like any HTML page), you can simply navigate via the axes and ignore updateHistory and everything below that.

And if you by all means need to stick with regular expression patterns, just use the first start tag that comes up with any given ID instead of the last one.
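
For the regular-expression route, a rough Python sketch of "use the first start tag for any given ID" follows. The pattern and the function name `first_start_tags` are assumptions for illustration, the sample is a shortened copy of the markup quoted in this issue, and the pattern assumes that no attribute value contains a `>` character and that `id` is written with double quotes.

```python
import re

# Shortened copy of the historized markup quoted in this issue.
SAMPLE = """<paragraph id="5AB328B10D98652FB32CAA37477A8FF0">
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" updateVersion="6"><updateHistory>
<figureCitation id="BF8A576EC3F6661E96B5590C108213BA" updateVersion="3"/>
</updateHistory>Figures 94-97</figureCitation>
</paragraph>"""

# Matches a figureCitation start tag (or empty-element tag) and captures its id.
START_TAG = re.compile(r'<figureCitation\b[^>]*?\sid="([^"]+)"[^>]*>')


def first_start_tags(xml_text):
    """Map each figureCitation id to the first start tag that carries it."""
    first = {}
    for match in START_TAG.finditer(xml_text):
        # setdefault keeps the first tag seen; later repeats of the id are ignored
        first.setdefault(match.group(1), match.group(0))
    return first


tags = first_start_tags(SAMPLE)
```

For the sample, `tags` has a single entry whose start tag carries `updateVersion="6"`; the older copy inside `updateHistory` is dropped because it appears later in the text under the same ID.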