plazi / arcadia-project


vocabulary: xpath and discovering data #52

Open myrmoteras opened 5 years ago

myrmoteras commented 5 years ago

@punkish might one reason that you seem to find less data in TB XML documents than Terry and Guido be that you make use only of parent-child relationships and not anything further downstream, that is, all the descendants?

punkish commented 5 years ago

Could you please give me an example that illustrates what you are saying? I am not sure I understand fully what you mean by “parent-child” and “descendants.” I simply look for the tags and attributes that you have prescribed. If they are there, my program finds them.

cc @mguidoti

gsautter commented 5 years ago

Take the following XML snippet:

<treatment>
  <subSubSection type="nomenclature">
    <paragraph>
      <heading>
        <taxonomicName>Aus bus</taxonomicName>
      </heading>
    </paragraph>
  </subSubSection>
  ...
</treatment>

Now you can walk through all the elements down the XML tree, finding the taxonomicName as a child of heading (which in turn is a child of paragraph, etc.), which is what I believe you're doing, treating the XML DOM tree pretty much like a JavaScript object graph.

The XML / XPath way of doing this would be to go for any taxonomicName descendant of the "nomenclature" subSubSection, expressed simply as //treatment//subSubSection[./@type = 'nomenclature']//taxonomicName, "jumping over" any elements that might be in between subSubSection and taxonomicName.

With the latter, intermediate elements don't make any difference and don't get in the way. You just have to use XPath instead of JavaScript object graph navigation.
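To make the contrast concrete, here are the two styles for the snippet above (a sketch; each is a standalone XPath 2.0 / XQuery expression):

(: child axis: every intermediate element must be named, and an extra wrapper breaks the path :)
/treatment/subSubSection[@type = 'nomenclature']/paragraph/heading/taxonomicName

(: descendant axis: whatever sits between subSubSection and taxonomicName is skipped :)
/treatment//subSubSection[@type = 'nomenclature']//taxonomicName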

punkish commented 5 years ago

yeah, that is not an issue. I can look for immediate children or for any descendants of a node. JavaScript selectors are very flexible.

I look for subSubSection[type=nomenclature] taxonomicName, which happily finds the tag. If there is a particular example from @myrmoteras showing where @gsautter and @tcatapano have found more stuff than I have, I would love to look at it.
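For the record, that selector maps to the same descendant-axis XPath Guido gave:

(: the CSS-style selector "subSubSection[type=nomenclature] taxonomicName" in XPath terms :)
//subSubSection[@type = 'nomenclature']//taxonomicName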

Look, it is entirely possible that there is some error in my programming. However, keep in mind that I am finding a lot of data, just not all of the data. If there were an error such as wrong tag/attribute coding, the program would fail for all instances, but that is not what is happening. Of course, I will go over the program again to make sure, but I suspect that my problem really is that I am simply not looking for all the spelling variations that may have crept in over time.

tcatapano commented 5 years ago

If traversal of descendant nodes is not an issue, then I do not see much of a problem for the immediate task of writing a transform based on the GGXML treatment document extract.

As shown by the xpaths I've begun adding to the Data Dictionary spreadsheet (https://docs.google.com/spreadsheets/d/10uluNbkcu0CfNRog_uOnx_6ytXUSvdH6gx9xltGPqlk/edit?usp=sharing), the variation in type values for subSubSection elements, as well as the variation in elements within the paths to target elements, does not come into play. In English, the task is:

  1. at the document node get the values of a few known attributes, processing a few conditionally (e.g. for the output pubdate field)
  2. go to the descendant treatment element in the document element (if there are more than two treatment elements, stop and log an error)
  3. go to the first descendant taxonomicName in the first descendant subSubSection of type nomenclature (if there is none, stop and log an error)
  4. get the values from a few of the taxonomicName's attributes
  5. go to each materialsCitation element in the present treatment element
  6. get the values from a few of the materialsCitation attributes etc...
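
Sketched in XQuery, with illustrative names (the 'plazi' database and the rank/specimenCount attributes are placeholders; only docId appears in the actual script further below):

(: one pipe-delimited output row per document :)
for $doc in collection('plazi')/document
let $tmts := $doc//treatment
return
  (: step 2: expect a single treatment; adjust the threshold as the task specifies :)
  if (count($tmts) ne 1) then
    'ERROR|unexpected treatment count|' || $doc/@docId
  else
    (: step 3: first taxonomicName in the first "nomenclature" subSubSection :)
    let $taxon := (($tmts[1]//subSubSection[@type = 'nomenclature'])[1]//taxonomicName)[1]
    return
      if (empty($taxon)) then
        'ERROR|no taxonomicName in nomenclature|' || $doc/@docId
      else
        string-join((
          $doc/@docId,                              (: step 1: document attributes :)
          $taxon/@rank,                             (: step 4: taxonomicName attributes :)
          for $mc in $tmts[1]//materialsCitation    (: step 5 :)
          return string($mc/@specimenCount)         (: step 6: materialsCitation attributes :)
        ), '|')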

Doing this will result in many fields with missing data, mainly because the data simply is not present in the original document or because it hasn't been marked up in the GGXML. Both cases are out of scope for the present task. There may be missing data because a significant element name, attribute name, or attribute value has been spelled incorrectly or varies; these can and should be fixed, but the analysis we've already performed shows that the frequency of this is very low. There may also be missing data resulting from variations in markup, but the best course of action now is probably to start with the XPaths that we feel should yield "all" the data, go ahead with developing the transformation of the GGXML extract to the target data dictionary form, and iteratively evaluate the results and refine as we proceed. We've probably done a sufficient amount of a priori specification.

mguidoti commented 5 years ago

Terry, I noticed you're using the outdated version of the data dictionary. I sent you guys an e-mail with a link to an updated version, which already attempts to express the paths in XML lingo (not perfect xpaths, though). I did this to help you guys understand what is inside Zenodeo without having to understand the node.js paths. This means that the differences you might find between the old data dictionary and your xpaths might already have been fixed. Here's the link again: https://docs.google.com/spreadsheets/d/1m_iTigaD5GDfJGloJa5yJV4-W0qb1dKI9yCfvpWecJA/edit?usp=sharing

In my opinion, the current data dictionary is consistently missing the authors of the treatment citations. That's one problem to solve in terms of xpaths, and perhaps the only one. Remember: Puneet didn't come up with the paths all alone; Donat and I were looking at the XMLs to build the initial version of the data dictionary.

Also, Puneet and I have the advantage of actually looking at the output of Zenodeo's data extraction, and I can say that we're finding lots of things, which means the paths used in the current data dictionary work. We can't say that the paths in the updated data dictionary are completely wrong. The problem is that, for treatments that come up without treatment citations or material citations (for example), we can't say whether they lack this information or whether it is stored in a different path for historical reasons. I'll try to explain myself again on this one.

Consider this hypothetical case: maybe in 2017 you guys used a different path to store the data related to references (or any other data, like material citations) in the xml. If this has happened to any data that we're extracting (again, please refer to the updated version of the data dictionary), then we need to know all the alternative paths. That's the question we've been asking from the beginning: have paths different from the ones observed in the updated data dictionary ever been in place?

If the paths described in the updated version of the data dictionary are the same ones used since the beginning, and the only existing variations are misspellings like the ones caught in our reports, then we can say that the treatments missing data are actually missing those things. That's justified both by the frequency (very low) and by the origin (perhaps Pensoft xmls) of the variation observed in those reports, and by confirmation that alternative paths don't exist (the question we've been asking).
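One way to probe this empirically would be to compare totals found anywhere against totals found at the documented path; a sketch in XQuery (the 'plazi' collection name is just an example):

(: any positive difference points at elements living outside the documented path :)
let $all        := count(collection('plazi')//materialsCitation)
let $documented := count(collection('plazi')//treatment//materialsCitation)
return 'materialsCitation outside a treatment: ' || ($all - $documented)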

When we (or at least I) talk about data normalization, I'm trying to say: ok, let's agree on one set of paths and put a version number on it, so we know, 100% sure, that from now on we'll always be looking in the right places.

That's all. If there are alternative xpaths, we need to know. If there aren't, Zenodeo is already catching all that is out there to catch (except maybe the treatment citation authors, in my opinion), and we can say this because we look at the output of the Zenodeo extraction and can see that it works for lots of treatments.

tcatapano commented 5 years ago

Agreed. We are trying to confirm the locations of the information corresponding to Zenodo data elements and their paths in the source GGXML.

I can confirm that, as far as I understand, the general strategy I outlined above is sound: basically, go to a few key elements (document, treatment, 1st taxonomicName inside of 1st nomenclature, materialsCitation, etc...) and gather the values of a few expected attributes on those elements. I am still looking into bibrefCitation, figureCitation, and treatmentCitation, but the same general strategy should apply.

As for specific xpath locations: the ones I gave in the data dictionary document I cited should be the ones to use (with the proviso that they might need tweaking). @gsautter and @myrmoteras should say if they know of other locations. I was wondering why there were two data dictionaries; because a) Donat referred to the previous one and b) I could not edit or comment on the new one, I added the specific xpaths to the old one. I will do the same for the new one.

punkish commented 5 years ago

The ones I gave in the data dictionary document I cited should be the ones to use (with the proviso that they might need tweaking).

yes, indeed… the all-important proviso that they might need tweaking. That is exactly what I need to know – all the tweakz. Once I have them, I am good to go.

So, as soon as you are done, @tcatapano, and as soon as you have done the tweaking, @gsautter and @myrmoteras, could you please ping @mguidoti and me so we can proceed with our data extraction?

Muito obrigado (many thanks), as they whisper in the bolarias of Porto Alegre.

tcatapano commented 5 years ago

BTW it is not that all the xpaths will need tweaking; only a very few of them are likely to. Besides, the tweaking is something best done after the proposed xpaths are implemented. Could you go ahead and implement what I will shortly put into the new data dictionary, and be prepared to revise and refine as we encounter and resolve the inevitable errors? But ultimately I leave it to you...

punkish commented 5 years ago

BTW it is not that all the xpaths will need tweaking; only a very few of them are likely to. Besides, the tweaking is something best done after the proposed xpaths are implemented. Could you go ahead and implement what I will shortly put into the new data dictionary, and be prepared to revise and refine as we encounter and resolve the inevitable errors? But ultimately I leave it to you

Yes, of course. I understand that only a few of the paths will (or may) need tweaking. But even if only one path needs tweaking, I need it tweaked. @myrmoteras and @gsautter (and you) have the institutional memory to help us sniff those buggers out.

Happy to do a trial run with what you have (which is actually very much what I already have). Keep in mind, though, that running data extraction over 300K files is not super-enjoyable, so I'd rather do it as few times as possible and move on to more fun and interesting tasks, such as building innovative info-retrieval programs and interfaces.

Onward and forward…

tcatapano commented 5 years ago

Why not run the next few iterations over a sample? I've been using a 20%/60k sample. Not much to be gained by developing over the full dataset at this point.

punkish commented 5 years ago

Why not run the next few iterations over a sample? I've been using a 20%/60k sample. Not much to be gained by developing over the full dataset at this point.

well, two reasons:

  1. If I run my program over a sample, I may never hit the problematic XMLs, so I will never discover them. All this angst (at least on my part) is to ensure that I am getting out all the data that I should, not just all the data that I can.

  2. My intention goes a bit beyond merely testing the capability of my extraction program. I am also writing APIs for querying the info and building the Ocellus interface for displaying the results. The more data I have, the more varied and fun things I can do.

Just let me know when you are done with your pass at the XML paths. Then I will have at it.

myrmoteras commented 5 years ago

On the TreatmentBank side, almost all the data is accessible and discoverable through the Stats pages (http://plazi.org/api-tools/statistics/), e.g. the treatment statistics or article statistics. All the sources can be referenced by adding the document UUID or article UUID to the search. If you can do the same in the search you are running, then you can compare the results using the UUID, and we would most likely know immediately whether you get the same results or not, and if something is missing, what exactly is missing. We (you, Guido) can then find out why it is missing.

Doable?

This seems to be a really powerful tool to get rid of our guessing about whether we might be missing something.

punkish commented 5 years ago

ok, so given that I have been making all these reports for you guys, it is now my turn to ask for a report from you. Could you please send me a simple spreadsheet listing the total number of treatments, materialCitations, treatmentCitations, treatmentAuthors, figureCitations, and bibRefCitations? This would be only one column with 6 rows.

And, if you feel adventurous enough to create a more advanced report, you could send me a not-so-simple spreadsheet (or CSV) listing all the treatments and, for each treatment, the total number of materialCitations, treatmentCitations, treatmentAuthors, figureCitations, and bibRefCitations. This would be about 300K rows for the treatments, with 5 columns for the related attributes.
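For what it's worth, the simple version amounts to a handful of counts; a sketch in XQuery over the dump (the 'plazi' database name is illustrative, and treatment authors are omitted because they live in the mods header rather than in a single element):

(: one total per output row :)
let $c := collection('plazi')
for $name in ('treatment', 'materialsCitation', 'treatmentCitation',
              'figureCitation', 'bibRefCitation')
return $name || ': ' || count($c//*[local-name() = $name])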

Many thanks in advance.

myrmoteras commented 5 years ago

The more data I have, the more varied and fun things I can do.

@punkish please understand: first, there is nothing out there comparable to what we have to offer, so any step we take is relevant, and it is indeed important that we start putting it out and being visible. Second, we look at possibly 0.1% of all the treatments out there, so even our sample might have to change over time, because we discover new things and find better ways to tag the data. Third, we can change tags and re-tag the whole corpus; with that, with analyses like yours, and with collaboration with GBIF and others, we find reasons to change, to improve, to correct data. We are grateful for this, but also aware that every day we ask ourselves whether we can do better.

Look at GBIF: they use our data, and through this we get input NOW from publishers asking for particular changes. GBIF has become aware that our materialsCitations are not really their understanding of an occurrence, to which MC is currently mapped. In this case, they decided to create a MaterialsCitation element.

The cool thing in this world is that we are on the brink of discovering what's in 500,000,000 pages; we can find out how the thinking in this domain is structured.

So, what is relevant to spur this fledgling, or not yet even existing, drive is to show what we do. Even if we are aware of issues, have suspicions, and are fixing issues, I think this should not stop us from working on applications using this data, such as building the BLR website, until we come up with standards that, as pointed out above, exist only in a loose sense, apart from some elements that come up all the time.

myrmoteras commented 5 years ago

send me a simple spreadsheet listing the total number of treatments, materialCitations, treatmentCitations, treatmentAuthors, figureCitations, and bibRefCitations. This would be only one column with 6 rows.

@punkish you can create all the reports using the stats pages. Please learn to use them; this is basic to understanding the system.

http://tb.plazi.org/GgServer/srsStats


if it doesn't work, because this might be a pretty large result, then talk to @gsautter so it can be generated

http://tb.plazi.org/GgServer/dioStats/stats?outputFields=cont.treatCount+cont.matCitCount+cont.matCitCountHttpUri+cont.bibRefCount&format=HTML

and if there is something that you need and do not get, please write a feature request in https://github.com/plazi/Plazi-Communications/issues

stats (1).zip

tcatapano commented 5 years ago

And, if you feel adventurous to create a more advanced report, you could send me a not-so-simple spreadsheet (or CSV) listing all the treatments, and for each treatment, the total number of materialCitations, treatmentCitations, treatmentAuthors, figureCitations, and bibRefCitations. This would be about 300K rows for the treatments, and 5 columns for each of the related attributes.

@punkish See

https://github.com/tcatapano/ggxml-to-treatment-data/blob/master/reports/tmt-counts-full.csv

for a table with the following counts for each (well-formed) GGXML treatment document (c. 291500) in the plazi xml dump:

the columns are the following:

tmt_UUID | authors | materialsCitations | treatmentCitations | figureCitations | bibRefCitations | preceding-treatments

The data is also available in an xlsx file:

https://github.com/tcatapano/ggxml-to-treatment-data/blob/master/reports/tmt-counts-full.xlsx

Some summary stats:

per treatment

| stat | authors | materialsCitations | treatmentCitations | figureCitations | bibRefCitations |
| --- | --- | --- | --- | --- | --- |
| min | 1 | 0 | 0 | 0 | 0 |
| max | 112 | 490 | 1351 | 630 | 1960 |
| median | 2 | 0 | 0 | 0 | 1 |
| mean | 6.0906 | 0.8626 | 0.6824 | 2.4808 | 3.5337 |
| stdDev | 18.5723 | 4.7292 | 6.7633 | 7.1286 | 13.7843 |
| kurtosis | 27.9123 | 1353.7611 | 12582.7627 | 347.1731 | 5149.1692 |

total counts of treatments with values for fields:

| stat | authors | materialsCitations | treatmentCitations | figureCitations | bibRefCitations |
| --- | --- | --- | --- | --- | --- |
| non-null | 291535 | 52638 | 47001 | 87681 | 162034 |
| null | 0 | 238897 | 244534 | 203854 | 129501 |
| pct non-null | 100.00% | 18.06% | 16.12% | 30.08% | 55.58% |
| pct null | 0.00% | 81.94% | 83.88% | 69.92% | 44.42% |

the report was generated by this xquery script:

https://github.com/tcatapano/ggxml-to-treatment-data/blob/master/lib/counts_by_treatment.xq

I will be continuing to work in that repository to develop tools for extraction and analysis of the treatment GGXML to the Zenodo Treatment Data Dictionary. You should all have been invited as collaborators.

I am using the BaseX XML database (http://basex.org/) to perform the querying, both on a local MacBook Air (for development) and on a Digital Ocean droplet (for the full dataset).

punkish commented 5 years ago

thanks @tcatapano. Now @mguidoti and I are looking through your result and will get back to you if we have questions. For now, given the following:

| stat | authors | materialsCitations | treatmentCitations | figureCitations | bibRefCitations |
| --- | --- | --- | --- | --- | --- |
| non-null | 291535 | 52638 | 47001 | 87681 | 162034 |

Does the above mean that there are only 52K materialsCitations? That would be weird, because I was able to extract 129275, and I thought I wasn't getting enough. On the other hand, maybe the above means that there are only 52K treatments with materialsCitations. In that case, I'd also like to know how many total materialsCitations there are per your XML definitions.

cc @gsautter @myrmoteras

tcatapano commented 5 years ago

@punkish: no, these figures are per treatment; that is, there are 52638 treatments (18.06% of all treatments) which have any materialsCitation elements.

According to the stats page @myrmoteras referred to above, there are 221735 materialsCitations across the entire corpus, but roughly 18% of the treatments account for all of them.
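
The distinction in query terms (a sketch, run against the dump):

(: treatments containing at least one materialsCitation vs. materialsCitation elements in total :)
let $with  := count(//treatment[.//materialsCitation])
let $total := count(//treatment//materialsCitation)
return 'treatments with: ' || $with || ' | total elements: ' || $total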

tcatapano commented 5 years ago

I'd also like to know how many total materialsCitations are there per your XML definitions.

That would be the sums of each column in the table:

| authors | materialsCitations | treatmentCitations | figureCitations | bibRefCitations |
| --- | --- | --- | --- | --- |
| 1775629 | 231339 | 210565 | 714932 | 1069277 |

my script first finds all descendant treatments and then, for each one, counts the relevant descendant elements, including materialsCitations:

https://github.com/tcatapano/ggxml-to-treatment-data/blob/master/lib/counts_by_treatment.xq


for $tmt in //treatment
return concat(
  $tmt/ancestor::document/@docId, '|',
  (: authors, counted from the mods header of the containing document :)
  count($tmt/ancestor::document/*:mods//*:name[*:role/*:roleTerm = 'Author']), '|',
  (: materialsCitations anywhere inside the treatment :)
  count($tmt//materialsCitation), '|',
  etc...
)

I'll look again to see if I can account for the discrepancy. There are a couple of minor revisions I want to make anyway...

tcatapano commented 5 years ago

One bit of noise that concerns me is cases in which the significant elements are nested inside an element of the same name, e.g. a materialsCitation which is the descendant of a materialsCitation. It doesn't happen too much, but it does happen:

//materialsCitation//materialsCitation: 820
//treatmentCitation//treatmentCitation: 180
//bibRefCitation//bibRefCitation: 933
//figureCitation//figureCitation: 75

and I already knew about the phenomenon of treatments inside of treatments:

//treatment//treatment: 266

Files containing this pattern should not be processed and should be logged as errors. I am not sure what to do about the other nested elements; probably the entire file should also be skipped. I'll add these patterns to the Schematron schema I'm developing (https://github.com/tcatapano/ggxml-to-treatment-data/blob/master/lib/ggxml4zdd.sch).
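
Pending those Schematron rules, a sketch of the same check in XQuery, for listing the affected files up front (the 'plazi' collection name is illustrative):

(: report every document that nests a significant element inside one of the same name :)
for $doc in collection('plazi')/document
where some $n in ('treatment', 'materialsCitation', 'treatmentCitation',
                  'bibRefCitation', 'figureCitation')
      satisfies exists($doc//*[local-name() = $n]//*[local-name() = $n])
return base-uri($doc)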

punkish commented 5 years ago

I'd also like to know how many total materialsCitations are there per your XML definitions.

That would be the sums of each column in the table:

| authors | materialsCitations | treatmentCitations | figureCitations | bibRefCitations |
| --- | --- | --- | --- | --- |
| 1775629 | 231339 | 210565 | 714932 | 1069277 |

oh boy, my database is certainly way off then. Here is what I have (mind you, this is from ~245K treatments, from about 6 months ago):

treatmentAuthors: 1131467
materialCitations: 129275
treatmentCitations: 0
figureCitations: 635532
bibrefCitations: 802930

tcatapano commented 5 years ago

FWIW, my source data was a copy of http://tb.plazi.org/GgServer/dumps/plazi.xml.zip that I made yesterday. It's about 291K treatments. The expected difference would be about 15%, so, with the exception of figureCitations, the numbers are pretty far off.

mguidoti commented 5 years ago

Puneet asked me to add the link to the stats comparison spreadsheet here.

So, here it is.