neuropoly / bibeasy

Set of tools to manage academic bibliography
Apache License 2.0

Add a feature to update the XML file based on the CSV information #10

Open jcohenadad opened 3 years ago

jcohenadad commented 3 years ago

Currently, when there is a mismatch, fields are manually updated on the CCV website.

Example of output:

```console
julien-macbook:~/Desktop $ bibeasy -x CCV-98720.xml -c publications_article.csv -t article
Reading input file: 'publications_article.csv'...
Publication type: 'article'
GSHEET J1 CCV J153 Activation detection in diffuse optical imaging by means of the general linear model
  Mismatched fields: Journal/Conference
GSHEET J2 CCV J150 Development of clinical diffusion tensor MRI of the spinal cord in a context of spinal cord injury
GSHEET J3 CCV J151 In vivo DTI of the healthy and injured cat spinal cord at high spatial and angular resolution
  Mismatched fields: Authors, Journal/Conference
GSHEET J4 CCV J152 Detection of multiple pathways in the spinal cord using q-ball imaging
  Mismatched fields: Authors, Journal/Conference
GSHEET J5 CCV J148 Investigations on spinal cord fMRI of cats under ketamine
  Mismatched fields: Authors, Journal/Conference
GSHEET J6 CCV J149 Characterization of cardiac-related noise in fMRI of the cervical spinal cord
  Mismatched fields: Authors
GSHEET J7 CCV J145 BOLD signal responses to controlled hypercapnia in human spinal cord
  Mismatched fields: Authors, Journal/Conference
```

What would be useful is if bibeasy could update the XML, so that I would then import the XML into CCV to update the publication records.

kousu commented 3 years ago

After looking at this for a couple of hours, the biggest stumbling block is that the XML file gets converted to a DataFrame:

https://github.com/jcohenadad/bibeasy/blob/eed3d61fa46aac29f2355cac5cd63d4a604d411d/bibeasy/scripts/bibeasy_cli.py#L125-L126

and all other processing is done in that format:

https://github.com/jcohenadad/bibeasy/blob/eed3d61fa46aac29f2355cac5cd63d4a604d411d/bibeasy/utils.py#L167-L170

This makes working backwards tricky, because you can't directly look at the relevant publication in the XML when you find a mistake. In fact the XML object has been thrown away by that point.

In https://github.com/jcohenadad/bibeasy/pull/11/commits/f261436463334efc02644c977a952098c7575a2e I make the DataFrame retain the IDs used in the XML, so at least there's a chance of going from one to the other.

As a note to self, some avenues that could be explored:

  1. Maybe I can call find_matching_ref() but then feed its results into a new function that re-parses the XML file a second time and fixes
  2. Move the XML parsing from xml_to_df into find_matching_ref; pandas is being used as a query language there currently but it should be equally easy to use XPath there. Maybe even clearer? XPath is not super clear but it's one less abstraction to work with. And from there maybe find_matching_ref can be given an extra flag, or maybe split into subroutines, so that one path does XML editing and the other
  3. ?
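
For avenue 2, here is a minimal sketch of what querying the XML directly could look like, using only the limited XPath subset in Python's `xml.etree.ElementTree`. The `set_field` helper and the tiny `SAMPLE` document are hypothetical stand-ins, not bibeasy code; the ids are the ones that appear in CCV exports later in this thread.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a CCV export; a real file has many more fields and a namespaced root.
SAMPLE = """
<cv>
  <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="2ce60c2265954dec9b010d918605eabd">
    <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
      <value type="String">Old title</value>
    </field>
  </section>
</cv>
"""

def set_field(root, record_id, label, new_text):
    """Hypothetical helper: overwrite one String field of the publication with this recordId."""
    section = root.find(f".//section[@recordId='{record_id}']")
    if section is None:
        raise KeyError(record_id)
    value = section.find(f"./field[@label='{label}']/value")
    if value is None:
        raise KeyError(label)
    value.text = new_text

root = ET.fromstring(SAMPLE)
set_field(root, "2ce60c2265954dec9b010d918605eabd", "Article Title", "New title")
print(ET.tostring(root, encoding="unicode"))
```

Because the element is edited in place, writing the tree back out would keep every untouched field exactly as it was read.
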
kousu commented 3 years ago

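Some questions:

  1. is -xd ever used? Can I remove it?
  2. is --to-gsheet ever used? Can I remove it?
  3. is anything besides scripts/bibeasy_cli.py ever used? Can I remove the rest of the scripts?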

(I could do this feature without touching those features but it would be easier if I had a freer hand to rearrange the existing code)

jcohenadad commented 3 years ago

> This makes working backwards tricky, because you can't directly look at the relevant publication in the XML when you find a mistake. In fact the XML object has been thrown away by that point.

I know, this is a bit of a pain, which is why i reached out to a smart software engineer 😅

> is -xd ever used? Can I remove it?

yes, it could be used. Typical scenario: i write a grant on sept 2021 and i format all references using CCV's ref. On March 2022 I write another grant and I want to reuse the text from sept 2021: I will need to update the references by matching CCV's sept 2021 with CCV's march 2022.

> is --to-gsheet ever used? Can I remove it?

yes, it could also be used, although it is more rare. Scenario: the grant i wrote in sept 2021 using CCV's refs should now be exported using CSV's gsheet references system in order to "standardize" it for future use.

> is anything besides scripts/bibeasy_cli.py ever used? Can I remove the rest of the scripts?

hum, i'm not sure, i would have to dig but i don't have the time right now. I know we had some scripts to format the references into a dokuwiki to have them display on our website, but we are not on dokuwiki anymore. Also, since i refactored bibeasy_cli, it should now be able to output a formatted document for our new markdown website. So i think it is safe to remove, but again, i'm not 100% sure... @alexfoias can you pls chip in?

alexfoias commented 3 years ago

I was using bibeasy -o wiki.txt --reverse before for outputting the dokuwiki format. I didn't check what format we use in the gitbook right now (I assume it is markdown?).

Ideally we would use bibeasy directly on the gitbook to fetch the latest version of publications.

kousu commented 3 years ago

Some reverse engineering:

I made a test account on https://ccv-cvc.ca/ and stripped down @jcohenadad's CV to a sample of 3 publications and used it to generate a sample PDF CV. The PDFs it outputs have publications numbered, and I've convinced myself from this that these numbers are ordered not by the order in the XML file, but by the order of recordId; internally I am pretty sure they're keeping all the CVs in a big SQL database, where XML is meaningless.

With recordId in ['2ce60c2265954dec9b010d918605eabd', '38d1815fcf6143919814e40f1ce76b92', '5a66bf5d63cd4c37a848d3df602832d6'] in CCV-10206959-with-publications.xml.txt

I get this: CCV-TestyTesterton.pdf

which also displays in the same order in their UI:

Screenshot 2021-10-13 at 18-27-21 Welcome to the Canadian Common CV

but with editing the XML so that I have recordId in ['2ce60c2265954dec9b010d918605eabd', '4a66bf5d63cd4c37a848d3df602832d6', '38d1815fcf6143919814e40f1ce76b92'] in CCV-10206959-with-publications.xml.txt

I get CCV-TestyTesterton-6.pdf

Screenshot 2021-10-13 at 18-51-34 Welcome to the Canadian Common CV

and if you re-export the XML you get CCV-10206959-with-publications-reexported.pretty.xml.txt sorted as ['2ce60c2265954dec9b010d918605eabd', '38d1815fcf6143919814e40f1ce76b92', '4a66bf5d63cd4c37a848d3df602832d6'].

So what does this tell me? It tells me that the "CCV IDs" that bibeasy talks about are not incorrect but also not quite accurate. The CCV ID is actually this 128-bit, apparently random (and user-choosable), number like 2ce60c2265954dec9b010d918605eabd, but when printed it gets mapped from .recordId -> i by something like

for i, publication in enumerate(sorted(publications, key=lambda publication: publication.recordId), 1):
    print(i, publication.title)

(and notice that it counts from 1).

This also means that the IDs are not fixed, but can potentially change if an XML file is uploaded that adds an ID number that happens to fall in between previous ones; or perhaps it can happen just upon adding a new entry with the UI. Which probably explains a large part of why bibeasy got written in the first place.
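
To make the renumbering concrete, here is a small sketch, assuming the sort-by-recordId rule guessed above is what CCV really does. `ccv_numbers` is a hypothetical helper and the inserted recordId is made up so that it sorts between two existing ones.

```python
def ccv_numbers(record_ids):
    """Number records 1..n by sorted recordId, as the experiment above suggests CCV does."""
    return {rid: i for i, rid in enumerate(sorted(record_ids), 1)}

before = ccv_numbers([
    "2ce60c2265954dec9b010d918605eabd",
    "38d1815fcf6143919814e40f1ce76b92",
    "5a66bf5d63cd4c37a848d3df602832d6",
])

# A made-up recordId that happens to sort between the first two.
after = ccv_numbers(list(before) + ["30000000000000000000000000000000"])

# Show how the printed number of each pre-existing record shifts.
for rid in before:
    print(rid, before[rid], "->", after[rid])
```

Every record that sorts after the new one is bumped by one, so a citation like "J2" written against the old CV would silently point at a different paper.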

kousu commented 3 years ago

Some deeper reverse engineering:

This will add 10000 Publications, titled "c00001" through "c10000"

JSESSIONID=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX # get this by logging in in your browser then looking in your Web Inspector to get your cookies

seq 10000 | while read i; do
curl 'https://ccv-cvc.ca/researcherProcManageGenericCV-eng.frm' \
 -b "ROUTEID=.1; JSESSIONID=$JSESSIONID" \
 -F submitType=gridAction \
 -F submitName=ADD \
 -F submitParameters=field9a34d6b273914f18b2273e8de7c48fd6 \
 $(seq $((2 + $i)) | while read i; do echo -F field9a34d6b273914f18b2273e8de7c48fd6-$i-0=true; done)
  # 9a34d6b273914f18b2273e8de7c48fd6 is their section ID for "Journal Articles"

curl 'https://ccv-cvc.ca/researcherProcManageGenericCV-eng.frm' \
 -b "ROUTEID=.1; JSESSIONID=$JSESSIONID" \
 -F submitType=action \
 -F submitName=Done \
 -F submitParameters= \
 -F fieldf3fd4878d47c4e83aef6959620ba4870=c$(printf "%05d" $i)

  # f3fd4878d47c4e83aef6959620ba4870 is their field ID for "Article Title"
done

And for completeness, this will delete the $Nth Publication:

curl 'https://ccv-cvc.ca/researcherProcManageGenericCV-eng.frm' \
 -b "ROUTEID=.1; JSESSIONID=$JSESSIONID" \
 -F submitType=gridRowAction \
 -F submitName=DELETE \
 -F submitParameters=field9a34d6b273914f18b2273e8de7c48fd6,$N \
 $(seq 10000 | while read i; do echo -F field9a34d6b273914f18b2273e8de7c48fd6-$i-0=true; done)

  # 9a34d6b273914f18b2273e8de7c48fd6 is their section ID for "Journal Articles"
  # the seq | is there because otherwise the Submit checkboxes all get deselected

I ran this for a while; it only got to 89, not 10000, but I think that's enough to demonstrate that indeed the CCV order is not robust:

Screenshot 2021-10-13 at 20-59-31 Welcome to the Canadian Common CV

So my conclusion from this is: you should always be using your own, robust, ID numbers, those that you keep in the gsheet, when writing anything manually, and you should be keeping those files around as source documents, clearly demarcated as different from the bibeasy post-processed version. It's unfortunate that the format of the citation keys in the gsheet and the most natural abbreviated format for CCV are both \[[JC][[:space:]]?[[:digit:]]+\]. If they were different there would be much less chance of confusion between the versions, and perhaps less temptation to throw one or the other away. And --to-gsheet and -xd are enabling bad habits.
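
For reference, a sketch of that pattern in Python, where the POSIX classes have to be spelled `\s` and `\d`. `find_keys` is a hypothetical helper, not something bibeasy defines. Note that nothing in a match says whether the key came from the gsheet or from CCV, which is exactly the confusion described above.

```python
import re

# Python's re module has no POSIX classes, so [[:space:]] and [[:digit:]] become \s and \d.
CITATION_KEY = re.compile(r"\[([JC])\s?(\d+)\]")

def find_keys(text):
    """Return (kind, number) pairs for every citation key found in text."""
    return [(kind, int(num)) for kind, num in CITATION_KEY.findall(text)]

print(find_keys("See [J1], [C 12] and [J153]; [X3] is ignored."))
```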

Also this is sort of off-topic. Sorry about that.

jcohenadad commented 3 years ago

amazing investigations @kousu !!!

> This also means that the IDs are not fixed, but can potentially change if an XML file is uploaded that adds an ID number that happens to fall in between previous ones; or perhaps it can happen just upon adding a new entry with the UI. Which probably explains a large part of why bibeasy got written in the first place.

yup!

> So my conclusion from this is: you should always be using your own, robust, ID numbers, those that you keep in the gsheet, when writing anything manually, and you should be keeping those files around as source documents, clearly demarcated different than the bibeasy post-processed version.

I also came to that realization, and this is what i've been trying to do for the past 2 years.

kousu commented 3 years ago

Now that I know about the way CCV handles recordId, I want to know if I can create new records so that we can copy unmatched gsheet records in (of which I think there's currently 137, or 135, or 131 in @jcohenadad's account, depending on how accurately you do the matching).

Do we need to pick an unused recordId? How can I do that safely?

In my test account, I exported a current copy of my records, then made it usable with prettyxml:

kousu@ail:~/src/neuropoly/bibeasy$ ./prettyxml CCV-10206959.xml > CCV-10206959.xml_ && mv CCV-10206959.xml_ CCV-10206959.xml

Here it is: CCV-10206959.xml

I added a section to it, without adding the recordId attribute; here that is: CCV-10206959-addition.xml

kousu@ail:~/src/neuropoly/bibeasy$ diff -u --color CCV-10206959.xml CCV-10206959-addition.xml
--- CCV-10206959.xml    2021-10-19 01:41:10.153386693 -0400
+++ CCV-10206959-addition.xml   2021-10-19 01:42:26.811823964 -0400
@@ -154,6 +154,71 @@
                    </bilingual>
                </field>
            </section>
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+               <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field id="5c04ea4dae464499807d0b40b4cad049" label="Journal">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field id="0a826c656ff34e579dfcbfb373771260" label="Volume">
+                   <value type="String">77</value>
+               </field>
+               <field id="cc1d9e14945b4e8496641dbe22b3448a" label="Issue">
+                   <value type="String">5</value>
+               </field>
+               <field id="00ba1799ece344dc8d0779a3f05a4df8" label="Page Range">
+                   <value type="String">1703-1713</value>
+               </field>
+               <field id="3b56e4362d6a495aa5d22a1de5914741" label="Publishing Status">
+                   <lov id="00000000000000000000000100001704">Published</lov>
+               </field>
+               <field id="6fafe258e19e49a7884428cb49d75424" label="Year">
+                   <value format="yyyy" type="Year">2021</value>
+               </field>
+               <field id="4ad593960aba4a21bf154fa8daf37f9f" label="Publisher">
+                   <value type="String"/>
+               </field>
+               <field id="4c3bc805ceaa42259f014514fc4905f8" label="Publication Location"/>
+               <field id="1167905d079c4400ae7a4a76a203a445" label="Description / Contribution Value">
+                   <value type="Bilingual"/>
+                   <bilingual>
+                       <french/>
+                       <english/>
+                   </bilingual>
+               </field>
+               <field id="478545acac5340c0a73b7e0d2a4bee06" label="URL">
+                   <value type="String">https://pubmed.ncbi.nlm.nih.gov/33775122/</value>
+               </field>
+               <field id="2089ff1a86844b6c9a10fc63469f9a9d" label="Refereed?">
+                   <lov id="00000000000000000000000000000400">Yes</lov>
+               </field>
+               <field id="51b7eaff05444990af823b9d80924f5b" label="Open Access?"/>
+               <field id="b779cc6478bd4b09b516c6d55e938583" label="Synthesis?"/>
+               <field id="289c8814fff141d89b12569d49aa2cb3" label="Contribution Role">
+                   <lov id="00000000000000000000000100002102">Co-Author</lov>
+               </field>
+               <field id="dc7922dfa04348a3a83c9afb5bbaa24a" label="Number of Contributors">
+                   <value type="Number">11</value>
+               </field>
+               <field id="bc3b428d99384b04bb749311bb804e1d" label="Authors">
+                   <value type="String">Noriega de la Colina A, Your Favourite Place, Robitaille-Grou MC, Gagnon C, Boshkovski T, Lamarre-Cliche M, Joubert S, Gauthier C, Bherer L, Cohen-Adad J, Girouard H</value>
+               </field>
+               <field id="707a6e0ca58341a5a82fb923b2842530" label="Editors">
+                   <value type="String"/>
+               </field>
+               <field id="375a0e2ea0914291b05b0529c4755aa7" label="DOI">
+                   <value type="String"/>
+               </field>
+               <field id="9afd9e28df47464faf3f9ee2c4809e25" label="Contribution Percentage"/>
+               <field id="9f2e163dfcbf4abdb73e9d5c4daf03c4" label="Description of Contribution Role">
+                   <value type="Bilingual"/>
+                   <bilingual>
+                       <french/>
+                       <english/>
+                   </bilingual>
+               </field>
+           </section>
        </section>
    </section>
 </generic-cv:generic-cv>

Then I uploaded it to the site, and redownloaded it, and cleaned it up (same as before):

kousu@ail:~/src/neuropoly/bibeasy$ ./prettyxml CCV-10206959-reexported.xml > CCV-10206959-reexported.xml_ && mv CCV-10206959-reexported.xml_ CCV-10206959-reexported.xml

Here that is: CCV-10206959-reexported.xml

kousu@ail:~/src/neuropoly/bibeasy$ diff -u --color CCV-10206959-addition.xml CCV-10206959-reexported.xml
--- CCV-10206959-addition.xml   2021-10-19 01:47:47.520030354 -0400
+++ CCV-10206959-reexported.xml 2021-10-19 01:52:17.787744264 -0400
@@ -1,5 +1,5 @@
 <?xml version="1.0" ?>
-<generic-cv:generic-cv dateTimeGenerated="2021-10-19 01:39:37" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
+<generic-cv:generic-cv dateTimeGenerated="2021-10-19 01:52:02" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
    <section id="f589cbc028c64fdaa783da01647e5e3c" label="Personal Information">
        <section id="2687e70e5d45487c93a8a02626543f64" label="Identification" recordId="801f624b32b348f0bfb8cb9514083c7d">
            <field id="ee8beaea41f049d8bcfadfbfa89ac09e" label="Title">
@@ -154,7 +154,7 @@
                    </bilingual>
                </field>
            </section>
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="b781869115894f409ad525e796a448e0">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">ADDITION</value>
                </field>

So it's pretty clear: we can trigger an INSERT by just leaving the recordId attribute out. Their database will pick one for us.
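
A minimal sketch of building such a record programmatically, assuming `xml.etree.ElementTree` and the section and field ids seen in the export above. `new_article` is a hypothetical helper and only fills in the title; a real record would carry the other fields too.

```python
import xml.etree.ElementTree as ET

JOURNAL_ARTICLES = "9a34d6b273914f18b2273e8de7c48fd6"  # section id seen in the export above
ARTICLE_TITLE = "f3fd4878d47c4e83aef6959620ba4870"     # field id seen in the export above

def new_article(title):
    """Hypothetical helper: a Journal Articles section with no recordId attribute."""
    section = ET.Element("section", {"id": JOURNAL_ARTICLES, "label": "Journal Articles"})
    field = ET.SubElement(section, "field", {"id": ARTICLE_TITLE, "label": "Article Title"})
    value = ET.SubElement(field, "value", {"type": "String"})
    value.text = title
    return section

article = new_article("ADDITION")
print(ET.tostring(article, encoding="unicode"))
```

The returned element would still need to be appended under the "Publications" section of a parsed export before uploading.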

kousu commented 3 years ago

This raises a question for me: should we be editing the publications at all? Maybe sync() should ignore the pre-existing CV and just create all new records every time? It's kind of rude since it probably leaves orphan publications sitting in their database, but would it work?
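
Stripping the attribute is mechanical. A minimal sketch, assuming `xml.etree.ElementTree`; `drop_record_ids` is a hypothetical helper, and it only touches "Journal Articles" sections so that other records (such as Identification) keep their recordId.

```python
import xml.etree.ElementTree as ET

JOURNAL_ARTICLES = "9a34d6b273914f18b2273e8de7c48fd6"  # section id for "Journal Articles"

def drop_record_ids(root, section_id=JOURNAL_ARTICLES):
    """Remove recordId from every section with the given id; return how many were changed."""
    changed = 0
    for section in root.iter("section"):
        if section.get("id") == section_id and "recordId" in section.attrib:
            del section.attrib["recordId"]
            changed += 1
    return changed

# Two-record stand-in for an export: one journal article, one Identification section.
root = ET.fromstring(
    '<cv>'
    '<section id="9a34d6b273914f18b2273e8de7c48fd6" recordId="38d1815fcf6143919814e40f1ce76b92"/>'
    '<section id="2687e70e5d45487c93a8a02626543f64" recordId="801f624b32b348f0bfb8cb9514083c7d"/>'
    '</cv>'
)
print(drop_record_ids(root))
```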

To find out, I dropped recordId from all entries, giving: CCV-10206959-step3.xml

--- CCV-10206959-reexported.xml 2021-10-19 01:52:17.787744264 -0400
+++ CCV-10206959-step3.xml  2021-10-19 01:59:12.513040443 -0400
@@ -24,7 +24,7 @@
    </section>
    <section id="047ec63e32fe450e943cb678339e8102" label="Contributions">
        <section id="46e8f57e67db48b29d84dda77cf0ef51" label="Publications">
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="38d1815fcf6143919814e40f1ce76b92">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">Cortico-spinal imaging to study pain</value>
                </field>
@@ -89,7 +89,7 @@
                    </bilingual>
                </field>
            </section>
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="4a66bf5d63cd4c37a848d3df602832d6">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">Associations between relative morning blood pressure, cerebral blood flow, and memory in older adults treated and controlled for hypertension</value>
                </field>
@@ -154,7 +154,7 @@
                    </bilingual>
                </field>
            </section>
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="b781869115894f409ad525e796a448e0">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">ADDITION</value>
                </field>

Before uploading this, the UI showed:

Screenshot 2021-10-19 at 02-00-31 Welcome to the Canadian Common CV

After uploading, the UI looks like

Screenshot 2021-10-19 at 02-05-05 Welcome to the Canadian Common CV

and the XML (after cleaning:

kousu@ail:~/src/neuropoly/bibeasy$ ./prettyxml CCV-10206959-step3-reexported.xml > CCV-10206959-step3-reexported.xml_ && mv CCV-10206959-step3-reexported.xml_ CCV-10206959-step3-reexported.xml

) now looks like CCV-10206959-step3-reexported.xml

kousu@ail:~/src/neuropoly/bibeasy$ diff -u --color CCV-10206959-step3.xml CCV-10206959-step3-reexported.xml
--- CCV-10206959-step3.xml  2021-10-19 01:59:12.513040443 -0400
+++ CCV-10206959-step3-reexported.xml   2021-10-19 02:02:41.443312315 -0400
@@ -1,5 +1,5 @@
 <?xml version="1.0" ?>
-<generic-cv:generic-cv dateTimeGenerated="2021-10-19 01:52:02" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
+<generic-cv:generic-cv dateTimeGenerated="2021-10-19 02:02:06" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
    <section id="f589cbc028c64fdaa783da01647e5e3c" label="Personal Information">
        <section id="2687e70e5d45487c93a8a02626543f64" label="Identification" recordId="801f624b32b348f0bfb8cb9514083c7d">
            <field id="ee8beaea41f049d8bcfadfbfa89ac09e" label="Title">
@@ -24,7 +24,7 @@
    </section>
    <section id="047ec63e32fe450e943cb678339e8102" label="Contributions">
        <section id="46e8f57e67db48b29d84dda77cf0ef51" label="Publications">
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="bb4549c3252f4be8b288879825bcbc39">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">Cortico-spinal imaging to study pain</value>
                </field>
@@ -89,7 +89,7 @@
                    </bilingual>
                </field>
            </section>
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="c36c62b23a3d42e6ae415331c68b7cd0">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">Associations between relative morning blood pressure, cerebral blood flow, and memory in older adults treated and controlled for hypertension</value>
                </field>
@@ -154,7 +154,7 @@
                    </bilingual>
                </field>
            </section>
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="c995b089b0534658965187202626d529">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">ADDITION</value>
                </field>

So it worked, and maybe there's no good point trying to match (#12) and patch publications, at least not when we know we can just overwrite all of them. The one weird quirk of doing this is that one "Submit" checkbox got toggled; because I guess it's on by default for new publications and I just tossed out the pre-existing rows and the XML file doesn't have a way to control that checkbox. (Matching (#12) is still necessary for filtering actual grant texts of course, just not for this feature.)

Ah but here's a good reason not to throw out the pre-existing records: it disturbs the sort order. I got lucky in my example above, but I tried uploading a second time and got this order:

Screenshot 2021-10-19 at 02-11-27 Welcome to the Canadian Common CV

and a third and got

Screenshot 2021-10-19 at 02-13-31 Welcome to the Canadian Common CV

It's not a deal breaker -- after all, the whole problem bibeasy is solving is IMO a bug in CCV: that it doesn't have a stable/controllable sort order; new items get inserted at random places instead of, say, being sorted chronologically, so every user of bibeasy has to be tolerant of changing sorts -- but this is a lot more disruptive than just renumbering, it potentially changes every relative ordering in the list on every addition.

kousu commented 3 years ago

The id attributes seem to be 1:1 with the label fields; the labels are what's documented in their official spec but from looking at the web API (e.g. above: -F submitParameters=field9a34d6b273914f18b2273e8de7c48fd6) they're using the ids internally.

Does that mean that to do "create" I need to extract the dict[Id, Label] table or, like with recordId, can I just drop id and have it figure it out?

Here is that table, or part of it anyway, extracted from a sample publication, in case we need it:

>>> p
<Element 'section' at 0x7fed6bc628b8>
>>> pprint.pprint(dict(zip([e.attrib['id'] for e in p], [e.attrib['label'] for e in p])))
{'00ba1799ece344dc8d0779a3f05a4df8': 'Page Range',
 '0a826c656ff34e579dfcbfb373771260': 'Volume',
 '1167905d079c4400ae7a4a76a203a445': 'Description / Contribution Value',
 '2089ff1a86844b6c9a10fc63469f9a9d': 'Refereed?',
 '289c8814fff141d89b12569d49aa2cb3': 'Contribution Role',
 '375a0e2ea0914291b05b0529c4755aa7': 'DOI',
 '3b56e4362d6a495aa5d22a1de5914741': 'Publishing Status',
 '478545acac5340c0a73b7e0d2a4bee06': 'URL',
 '4ad593960aba4a21bf154fa8daf37f9f': 'Publisher',
 '4c3bc805ceaa42259f014514fc4905f8': 'Publication Location',
 '51b7eaff05444990af823b9d80924f5b': 'Open Access?',
 '5c04ea4dae464499807d0b40b4cad049': 'Journal',
 '6fafe258e19e49a7884428cb49d75424': 'Year',
 '707a6e0ca58341a5a82fb923b2842530': 'Editors',
 '9afd9e28df47464faf3f9ee2c4809e25': 'Contribution Percentage',
 '9f2e163dfcbf4abdb73e9d5c4daf03c4': 'Description of Contribution Role',
 'b779cc6478bd4b09b516c6d55e938583': 'Synthesis?',
 'bc3b428d99384b04bb749311bb804e1d': 'Authors',
 'cc1d9e14945b4e8496641dbe22b3448a': 'Issue',
 'dc7922dfa04348a3a83c9afb5bbaa24a': 'Number of Contributors',
 'f3fd4878d47c4e83aef6959620ba4870': 'Article Title'}

To answer this question:

kousu@ail:~/src/neuropoly/bibeasy$ cp CCV-10206959-step3-reexported.xml CCV-10206959-create-missing-fields.xml

and edited it so that

kousu@ail:~/src/neuropoly/bibeasy$ diff -u --color CCV-10206959-step3-reexported.xml CCV-10206959-create-missing-fields.xml 
--- CCV-10206959-step3-reexported.xml   2021-10-19 02:02:41.443312315 -0400
+++ CCV-10206959-create-missing-fields.xml  2021-10-19 12:51:47.477394216 -0400
@@ -219,6 +219,21 @@
                    </bilingual>
                </field>
            </section>
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+               <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
+                   <value type="String">MISSING FIELDS</value>
+               </field>
+               <field id="1167905d079c4400ae7a4a76a203a445" label="Description / Contribution Value">
+                   <value type="Bilingual"/>
+                   <bilingual>
+                       <french/>
+                       <english/>
+                   </bilingual>
+               </field>
+               <field id="2089ff1a86844b6c9a10fc63469f9a9d" label="Refereed?">
+                   <lov id="00000000000000000000000000000400">Yes</lov>
+               </field>
+           </section>
        </section>
    </section>
 </generic-cv:generic-cv>

CCV-10206959-create-missing-fields.xml

I uploaded it and downloaded it, and got:

kousu@ail:~/src/neuropoly/bibeasy$ ./prettyxml CCV-10206959-create-missing-fields-reexported.xml > CCV-10206959-create-missing-fields-reexported.xml_ && mv CCV-10206959-create-missing-fields-reexported.xml_  CCV-10206959-create-missing-fields-reexported.xml 

CCV-10206959-create-missing-fields-reexported.xml

kousu@ail:~/src/neuropoly/bibeasy$ diff -u --color  CCV-10206959-create-missing-fields.xml  CCV-10206959-create-missing-fields-reexported.xml
--- CCV-10206959-create-missing-fields.xml  2021-10-19 12:52:28.064164492 -0400
+++ CCV-10206959-create-missing-fields-reexported.xml   2021-10-19 12:55:45.891059874 -0400
@@ -1,5 +1,5 @@
 <?xml version="1.0" ?>
-<generic-cv:generic-cv dateTimeGenerated="2021-10-19 02:02:06" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
+<generic-cv:generic-cv dateTimeGenerated="2021-10-19 12:54:45" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
    <section id="f589cbc028c64fdaa783da01647e5e3c" label="Personal Information">
        <section id="2687e70e5d45487c93a8a02626543f64" label="Identification" recordId="801f624b32b348f0bfb8cb9514083c7d">
            <field id="ee8beaea41f049d8bcfadfbfa89ac09e" label="Title">
@@ -24,6 +24,21 @@
    </section>
    <section id="047ec63e32fe450e943cb678339e8102" label="Contributions">
        <section id="46e8f57e67db48b29d84dda77cf0ef51" label="Publications">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="8e73cefdd1ee47a58563ff099d4d6958">
+               <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
+                   <value type="String">MISSING FIELDS</value>
+               </field>
+               <field id="1167905d079c4400ae7a4a76a203a445" label="Description / Contribution Value">
+                   <value type="Bilingual"/>
+                   <bilingual>
+                       <french/>
+                       <english/>
+                   </bilingual>
+               </field>
+               <field id="2089ff1a86844b6c9a10fc63469f9a9d" label="Refereed?">
+                   <lov id="00000000000000000000000000000400">Yes</lov>
+               </field>
+           </section>
            <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="bb4549c3252f4be8b288879825bcbc39">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">Cortico-spinal imaging to study pain</value>
@@ -219,21 +234,6 @@
                    </bilingual>
                </field>
            </section>
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
-               <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
-                   <value type="String">MISSING FIELDS</value>
-               </field>
-               <field id="1167905d079c4400ae7a4a76a203a445" label="Description / Contribution Value">
-                   <value type="Bilingual"/>
-                   <bilingual>
-                       <french/>
-                       <english/>
-                   </bilingual>
-               </field>
-               <field id="2089ff1a86844b6c9a10fc63469f9a9d" label="Refereed?">
-                   <lov id="00000000000000000000000000000400">Yes</lov>
-               </field>
-           </section>
        </section>
    </section>
 </generic-cv:generic-cv>

Huh, so it didn't fill in the missing fields. It did fill in the recordId as before (and, by chance, moved the publication up, which makes the diff less obvious, unfortunately).
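
Since CCV does not fill these in on import, a tool creating records might want to check for them itself. A minimal sketch, assuming `xml.etree.ElementTree`; `missing_fields` is a hypothetical helper and `KNOWN_FIELDS` is only a subset of the id -> label table extracted earlier.

```python
import xml.etree.ElementTree as ET

# Subset of the id -> label table extracted earlier in this thread.
KNOWN_FIELDS = {
    "f3fd4878d47c4e83aef6959620ba4870": "Article Title",
    "5c04ea4dae464499807d0b40b4cad049": "Journal",
    "6fafe258e19e49a7884428cb49d75424": "Year",
    "2089ff1a86844b6c9a10fc63469f9a9d": "Refereed?",
}

def missing_fields(section, known=KNOWN_FIELDS):
    """Return labels of known fields that have no <field> element in this section."""
    present = {field.get("id") for field in section.findall("field")}
    return [label for fid, label in known.items() if fid not in present]

# Stand-in for the "MISSING FIELDS" record: only a title and the Refereed? flag.
section = ET.fromstring(
    '<section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">'
    '<field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title"/>'
    '<field id="2089ff1a86844b6c9a10fc63469f9a9d" label="Refereed?"/>'
    '</section>'
)
print(missing_fields(section))
```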

In the UI, the new publication looks like

Screenshot 2021-10-19 at 13-00-45 Welcome to the Canadian Common CV

However after clicking "Done" on the UI and reexporting:

kousu@ail:~/src/neuropoly/bibeasy$ diff -u  CCV-10206959-create-missing-fields-reexported.xml  CCV-10206959-create-missing-fields-reexported2.xml
--- CCV-10206959-create-missing-fields-reexported.xml   2021-10-19 12:55:45.891059874 -0400
+++ CCV-10206959-create-missing-fields-reexported2.xml  2021-10-19 13:03:36.412812969 -0400
@@ -1,5 +1,5 @@
 <?xml version="1.0" ?>
-<generic-cv:generic-cv dateTimeGenerated="2021-10-19 12:54:45" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
+<generic-cv:generic-cv dateTimeGenerated="2021-10-19 13:02:02" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
    <section id="f589cbc028c64fdaa783da01647e5e3c" label="Personal Information">
        <section id="2687e70e5d45487c93a8a02626543f64" label="Identification" recordId="801f624b32b348f0bfb8cb9514083c7d">
            <field id="ee8beaea41f049d8bcfadfbfa89ac09e" label="Title">
@@ -28,6 +28,26 @@
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">MISSING FIELDS</value>
                </field>
+               <field id="5c04ea4dae464499807d0b40b4cad049" label="Journal">
+                   <value type="String"/>
+               </field>
+               <field id="0a826c656ff34e579dfcbfb373771260" label="Volume">
+                   <value type="String"/>
+               </field>
+               <field id="cc1d9e14945b4e8496641dbe22b3448a" label="Issue">
+                   <value type="String"/>
+               </field>
+               <field id="00ba1799ece344dc8d0779a3f05a4df8" label="Page Range">
+                   <value type="String"/>
+               </field>
+               <field id="3b56e4362d6a495aa5d22a1de5914741" label="Publishing Status"/>
+               <field id="6fafe258e19e49a7884428cb49d75424" label="Year">
+                   <value format="yyyy" type="Year"/>
+               </field>
+               <field id="4ad593960aba4a21bf154fa8daf37f9f" label="Publisher">
+                   <value type="String"/>
+               </field>
+               <field id="4c3bc805ceaa42259f014514fc4905f8" label="Publication Location"/>
                <field id="1167905d079c4400ae7a4a76a203a445" label="Description / Contribution Value">
                    <value type="Bilingual"/>
                    <bilingual>
@@ -35,9 +55,35 @@
                        <english/>
                    </bilingual>
                </field>
+               <field id="478545acac5340c0a73b7e0d2a4bee06" label="URL">
+                   <value type="String"/>
+               </field>
                <field id="2089ff1a86844b6c9a10fc63469f9a9d" label="Refereed?">
                    <lov id="00000000000000000000000000000400">Yes</lov>
                </field>
+               <field id="51b7eaff05444990af823b9d80924f5b" label="Open Access?"/>
+               <field id="b779cc6478bd4b09b516c6d55e938583" label="Synthesis?"/>
+               <field id="289c8814fff141d89b12569d49aa2cb3" label="Contribution Role"/>
+               <field id="dc7922dfa04348a3a83c9afb5bbaa24a" label="Number of Contributors">
+                   <value type="Number"/>
+               </field>
+               <field id="bc3b428d99384b04bb749311bb804e1d" label="Authors">
+                   <value type="String"/>
+               </field>
+               <field id="707a6e0ca58341a5a82fb923b2842530" label="Editors">
+                   <value type="String"/>
+               </field>
+               <field id="375a0e2ea0914291b05b0529c4755aa7" label="DOI">
+                   <value type="String"/>
+               </field>
+               <field id="9afd9e28df47464faf3f9ee2c4809e25" label="Contribution Percentage"/>
+               <field id="9f2e163dfcbf4abdb73e9d5c4daf03c4" label="Description of Contribution Role">
+                   <value type="Bilingual"/>
+                   <bilingual>
+                       <french/>
+                       <english/>
+                   </bilingual>
+               </field>
            </section>
            <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="bb4549c3252f4be8b288879825bcbc39">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">

the fields have been filled in.

So the upshot of this is: CCV is tolerant of missing data, and we only need to fill in the fields we're actually using.
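
A minimal sketch of what "only fill in the fields we're using" could look like with the standard library's `xml.etree.ElementTree`. The section and field ids are copied from the exports above; `make_journal_article` is a hypothetical helper, not something bibeasy has today, and no `recordId` is set because CCV assigned one itself on import in the experiments above.

```python
import xml.etree.ElementTree as ET

# Section and field ids copied from the CCV exports shown above.
JOURNAL_ARTICLES_ID = "9a34d6b273914f18b2273e8de7c48fd6"
FIELD_IDS = {
    "Article Title": "f3fd4878d47c4e83aef6959620ba4870",
    "Journal": "5c04ea4dae464499807d0b40b4cad049",
    "Volume": "0a826c656ff34e579dfcbfb373771260",
    "Issue": "cc1d9e14945b4e8496641dbe22b3448a",
}


def make_journal_article(values):
    """Build a <section> holding only the fields present in `values`.

    Hypothetical helper: every field is written as a String value, and
    recordId is deliberately omitted so CCV can assign it on import.
    """
    section = ET.Element("section", id=JOURNAL_ARTICLES_ID, label="Journal Articles")
    for label, text in values.items():
        field = ET.SubElement(section, "field", id=FIELD_IDS[label], label=label)
        ET.SubElement(field, "value", type="String").text = text
    return section


record = make_journal_article({"Article Title": "ADDITION", "Volume": "77"})
print(ET.tostring(record, encoding="unicode"))
```

The resulting element would still have to be appended under the "Publications" section of a full export before uploading.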

And to find out if it's tolerant of missing id attributes, I reexported the current state and edited it such that:

kousu@ail:~/src/neuropoly/bibeasy$ cp CCV-10206959.xml CCV-10206959-create-with-missing-field-ids.xml 
kousu@ail:~/src/neuropoly/bibeasy$ vi CCV-10206959-create-with-missing-field-ids.xml 
kousu@ail:~/src/neuropoly/bibeasy$ diff -u  CCV-10206959.xml CCV-10206959-create-with-missing-field-ids.xml 
--- CCV-10206959.xml    2021-10-19 14:40:21.516399254 -0400
+++ CCV-10206959-create-with-missing-field-ids.xml  2021-10-19 14:41:40.876257303 -0400
@@ -280,6 +280,20 @@
                    </bilingual>
                </field>
            </section>
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+               <field label="Article Title">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field label="Journal">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field label="Volume">
+                   <value type="String">77</value>
+               </field>
+               <field label="Issue">
+                   <value type="String">5</value>
+               </field>
+           </section>
        </section>
    </section>
 </generic-cv:generic-cv>

This file produced an error

Screenshot 2021-10-19 at 14-43-23 Welcome to the Canadian Common CV

I tried

kousu@ail:~/src/neuropoly/bibeasy$ diff -u  CCV-10206959.xml CCV-10206959-create-with-missing-field-ids.xml 
--- CCV-10206959.xml    2021-10-19 14:40:21.516399254 -0400
+++ CCV-10206959-create-with-missing-field-ids.xml  2021-10-19 14:44:44.343638379 -0400
@@ -280,6 +280,20 @@
                    </bilingual>
                </field>
            </section>
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+               <field id="707a6e0ca58341a5a82fb923b2842530" label="Article Title">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field id="375a0e2ea0914291b05b0529c4755aa7" label="Journal">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field label="Volume">
+                   <value type="String">77</value>
+               </field>
+               <field label="Issue">
+                   <value type="String">5</value>
+               </field>
+           </section>
        </section>
    </section>
 </generic-cv:generic-cv>

where I've intentionally put the id for "Editors" on the field labelled "Article Title" and the id for "DOI" on the field labelled "Journal" (and left "Volume" and "Issue" without ids).

Same error

Screenshot 2021-10-19 at 14-46-11 Welcome to the Canadian Common CV

This one:

kousu@ail:~/src/neuropoly/bibeasy$ diff -u  CCV-10206959.xml CCV-10206959-create-with-missing-field-ids.xml 
--- CCV-10206959.xml    2021-10-19 14:40:21.516399254 -0400
+++ CCV-10206959-create-with-missing-field-ids.xml  2021-10-19 14:46:37.214501693 -0400
@@ -280,6 +280,14 @@
                    </bilingual>
                </field>
            </section>
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
+               <field id="707a6e0ca58341a5a82fb923b2842530" label="Article Title">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field id="375a0e2ea0914291b05b0529c4755aa7" label="Journal">
+                   <value type="String">ADDITION</value>
+               </field>
+           </section>
        </section>
    </section>
 </generic-cv:generic-cv>

This one was accepted; and upon re-exporting:

kousu@ail:~/src/neuropoly/bibeasy$ diff -u  CCV-10206959-create-with-missing-field-ids.xml   CCV-10206959.xml
--- CCV-10206959-create-with-missing-field-ids.xml  2021-10-19 14:46:37.214501693 -0400
+++ CCV-10206959.xml    2021-10-19 14:48:27.840302582 -0400
@@ -1,5 +1,5 @@
 <?xml version="1.0" ?>
-<generic-cv:generic-cv dateTimeGenerated="2021-10-19 14:40:08" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
+<generic-cv:generic-cv dateTimeGenerated="2021-10-19 14:48:19" lang="en" xmlns:generic-cv="http://www.cihr-irsc.gc.ca/generic-cv/1.0.0">
    <section id="f589cbc028c64fdaa783da01647e5e3c" label="Personal Information">
        <section id="2687e70e5d45487c93a8a02626543f64" label="Identification" recordId="801f624b32b348f0bfb8cb9514083c7d">
            <field id="ee8beaea41f049d8bcfadfbfa89ac09e" label="Title">
@@ -24,6 +24,14 @@
    </section>
    <section id="047ec63e32fe450e943cb678339e8102" label="Contributions">
        <section id="46e8f57e67db48b29d84dda77cf0ef51" label="Publications">
+           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="550ef73512ae4ade9b5bd7b9c062b454">
+               <field id="707a6e0ca58341a5a82fb923b2842530" label="Editors">
+                   <value type="String">ADDITION</value>
+               </field>
+               <field id="375a0e2ea0914291b05b0529c4755aa7" label="DOI">
+                   <value type="String">ADDITION</value>
+               </field>
+           </section>
            <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles" recordId="8e73cefdd1ee47a58563ff099d4d6958">
                <field id="f3fd4878d47c4e83aef6959620ba4870" label="Article Title">
                    <value type="String">MISSING FIELDS</value>
@@ -280,14 +288,6 @@
                    </bilingual>
                </field>
            </section>
-           <section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
-               <field id="707a6e0ca58341a5a82fb923b2842530" label="Article Title">
-                   <value type="String">ADDITION</value>
-               </field>
-               <field id="375a0e2ea0914291b05b0529c4755aa7" label="Journal">
-                   <value type="String">ADDITION</value>
-               </field>
-           </section>
        </section>
    </section>
 </generic-cv:generic-cv>

It's gone and rewritten the label attribute. So basically the label is useless, it's just a comment, and not really part of the data.
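
Given that, any code that reads or writes these files should probably key on `id` and treat `label` as decoration. A rough sketch, assuming the two ids and labels observed in the re-export above (`fields_by_id` and `mislabelled` are made-up names for illustration):

```python
import xml.etree.ElementTree as ET

# Ids from the exports above; the labels are what CCV wrote back on re-export.
CANONICAL_LABELS = {
    "707a6e0ca58341a5a82fb923b2842530": "Editors",
    "375a0e2ea0914291b05b0529c4755aa7": "DOI",
}

# The section as it was uploaded, with deliberately wrong labels.
uploaded = ET.fromstring("""
<section id="9a34d6b273914f18b2273e8de7c48fd6" label="Journal Articles">
    <field id="707a6e0ca58341a5a82fb923b2842530" label="Article Title">
        <value type="String">ADDITION</value>
    </field>
    <field id="375a0e2ea0914291b05b0529c4755aa7" label="Journal">
        <value type="String">ADDITION</value>
    </field>
</section>
""")


def fields_by_id(section):
    """Map field id -> <field> element, ignoring the label attribute."""
    return {field.get("id"): field for field in section.findall("field")}


def mislabelled(section):
    """List (id, label found, label expected) where the label disagrees with the id."""
    return [
        (fid, field.get("label"), CANONICAL_LABELS[fid])
        for fid, field in fields_by_id(section).items()
        if fid in CANONICAL_LABELS and field.get("label") != CANONICAL_LABELS[fid]
    ]


for fid, found, expected in mislabelled(uploaded):
    print(f"{fid}: labelled {found!r}, but CCV treats it as {expected!r}")
```

A check like this before upload would catch the case where a value silently lands in the wrong field.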

btw this is super annoying isn't it?

Screenshot 2021-10-19 at 14-47-53 Welcome to the Canadian Common CV

kousu commented 3 years ago

In doing 'create' I need to understand the type schema used in the XML, since I can't just edit a pre-existing data structure.

Most <field>s contain a single child <value>, marked <value type="String">:

<field id="375a0e2ea0914291b05b0529c4755aa7" label="DOI">
    <value type="String">10.1016/j.neuroimage.2020.117439</value>
</field>

but some contain what I assume are enums:

<field id="289c8814fff141d89b12569d49aa2cb3" label="Contribution Role">
    <lov id="f6f1d34952c44f7c884ac56036b3ba4d">Last Author</lov>
</field>

numbers

<field id="dc7922dfa04348a3a83c9afb5bbaa24a" label="Number of Contributors">
    <value type="Number">3</value>
</field>

or even years

<field id="6fafe258e19e49a7884428cb49d75424" label="Year">
    <value format="yyyy" type="Year">2021</value>
</field>

This is not how XML is meant to be used. You shouldn't have to parse a 'kind' field; the type should be encoded in the tag name itself, along with a schema specifying the types of all the child nodes that implies. So e.g. there should be <doi>10.1016/j.neuroimage.2020.117439</doi> whose contents are implicitly interpreted as a string, or <year>2021</year> or <number-of-contributors>3</number-of-contributors>. The only field that seems to do something close to correct is this weird <value type="Bilingual" /> case:

<field id="1167905d079c4400ae7a4a76a203a445" label="Description / Contribution Value">
    <value type="Bilingual" />
    <bilingual>
        <french />
        <english />
    </bilingual>
</field>

I think what's going on here is it's saying this field can have two values simultaneously: a French version and an English version. There's a "Show Bilingual Fields" button on the UI which must be what lets you enter them. And here they've done the reasonable thing: the French content goes between <french>...</french> tags, not <value type="bilingual" subtype="french">...</value> tags :roll_eyes:

But I'll deal with it because we have to.
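
For 'create', the four shapes above could be wrapped in a few small constructors. This is only a sketch with `xml.etree.ElementTree`; the helper names are made up, and it assumes the shapes seen in these exports (String, Number, Year, `<lov>`, Bilingual) are the only ones we need.

```python
import xml.etree.ElementTree as ET


def _field(field_id, label):
    """Bare <field> element; the label is cosmetic, the id is what CCV reads."""
    return ET.Element("field", id=field_id, label=label)


def value_field(field_id, label, text, value_type="String", **attrs):
    """String, Number and Year fields: one <value> child with a type attribute."""
    field = _field(field_id, label)
    ET.SubElement(field, "value", {**attrs, "type": value_type}).text = str(text)
    return field


def lov_field(field_id, label, lov_id, text):
    """Enum-like fields: one <lov> child whose id selects the option."""
    field = _field(field_id, label)
    ET.SubElement(field, "lov", id=lov_id).text = text
    return field


def bilingual_field(field_id, label, english=None, french=None):
    """Bilingual fields: an empty <value> marker plus <french>/<english> children."""
    field = _field(field_id, label)
    ET.SubElement(field, "value", type="Bilingual")
    bilingual = ET.SubElement(field, "bilingual")
    ET.SubElement(bilingual, "french").text = french
    ET.SubElement(bilingual, "english").text = english
    return field


year = value_field("6fafe258e19e49a7884428cb49d75424", "Year", 2021, "Year", format="yyyy")
role = lov_field("289c8814fff141d89b12569d49aa2cb3", "Contribution Role",
                 "f6f1d34952c44f7c884ac56036b3ba4d", "Last Author")
print(ET.tostring(year, encoding="unicode"))
print(ET.tostring(role, encoding="unicode"))
```

The `<lov>` ids would have to be harvested from existing exports, since nothing in this thread documents the full list of options.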

kousu commented 3 years ago

Looking at the fields in https://docs.google.com/spreadsheets/d/1dEUBYf17hNM22dqV4zx1gsh3Q-d97STnRB4q7p9nQ54/edit#gid=566297787, I think we can sync these:

Currently find_matching_ref detects problems (things needing to be synced) in the first and third fields, but not the others.

kousu commented 3 years ago

I also notice: a lot of the conference titles in the gsheet include their location, but CCV has separate "Conference Location" and "City" fields we could try to fill in. For example, here I would move "Berlin" to "City" and "Germany" to "Conference Location":

                                <field id="b3c8a60c053a405597b92899d95765a3" label="Conference Name">
                                        <value type="String">4 Jahrestagung der Deutschen Gesellschaft für Computer-und Roboter-Assistierte Chirurgie, Berlin, Germany</value>
                                </field>
                                <field id="5813833859a64bb58ee55e4f55aff29b" label="Conference Location"/>
                                <field id="c2efd9725588489b8df73467c5597c32" label="City">
                                        <value type="String"/>
                                </field>
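
A rough sketch of how that split might be done, assuming the gsheet titles always end in ", City, Country" (which will not hold for every entry, so the output would need a manual check). `split_conference_location` is a hypothetical name. Note that "Conference Location" is an empty `<field/>` in the export above, so how CCV expects it to be populated is still an open question.

```python
def split_conference_location(name):
    """Split 'Name, City, Country' into (name, city, country).

    Assumes the last two comma-separated parts are the location. Returns
    (name, None, None) unchanged when there are fewer than three parts.
    """
    parts = [part.strip() for part in name.rsplit(",", 2)]
    if len(parts) < 3 or not all(parts):
        return name, None, None
    return tuple(parts)


title = ("4 Jahrestagung der Deutschen Gesellschaft für Computer-und "
         "Roboter-Assistierte Chirurgie, Berlin, Germany")
conference, city, country = split_conference_location(title)
print(conference)
print(city, country)
```

Titles that contain commas but no location (or locations like "City, State, Country") would be split wrongly, so this is better suited to proposing edits than to applying them blindly.
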

jcohenadad commented 2 months ago

@namgo Do you have an update on this? I'm about to submit a grant in a few weeks and this feature would be really useful. Many thanks!

namgo commented 2 months ago

I don't have any updates sorry, I had started following Kousu's steps and then other tasks came up! I had signed up for ccv-cvc, and have some example code somewhere (I think), so I think I have everything I need to get started on it again.

I'll re-prioritize this issue.

namgo commented 2 months ago

(edit: I was complaining about lack of access to our cluster for these sorts of projects, but nevermind my laptop finished building pandas, I was being impatient)

I'm struggling to understand whether the mismatched UUID Nick saw is important or not, but I tried to get myself back on track by looking at existing work by others:

It looks like a lot of the constants we depend on were also found independently by https://github.com/sylvainhalle/CCCVTK , this seems like a great reference moving forward if we weren't aware of it yet.

More recently, https://ahemnason.notion.site/ORCID-to-CCV-7cfb24c9f13c4d869cd2beb950e9e2e2 looks to me like it's not directly relevant to our needs... but is it? As far as I know we don't use ORCID.

namgo commented 2 months ago

Okay I had the sense of what I was doing... partially right and partially wrong:

In this case an asterisk specifically needs to be added to HQP (students of Julien as denoted by Julien) which means overwriting existing names. The asterisk needs to be beside student names in the name field.
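
As a sketch of the string manipulation involved, assuming the names live in a single comma-separated "Authors" string and the asterisk goes directly after the name (both are assumptions; `mark_hqp` and the example names are made up):

```python
def mark_hqp(authors, hqp_names):
    """Append '*' to every author in a comma-separated string who is in hqp_names.

    Names already carrying an asterisk are left alone, so running this twice
    on the same string does not add a second asterisk.
    """
    marked = []
    for author in (name.strip() for name in authors.split(",")):
        if author in hqp_names:
            author += "*"
        marked.append(author)
    return ", ".join(marked)


hqp = {"Testerson T"}
print(mark_hqp("Testerson T, Supervisor S", hqp))
```

The set of HQP names would have to come from wherever Julien records them, and exact string matching means spelling variants of the same name would be missed.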

(side-note: what if we add asterisk'd student names purely from an exported xml to immediately re-upload? This would mean the iteratively written document is correct but wheww)

I have a small test xml set CCV-10259659(initial_export).xml.txt

Which I got from removing sections regarding the Testy Mc Testerson account (the first xml attrs) and reuploading.

nameless-initialization.xml.txt

I modified the initial xml test set to have asterisks and reuploaded:

initial-export_modified-for-asterisk.xml.txt

Which transforms:

CCV-NathanGorvett.pdf

Into:

CCV-NathanGorvett-1.pdf

:) asterisks!!

None of this addresses the automation part of this issue yet, but I'm figuring it out.