Produce files for Europe PMC integration

ReneRanzinger commented 1 year ago

Based on the specs in #256 produce the input for Europe PMC. Keep in mind that this has to be an ongoing process with each data release.

Dependencies:

#256

Blocker for:

#258

jeet-vora commented 10 months ago

For the EuropePMC linkouts, one or more XML files need to be produced, each about 9MB in size.

Content for the XML file

Example: PMID 33806155 (glygen_europepmc_example_33806155.txt)
Schema: https://europepmc.org/LabsLink

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
  <link providerId="2167">
    <resource>
      <title>Glycan - G57321FI</title>
      <url>https://www.glygen.org/publication/PubMed/33806155</url>
      <image>https://api.glygen.org/glycan/image/G57321FI</image>
    </resource>
    <record>
      <source>MED</source>
      <id>33806155</id>
    </record>
  </link>
</links>

providerId: The ID that EuropePMC assigned to GlyGen, which is 2167

resource: Contains information about the individual glycan or protein
resource/title: Glycan GlyTouCan accession or protein UniProtKB accession, preceded by the word Glycan or Protein (Glycan - G57321FI)
resource/url: URL of the GlyGen publication details page for the PMID that references the glycan or protein (https://www.glygen.org/publication/PubMed/33806155)
resource/image: URL of the glycan image; only for glycans that have an SNFG image (https://api.glygen.org/glycan/image/G57321FI)

record: Information about the PMID
record/source: Three-letter code for the evidence resource, which for GlyGen can be PubMed [MED] or DOI [DOI] (MED)
record/id: ID of the referenced publication, which can be a PMID or a DOI (33806155)
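
As a rough illustration of the element layout described above, the following sketch builds one link entry with Python's standard xml.etree.ElementTree. The function name build_pmid_link and its parameters are hypothetical; only the tag names, the provider ID 2167, and the example values come from this issue.

```python
import xml.etree.ElementTree as ET

PROVIDER_ID = "2167"  # EuropePMC-assigned ID for GlyGen (from this issue)


def build_pmid_link(title, url, pmid, image_url=None):
    """Return a <link> element for one accession/PMID pair (hypothetical helper)."""
    link = ET.Element("link", providerId=PROVIDER_ID)
    resource = ET.SubElement(link, "resource")
    ET.SubElement(resource, "title").text = title
    ET.SubElement(resource, "url").text = url
    if image_url:  # only glycans with an SNFG image get an <image> tag
        ET.SubElement(resource, "image").text = image_url
    record = ET.SubElement(link, "record")
    ET.SubElement(record, "source").text = "MED"  # MED = PubMed
    ET.SubElement(record, "id").text = str(pmid)
    return link


link = build_pmid_link(
    "Glycan - G57321FI",
    "https://www.glygen.org/publication/PubMed/33806155",
    33806155,
    "https://api.glygen.org/glycan/image/G57321FI",
)
print(ET.tostring(link, encoding="unicode"))
```

Calling the helper without image_url leaves out the image tag, which matches the rule that only glycans with an SNFG image carry one.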

Input files

All __citations*.csv files. Target uniprotkb_canonical_ac, glytoucan_ac and xref_id; these files provide the PMID/DOI for resource/url and source/id, and the protein and glycan accessions for the title.

Based on the GlyTouCan ac, retrieve the URL for the glycan image for the resource/image
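
A minimal sketch of how one citation row could be mapped to the title, publication ID and image URL. It assumes each citations file has an xref_id column plus either a glytoucan_ac or a uniprotkb_canonical_ac column; the helper name row_to_fields and the in-memory sample are made up for illustration. It does not check whether an SNFG image actually exists for the glycan.

```python
import csv
import io

# URL pattern taken from the example in this issue
IMAGE_URL = "https://api.glygen.org/glycan/image/{ac}"


def row_to_fields(row):
    """Map one citation row to (title, publication id, image URL or None)."""
    glycan_ac = (row.get("glytoucan_ac") or "").strip()
    protein_ac = (row.get("uniprotkb_canonical_ac") or "").strip()
    xref_id = row["xref_id"].strip()
    if glycan_ac:
        return f"Glycan - {glycan_ac}", xref_id, IMAGE_URL.format(ac=glycan_ac)
    return f"Protein - {protein_ac}", xref_id, None


# Tiny in-memory stand-in for a citations CSV (assumed layout)
sample = io.StringIO("glytoucan_ac,xref_id\nG57321FI,33806155\n")
rows = [row_to_fields(r) for r in csv.DictReader(sample)]
print(rows)
```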

Output file

Format: XML (9MB each; split into multiple files if necessary)
Name: all_europepmc_linkouts.xml

Note: The dataset masterlist has been updated; make changes if more than one file needs to be created.

jeet-vora commented 10 months ago

@rykahsay 5 issues were detected.

Issues

[ ] Split the large XML file into XML files of 9.5MB each, to be packaged as all_europepmc_linkouts.zip

The all_europepmc_linkouts.xml is ~420MB. EuropePMC only accepts files of up to about 10MB. Can you split this file into multiple XML files of 9.5MB each?

The final file in the reviewed folder will be a zip file containing several such files; I have made the corresponding changes in the dataset masterlist. I am also assuming that new files will be generated automatically when new data is added.

[ ] \m in <link providerId="2167">\m should be removed

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><links>
<link providerId="2167">\m      <resource>
     <title>Protein - Q9NYM9</title>

[ ] <resource> in <link providerId="2167">\m <resource> should be on the next line

[ ] <link providerId="2167"> needs to wrap every resource and record tied to a bibliographic record. So it should look like the example below

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
~~~~~~~~~~~~~~~~~~~~<link providerId="2167">~~~~~~~~~~~~~~~~~~~~
<resource>
  <title>Glycan - G57321FI</title>
  <url>https://www.glygen.org/publication/PubMed/33806155</url>
  <image>https://api.glygen.org/glycan/image/G57321FI</image>
</resource>
<record>
  <source>MED</source>
  <id>33806155</id>
</record>
</link>
~~~~~~~~~~~~~~~~~~~~<link providerId="2167">~~~~~~~~~~~~~~~~~~~~
<resource>
  <title>Glycan - G00031MO</title>
  <url>https://www.glygen.org/publication/PubMed/33806155</url>
  <image>https://api.glygen.org/glycan/image/G00031MO</image>
</resource>
<record>
  <source>MED</source>
  <id>33806155</id>
</record>
</link>
</links>

[ ] The source should be MED instead of PubMed, and the id should be the PMID.

Incorrect

<record>
     <source>PubMed</source>
     <id>PubMed</id>
  </record>

Correct

<record>
  <source>MED</source>
  <id>33806155</id>
</record>
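
For the splitting request in the first item above, one possible approach is to group already-serialized link blocks so that each output document, including the prologue and the closing tag, stays under a byte limit. This is only a sketch: split_links, HEADER and FOOTER are hypothetical names, the dummy blocks are placeholders, and a single block larger than the limit would still end up in an oversized document.

```python
HEADER = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<links>\n'
FOOTER = "</links>\n"


def split_links(link_blocks, max_bytes):
    """Group serialized <link> blocks into documents no larger than max_bytes."""
    overhead = len(HEADER.encode("utf-8")) + len(FOOTER.encode("utf-8"))
    docs, current, size = [], [], overhead
    for block in link_blocks:
        block_size = len(block.encode("utf-8"))
        # Start a new document when adding this block would exceed the limit
        if current and size + block_size > max_bytes:
            docs.append(HEADER + "".join(current) + FOOTER)
            current, size = [], overhead
        current.append(block)
        size += block_size
    if current:
        docs.append(HEADER + "".join(current) + FOOTER)
    return docs


# Dummy placeholder blocks, just to exercise the splitting logic
blocks = [f'<link providerId="2167"><!-- dummy {i} --></link>\n' for i in range(10)]
docs = split_links(blocks, max_bytes=300)
print(len(docs), [len(d.encode("utf-8")) for d in docs])
```

For real data, max_bytes=9_500_000 would correspond to the requested 9.5MB if MB is read as 10^6 bytes. Every document gets its own prologue and closing tag, so each file is a complete XML document on its own.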

rykahsay commented 9 months ago

Check

jeet-vora commented 9 months ago

Checked. Uploading to EuPMC. Will reopen if issues found by them.

jeet-vora commented 9 months ago

@rykahsay Two issues were detected by EuPMC.

[ ] </xml> tag
[ ] The packaged zip files should also be smaller than 10MB

A closing </xml> tag is present at the end of every file; it is not needed in the XML. Please remove it from all files. The starting tag is fine.

<url>https://www.glygen.org/publication/MED/11402059</url>
      </resource>
      <record>
         <source>MED</source>
         <id>11402059</id>
      </record>
   </link>
</links>
</xml>

I have modified the masterlist; there are now two data packages, europepmc_linkouts_one and europepmc_linkouts_two. The previous format was .XML, but both files need to be .zip, so I am changing it to .zip. Also, since the files are being split, the word "all" can be removed from the current dataset name, all_europepmc_linkouts.zip.
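
Both problems reported here (a stray closing tag after the root element and the package size) could be caught before upload with a small check. The sketch below is an assumption about how such a check might look, not the pipeline's actual code: check_zip and make_zip are hypothetical helpers, and "10MB" is taken to mean 10^7 bytes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MAX_BYTES = 10 * 1000 * 1000  # assumes "10MB" means 10^7 bytes


def check_zip(zip_bytes, max_bytes=MAX_BYTES):
    """Return a list of problems found in a zipped set of linkout XML files."""
    problems = []
    if len(zip_bytes) >= max_bytes:
        problems.append("zip is not below the size limit")
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            try:
                root = ET.fromstring(zf.read(name))
            except ET.ParseError as err:  # e.g. a stray </xml> after </links>
                problems.append(f"{name}: not well-formed ({err})")
                continue
            if root.tag != "links":
                problems.append(f"{name}: root is <{root.tag}>, expected <links>")
    return problems


def make_zip(files):
    """Build an in-memory zip from a {name: text} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


good = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<links>\n</links>\n'
bad = good + "</xml>\n"  # reproduces the stray closing tag
print(check_zip(make_zip({"good.xml": good, "bad.xml": bad})))
```

The parser rejects the file with the extra closing tag because nothing but whitespace or comments may follow the root element, so only bad.xml is reported.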

rykahsay commented 9 months ago

"all" stands for all molecules (glycan, protein, proteoform), so it is required

rykahsay commented 9 months ago

You think this is all you need to do for the changes you requested? Think harder

jeet-vora commented 9 months ago

@rykahsay If you are referring to BCOs, then it will be one BCO for all newly created zip files like https://data.glygen.org/GLY_000622 (Pubmed Linkouts)

Let me know if you want me to add the new dataset info in the BCO - https://biocomputeobject.org/GLY_000815/DRAFT

I can't think of anything else at this moment.

rykahsay commented 9 months ago

Looks like I will always have to do this

cd /software/glygen/
python3 check-bco2filename-mapping.py  | grep ERROR
NO-BCO,all_europepmc_linkouts_two.zip,ERROR,in_fs
NO-BCO,all_europepmc_linkouts_one.zip,ERROR,in_fs
GLY_000815,all_europepmc_linkouts.zip,ERROR,in_bco

jeet-vora commented 9 months ago

@rykahsay Another change requested for entries with DOI. See example below.

[ ] Remove the <source>DOI</source> lines and replace <id> with <doi>, e.g. <doi>10.1038/emboj.2013.79</doi>
[ ] Move <links> to the line after the XML prologue in all files, keeping the formatting. Currently it reads <?xml version="1.0" encoding="UTF-8" standalone="yes"?><links>

DOI Issue From

<link providerId="2167">
   <resource>
      <title>Protein - G57321FI</title>
      <url>https://www.glygen.org/publication/DOI/10.1038/emboj.2013.79</url>
   </resource>
   <record>
      <source>DOI</source>
      <id>10.1038/emboj.2013.79</id>
   </record>
</link>

To

<link providerId="2167">
   <resource>
      <title>Protein - G57321FI</title>
      <url>https://www.glygen.org/publication/DOI/10.1038/emboj.2013.79</url>
   </resource>
   <doi>10.1038/emboj.2013.79</doi>
</link>

Links Issue From

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><links>
   <link providerId="2167">

To

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
    <link providerId="2167">

rykahsay commented 9 months ago

check now

jeet-vora commented 9 months ago

Done

jeet-vora commented 9 months ago

@rykahsay Remove "record" property for all DOI entries

rykahsay commented 9 months ago

check now

jeet-vora commented 9 months ago

@rykahsay You have removed <doi> along with <record>; only <record> has to be removed (see the current example below). Please add the <doi> tag again. The <doi> tag should not be inside the <record> tag.

change

<link providerId="2167">
      <resource>
         <title>Protein - G57321FI</title>
         <url>https://www.glygen.org/publication/DOI/10.3390/v13040551</url>
      </resource>
   </link>

<link providerId="2167">
      <resource>
         <title>Protein - G02815KT</title>
         <url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
      </resource>
   </link>

to

<link providerId="2167">
      <resource>
         <title>Protein - G57321FI</title>
         <url>https://www.glygen.org/publication/DOI/10.3390/v13040551</url>
      </resource>
        <doi>10.3390/v13040551</doi>
   </link>

<link providerId="2167">
      <resource>
         <title>Protein - G02815KT</title>
         <url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
      </resource>
        <doi>10.1016/j.celrep.2021.109179</doi>
   </link>
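
The DOI layout requested above (a doi element next to resource, with no record element) could be produced along these lines. This is a sketch with a hypothetical helper name, build_doi_link; the URL prefix and the example DOI are taken from this thread.

```python
import xml.etree.ElementTree as ET


def build_doi_link(title, doi):
    """Return a <link> element for a DOI-based entry (hypothetical helper)."""
    link = ET.Element("link", providerId="2167")
    resource = ET.SubElement(link, "resource")
    ET.SubElement(resource, "title").text = title
    ET.SubElement(resource, "url").text = (
        f"https://www.glygen.org/publication/DOI/{doi}"
    )
    # <doi> is a direct child of <link>, next to <resource>; no <record> is added
    ET.SubElement(link, "doi").text = doi
    return link


link = build_doi_link("Glycan - G02815KT", "10.1016/j.celrep.2021.109179")
print(ET.tostring(link, encoding="unicode"))
```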

rykahsay commented 9 months ago

fixed

jeet-vora commented 9 months ago

Once the issues are collected, I will assign this to Robel.

[ ] GlyTouCan accessions are labeled Protein instead of Glycan

**Protein** - G02815KT

<link providerId="2167">
      <resource>
         <title>Protein - G02815KT</title>
         <url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
      </resource>
        <doi>10.1016/j.celrep.2021.109179</doi>
   </link>

[ ] Other issue .....
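
One way to avoid the Protein/Glycan mislabeling reported in the first item is to derive the prefix from the accession itself. The sketch below assumes GlyTouCan accessions consist of "G", five digits and two uppercase letters, as in the examples in this thread, and that everything else is a protein accession; make_title is a hypothetical helper.

```python
import re

# Assumption: GlyTouCan accessions look like "G" + 5 digits + 2 uppercase letters
GLYTOUCAN_RE = re.compile(r"G\d{5}[A-Z]{2}")


def make_title(accession):
    """Prefix the accession with Glycan or Protein based on its format."""
    kind = "Glycan" if GLYTOUCAN_RE.fullmatch(accession) else "Protein"
    return f"{kind} - {accession}"


for ac in ("G02815KT", "G57321FI", "Q9NYM9"):
    print(make_title(ac))
```

With this rule G02815KT and G57321FI are titled Glycan, while Q9NYM9 is titled Protein. If the citation files already say which column an accession came from, using that column would be more reliable than pattern matching.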
