glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Produce files for Europe PMC integration #257

Open ReneRanzinger opened 1 year ago

ReneRanzinger commented 1 year ago

Based on the specs in #256, produce the input for Europe PMC. Keep in mind that this has to be an ongoing process with each data release.

Dependencies:

Blocker for:

jeet-vora commented 10 months ago

For the EuropePMC linkouts, one or more XML files need to be produced, each about 9MB in size.

Content for the XML file

(Example PMID 33806155 - glygen_europepmc_example_33806155.txt) Schema - https://europepmc.org/LabsLink

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
  <link providerId="2167">
    <resource>
      <title>Glycan - G57321FI</title>
      <url>https://www.glygen.org/publication/PubMed/33806155</url>
      <image>https://api.glygen.org/glycan/image/G57321FI</image>
    </resource>
    <record>
      <source>MED</source>
      <id>33806155</id>
    </record>
  </link>
</links>

providerId: EuropePMC-assigned ID for GlyGen, which is 2167

resource: Contains information about the individual glycans and proteins.

resource/title: Glycan GlyTouCan accession or protein UniProtKB accession, preceded by the word Glycan or Protein (Glycan - G57321FI).

resource/url: URL of the GlyGen publication details page for the PMID that references the glycan or protein (https://www.glygen.org/publication/PubMed/33806155).

resource/image: URL of the glycan image. Only for glycans that have an SNFG image (https://api.glygen.org/glycan/image/G57321FI).

record: Information about the PMID.

record/source: Three-letter code for the evidence resource, which for GlyGen can be PubMed [MED] or DOI [DOI] (MED).

record/id: ID of the referenced publication, which can be a PMID or a DOI (33806155).
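
As an illustration of the field layout described above, here is a minimal Python sketch that builds one link entry for a PMID. The function name `build_pmid_link` and its parameters are hypothetical; only the element names, the provider ID and the URL patterns come from the example in this ticket.

```python
import xml.etree.ElementTree as ET

PROVIDER_ID = "2167"  # EuropePMC-assigned ID for GlyGen, as stated in this ticket


def build_pmid_link(kind, accession, pmid, image_url=None):
    """Return a <link> element for one accession/PMID pair (hypothetical helper)."""
    link = ET.Element("link", providerId=PROVIDER_ID)
    resource = ET.SubElement(link, "resource")
    ET.SubElement(resource, "title").text = f"{kind} - {accession}"
    ET.SubElement(resource, "url").text = (
        f"https://www.glygen.org/publication/PubMed/{pmid}"
    )
    # The image element is only added when an image URL is supplied (glycans only).
    if image_url:
        ET.SubElement(resource, "image").text = image_url
    record = ET.SubElement(link, "record")
    ET.SubElement(record, "source").text = "MED"
    ET.SubElement(record, "id").text = str(pmid)
    return link


links = ET.Element("links")
links.append(
    build_pmid_link(
        "Glycan",
        "G57321FI",
        "33806155",
        "https://api.glygen.org/glycan/image/G57321FI",
    )
)
xml_text = ET.tostring(links, encoding="unicode")
```

Building the elements with `xml.etree.ElementTree` rather than string concatenation means special characters in titles or URLs are escaped automatically.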

Input files

All __citations*.csv files; target the uniprotkb_canonical_ac, glytoucan_ac and xref_id columns. These files will provide the PMID/DOI for resource/url and record/id, and the protein and glycan accessions for the title.

Based on the GlyTouCan ac, retrieve the URL for the glycan image for the resource/image
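
A rough sketch of how the input columns could be turned into entries is shown below. It assumes each citations CSV has a header row with the column names listed above and that a row carries either a glycan or a protein accession; the function `rows_to_entries` is hypothetical, and the sketch does not check whether an SNFG image actually exists for a glycan.

```python
import csv
import io

# Image URL pattern taken from the example in this ticket.
IMAGE_URL = "https://api.glygen.org/glycan/image/{ac}"


def rows_to_entries(csv_file):
    """Yield (kind, accession, xref_id, image_url) tuples from a citations CSV.

    Assumption: the file has a header row and uses the column names
    glytoucan_ac, uniprotkb_canonical_ac and xref_id.
    """
    for row in csv.DictReader(csv_file):
        xref_id = (row.get("xref_id") or "").strip()
        if not xref_id:
            continue  # no publication reference, nothing to link
        glycan = (row.get("glytoucan_ac") or "").strip()
        protein = (row.get("uniprotkb_canonical_ac") or "").strip()
        if glycan:
            yield ("Glycan", glycan, xref_id, IMAGE_URL.format(ac=glycan))
        elif protein:
            yield ("Protein", protein, xref_id, None)


# Tiny in-memory sample standing in for one citations file.
sample = io.StringIO("glytoucan_ac,xref_id\nG57321FI,33806155\n")
entries = list(rows_to_entries(sample))
```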

Output file

Format: XML (9MB each, split into multiple files if necessary)

Name: all_europepmc_linkouts.xml

Note: Dataset masterlist has been updated, make changes if more than one file needs to be created.

jeet-vora commented 10 months ago

@rykahsay There are 5 issues detected.

Issues

The all_europepmc_linkouts.xml file is ~420MB. EuropePMC only accepts files of about 10MB. Can you split it into multiple XML files of 9.5MB each?

The final file in the reviewed folder will be a zip file containing several such files; I have made the changes in the dataset masterlist. Also, I am assuming that when new data is added, new files will automatically be generated.
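
One possible way to do the split is sketched below. It assumes the link entries are already available as serialized strings; the function `split_links` and the 9.5MB default are illustrative, not the actual GlyGen implementation. Each output document gets its own XML declaration and `<links>` wrapper so that every file is well-formed on its own.

```python
HEADER = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<links>\n'
FOOTER = "</links>\n"


def split_links(link_blocks, max_bytes=9_500_000):
    """Group serialized <link> blocks into complete documents of at most max_bytes.

    Note: a single block larger than max_bytes still ends up in its own
    (oversized) document; this sketch does not try to handle that case.
    """
    overhead = len((HEADER + FOOTER).encode("utf-8"))
    docs, current, size = [], [], overhead
    for block in link_blocks:
        block_size = len(block.encode("utf-8"))
        # Start a new document when adding this block would exceed the limit.
        if current and size + block_size > max_bytes:
            docs.append(HEADER + "".join(current) + FOOTER)
            current, size = [], overhead
        current.append(block)
        size += block_size
    if current:
        docs.append(HEADER + "".join(current) + FOOTER)
    return docs


# Small demonstration with an artificially low limit.
blocks = ['  <link providerId="2167"></link>\n'] * 10
docs = split_links(blocks, max_bytes=200)
```

Sizes are measured in UTF-8 bytes rather than characters, since the upload limit applies to the file on disk.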

rykahsay commented 9 months ago

Check

jeet-vora commented 9 months ago

Checked. Uploading to EuPMC. Will reopen if issues found by them.

jeet-vora commented 9 months ago

@rykahsay Two issues were detected by EuPMC.

A closing </xml> tag is present at the end of all the files; it is not needed in the XML. Please remove it from all files. The starting tag is fine.

<url>https://www.glygen.org/publication/MED/11402059</url>
      </resource>
      <record>
         <source>MED</source>
         <id>11402059</id>
      </record>
   </link>
</links>
</xml>
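
A small check like the following could catch this before upload. It is only a sketch: the helper names `strip_stray_xml_close` and `is_well_formed` are hypothetical, and the well-formedness test relies on the standard-library parser rather than on any EuropePMC validator.

```python
import re
import xml.etree.ElementTree as ET


def strip_stray_xml_close(text):
    """Remove a trailing '</xml>' that follows the closing </links> tag."""
    cleaned = re.sub(r"\s*</xml>\s*\Z", "", text)
    return cleaned.rstrip() + "\n"


def is_well_formed(text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


bad = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    "<links>\n</links>\n</xml>\n"
)
fixed = strip_stray_xml_close(bad)
```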

I have modified the masterlist; there are now two data packages, europepmc_linkouts_one and europepmc_linkouts_two. However, the previous format was .xml. Both files need to be .zip, so I am changing it to .zip. Also, since the files are being split, the word "all" can be removed from the current dataset name, all_europepmc_linkouts.zip.

rykahsay commented 9 months ago

all represents (all molecules -- glycan, protein, proteoform), so it is required

rykahsay commented 9 months ago

You think this is all you need to do for the changes you requested? Think harder

jeet-vora commented 9 months ago

@rykahsay If you are referring to BCOs, then it will be one BCO for all newly created zip files like https://data.glygen.org/GLY_000622 (Pubmed Linkouts)


Let me know if you want me to add the new dataset info in the BCO - https://biocomputeobject.org/GLY_000815/DRAFT


I can't think of anything else at this moment.

rykahsay commented 9 months ago

Looks like I will always have to do this

cd /software/glygen/
python3 check-bco2filename-mapping.py  | grep ERROR
NO-BCO,all_europepmc_linkouts_two.zip,ERROR,in_fs
NO-BCO,all_europepmc_linkouts_one.zip,ERROR,in_fs
GLY_000815,all_europepmc_linkouts.zip,ERROR,in_bco


jeet-vora commented 9 months ago

@rykahsay Another change has been requested for entries with a DOI. See the example below.

DOI Issue From

<link providerId="2167">
   <resource>
      <title>Protein - G57321FI</title>
      <url>https://www.glygen.org/publication/DOI/10.1038/emboj.2013.79</url>
   </resource>
   <record>
      <source>DOI</source>
      <id>10.1038/emboj.2013.79</id>
   </record>
</link>

To

<link providerId="2167">
   <resource>
      <title>Protein - G57321FI</title>
      <url>https://www.glygen.org/publication/DOI/10.1038/emboj.2013.79</url>
   </resource>
   <doi>10.1038/emboj.2013.79</doi>
</link>

Links Issue From

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><links>
   <link providerId="2167">

To

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
    <link providerId="2167">

rykahsay commented 9 months ago

check now

jeet-vora commented 9 months ago

Done

jeet-vora commented 9 months ago

@rykahsay Remove "record" property for all DOI entries

rykahsay commented 9 months ago

check now

jeet-vora commented 9 months ago

@rykahsay You have removed <doi> along with <record>; only <record> has to be removed (see the current example below). Please add the <doi> tag back. The <doi> tag should not be inside the <record> tag.

change

<link providerId="2167">
      <resource>
         <title>Protein - G57321FI</title>
         <url>https://www.glygen.org/publication/DOI/10.3390/v13040551</url>
      </resource>
   </link>
<link providerId="2167">
      <resource>
         <title>Protein - G02815KT</title>
         <url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
      </resource>
   </link>

to

<link providerId="2167">
      <resource>
         <title>Protein - G57321FI</title>
         <url>https://www.glygen.org/publication/DOI/10.3390/v13040551</url>
      </resource>
        <doi>10.3390/v13040551</doi>
   </link>
<link providerId="2167">
      <resource>
         <title>Protein - G02815KT</title>
         <url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
      </resource>
        <doi>10.1016/j.celrep.2021.109179</doi>
   </link>
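
For reference, a minimal Python sketch of the requested DOI layout follows: the `<doi>` element sits directly under `<link>`, next to `<resource>`, and no `<record>` element is emitted. The function name `build_doi_link` is hypothetical, and the accession and DOI are taken from the example above.

```python
import xml.etree.ElementTree as ET


def build_doi_link(kind, accession, doi, provider_id="2167"):
    """Return a <link> element for a DOI entry: <resource> plus <doi>, no <record>."""
    link = ET.Element("link", providerId=provider_id)
    resource = ET.SubElement(link, "resource")
    ET.SubElement(resource, "title").text = f"{kind} - {accession}"
    ET.SubElement(resource, "url").text = (
        f"https://www.glygen.org/publication/DOI/{doi}"
    )
    # The DOI goes in its own element directly under <link>.
    ET.SubElement(link, "doi").text = doi
    return link


doi_link = build_doi_link("Protein", "G02815KT", "10.1016/j.celrep.2021.109179")
doi_xml = ET.tostring(doi_link, encoding="unicode")
```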

rykahsay commented 9 months ago

fixed

jeet-vora commented 9 months ago

Once issues are collected, I will assign this to Robel.

**Protein** - G02815KT
<link providerId="2167">
      <resource>
         <title>Protein - G02815KT</title>
         <url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
      </resource>
        <doi>10.1016/j.celrep.2021.109179</doi>
   </link>