Open ReneRanzinger opened 1 year ago
For EuropePMC linkouts, XML file(s) need to be produced, each about 9MB.
(Example PMID 33806155 - glygen_europepmc_example_33806155.txt) Schema - https://europepmc.org/LabsLink
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
<link providerId="2167">
<resource>
<title>Glycan - G57321FI</title>
<url>https://www.glygen.org/publication/PubMed/33806155</url>
<image>https://api.glygen.org/glycan/image/G57321FI</image>
</resource>
<record>
<source>MED</source>
<id>33806155</id>
</record>
</link>
</links>
providerId: EuropePMC-assigned ID for GlyGen, which is 2167
resource: Contains information about the individual Glycans and Proteins
resource/title: Glycan GlyTouCan or Protein UniProtKB accession, preceded by the word Glycan or Protein (Glycan - G57321FI)
resource/url: URL of the GlyGen publication details page for the PMID that references the Glycan or Protein (https://www.glygen.org/publication/PubMed/33806155)
resource/image: URL of the glycan image. Only for Glycans that have an SNFG image (https://api.glygen.org/glycan/image/G57321FI)
record: Information about the PMID
record/source: Three-letter code for the evidence resource, which can be PubMed [MED] or DOI [DOI] for GlyGen (MED)
record/id: ID of the referenced publication, which can be a PMID or DOI (33806155)
Input: all __citations*.csv files, target columns uniprotkb_canonical_ac, glytoucan_ac and xref_id. These files will provide the PMID/DOI for resource/url and source/id, and the Protein and Glycan accession for title.
Based on the GlyTouCan ac, retrieve the URL for the glycan image for the resource/image
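The field descriptions above can be sketched in Python as follows. This is only an illustration built from the example in this issue: the function name `build_link` and the `has_image` flag are hypothetical, and how the real pipeline decides whether an SNFG image exists is not specified here.

```python
import xml.etree.ElementTree as ET

PROVIDER_ID = "2167"  # EuropePMC-assigned ID for GlyGen, as stated in this issue


def build_link(kind, accession, pmid, has_image=False):
    """Build one <link> element for a Glycan or Protein cited by a PMID.

    `kind` is "Glycan" or "Protein"; `has_image` is a hypothetical flag
    saying whether an SNFG image is available for the glycan.
    """
    link = ET.Element("link", providerId=PROVIDER_ID)
    resource = ET.SubElement(link, "resource")
    ET.SubElement(resource, "title").text = f"{kind} - {accession}"
    ET.SubElement(resource, "url").text = (
        f"https://www.glygen.org/publication/PubMed/{pmid}"
    )
    # The image element is only written for glycans that have an image.
    if kind == "Glycan" and has_image:
        ET.SubElement(resource, "image").text = (
            f"https://api.glygen.org/glycan/image/{accession}"
        )
    record = ET.SubElement(link, "record")
    ET.SubElement(record, "source").text = "MED"
    ET.SubElement(record, "id").text = str(pmid)
    return link


link = build_link("Glycan", "G57321FI", "33806155", has_image=True)
xml_text = ET.tostring(link, encoding="unicode")
print(xml_text)
```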
Format: XML (9MB each, split into multiple files if necessary)
Name: all_europepmc_linkouts.xml
Note: Dataset masterlist has been updated, make changes if more than one file needs to be created.
@rykahsay There are 5 issues detected.
The all_europepmc_linkouts.xml is ~420MB. EuropePMC only accepts files of about 10MB. Can you split this file into multiple XML files of 9.5MB each?
The final file in the reviewed folder will be a zip file containing several such files; I have made changes in the dataset masterlist. Also, I am assuming that new files will automatically be generated when new data is added.
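One possible way to do the split is sketched below in Python. The helper name `chunk_links` is hypothetical, and the sketch assumes each `<link>` block has already been serialized to a string; the size accounting includes the XML prologue and the `<links>` wrapper so that each complete file stays within the limit.

```python
def chunk_links(link_strings, max_bytes):
    """Group serialized <link> strings into complete XML documents.

    Each returned document is at most max_bytes long in UTF-8, unless a
    single <link> alone is larger than the limit (it then gets its own file).
    """
    header = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<links>\n'
    footer = "</links>\n"
    overhead = len(header.encode("utf-8")) + len(footer.encode("utf-8"))

    chunks, current, size = [], [], overhead
    for s in link_strings:
        n = len(s.encode("utf-8")) + 1  # +1 for the newline after each link
        if current and size + n > max_bytes:
            chunks.append(header + "\n".join(current) + "\n" + footer)
            current, size = [], overhead
        current.append(s)
        size += n
    if current:
        chunks.append(header + "\n".join(current) + "\n" + footer)
    return chunks


# Tiny demonstration with a deliberately small limit.
sample = ['<link providerId="2167"></link>'] * 10
chunks = chunk_links(sample, max_bytes=200)
print(len(chunks), "files")
```

For the real data the limit would be something like `int(9.5 * 1024 * 1024)`; whether EuropePMC counts megabytes as 1000 or 1024 kilobytes is not stated in this thread, so the exact value is an assumption.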
[ ] The stray \m should be removed, for example in:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><links>
<link providerId="2167">\m <resource>
<title>Protein - Q9NYM9</title>
[ ] Shouldn't <resource> in <link providerId="2167">\m <resource> be on the next line?
[ ] link providerId="2167" needs to be for every resource and record tied to a bibliographic record. So it should be like below
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
~~~~~~~~~~~~~~~~~~~~<link providerId="2167">~~~~~~~~~~~~~~~~~~~~
<resource>
<title>Glycan - G57321FI</title>
<url>https://www.glygen.org/publication/PubMed/33806155</url>
<image>https://api.glygen.org/glycan/image/G57321FI</image>
</resource>
<record>
<source>MED</source>
<id>33806155</id>
</record>
</link>
~~~~~~~~~~~~~~~~~~~~<link providerId="2167">~~~~~~~~~~~~~~~~~~~~
<resource>
<title>Glycan - G00031MO</title>
<url>https://www.glygen.org/publication/PubMed/33806155</url>
<image>https://api.glygen.org/glycan/image/G00031MO</image>
</resource>
<record>
<source>MED</source>
<id>33806155</id>
</record>
</link>
</links>
[ ] The source should be MED instead of PubMed, and the id should be the PMID.
Incorrect:
<record>
<source>PubMed</source>
<id>PubMed</id>
</record>
Correct:
<record>
<source>MED</source>
<id>33806155</id>
</record>
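A quick check for this item could look like the following Python sketch. The function name `record_problems` is hypothetical, and it only covers the `<record>` rule described in this checklist item (source must be MED or DOI, and a MED id must be a numeric PMID).

```python
import xml.etree.ElementTree as ET


def record_problems(link):
    """Return a list of problems found in the <record> of one <link> element."""
    record = link.find("record")
    if record is None:
        return []  # nothing to check for this rule
    problems = []
    source = record.findtext("source", default="")
    ident = record.findtext("id", default="")
    if source not in ("MED", "DOI"):
        problems.append(f"unexpected source: {source!r}")
    if source == "MED" and not ident.isdigit():
        problems.append(f"MED id is not a numeric PMID: {ident!r}")
    return problems


bad = ET.fromstring(
    '<link providerId="2167"><record>'
    "<source>PubMed</source><id>PubMed</id>"
    "</record></link>"
)
good = ET.fromstring(
    '<link providerId="2167"><record>'
    "<source>MED</source><id>33806155</id>"
    "</record></link>"
)
print(record_problems(bad))
print(record_problems(good))
```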
Check
Checked. Uploading to EuPMC. Will reopen if issues found by them.
@rykahsay Two issues were detected by EuPMC.
A </xml> closing tag is present at the end of all the files; it is not needed in the XML. Please remove it from all files. The starting tag is fine.
<url>https://www.glygen.org/publication/MED/11402059</url>
</resource>
<record>
<source>MED</source>
<id>11402059</id>
</record>
</link>
</links>
</xml>
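Removing the stray closing tag from already generated files could be done with something like the sketch below. The helper name `strip_trailing_xml_tag` is hypothetical; it works on the file content as a string and leaves everything else untouched.

```python
def strip_trailing_xml_tag(text):
    """Remove a stray </xml> at the very end of the document, if present."""
    stripped = text.rstrip()
    if stripped.endswith("</xml>"):
        stripped = stripped[: -len("</xml>")].rstrip()
    return stripped + "\n"


broken = "<links>\n</links>\n</xml>\n"
fixed = strip_trailing_xml_tag(broken)
print(fixed)
```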
I have modified the masterlist; there are now two data packages, europepmc_linkouts_one and europepmc_linkouts_two. The previous format was .xml, but both files need to be .zip, so I am changing it to .zip. Also, since the files are being split, the word "all" can be removed from the current dataset name, all_europepmc_linkouts.zip.
"all" represents all molecules (glycan, protein, proteoform), so it is required.
You think this is all you need to do for the changes you requested? Think harder
@rykahsay If you are referring to BCOs, then it will be one BCO for all newly created zip files like https://data.glygen.org/GLY_000622 (Pubmed Linkouts)
Let me know if you want me to add the new dataset info in the BCO - https://biocomputeobject.org/GLY_000815/DRAFT
I can't think of anything else at this moment.
Looks like I will always have to do this
cd /software/glygen/
python3 check-bco2filename-mapping.py | grep ERROR
NO-BCO,all_europepmc_linkouts_two.zip,ERROR,in_fs
NO-BCO,all_europepmc_linkouts_one.zip,ERROR,in_fs
GLY_000815,all_europepmc_linkouts.zip,ERROR,in_bco
@rykahsay Another change requested for entries with DOI. See example below.
For <source>DOI</source> lines, replace id with DOI.
Move <links> to the next line, keeping the formatting, for all files in the XML prologue: <?xml version="1.0" encoding="UTF-8" standalone="yes"?><links>
DOI Issue From
<link providerId="2167">
  <resource>
    <title>Protein - G57321FI</title>
    <url>https://www.glygen.org/publication/DOI/10.1038/emboj.2013.79</url>
  </resource>
  <record>
    <source>DOI</source>
    <id>10.1038/emboj.2013.79</id>
  </record>
</link>
To
<link providerId="2167">
  <resource>
    <title>Protein - G57321FI</title>
    <url>https://www.glygen.org/publication/DOI/10.1038/emboj.2013.79</url>
  </resource>
  <doi>10.1038/emboj.2013.79</doi>
</link>
Links Issue From
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><links>
<link providerId="2167">
To
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<links>
<link providerId="2167">
check now
Done
@rykahsay Remove "record" property for all DOI entries
check now
@rykahsay
You have removed <doi> along with <record>; only <record> has to be removed (see current example below).
Please add the <doi> tag again. The <doi> tag should not be inside the <record> tag.
change
<link providerId="2167">
<resource>
<title>Protein - G57321FI</title>
<url>https://www.glygen.org/publication/DOI/10.3390/v13040551</url>
</resource>
</link>
<link providerId="2167">
<resource>
<title>Protein - G02815KT</title>
<url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
</resource>
</link>
to
<link providerId="2167">
<resource>
<title>Protein - G57321FI</title>
<url>https://www.glygen.org/publication/DOI/10.3390/v13040551</url>
</resource>
<doi>10.3390/v13040551</doi>
</link>
<link providerId="2167">
<resource>
<title>Protein - G02815KT</title>
<url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
</resource>
<doi>10.1016/j.celrep.2021.109179</doi>
</link>
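The requested shape for DOI entries, with <doi> directly under <link> and no <record>, can be sketched in Python as below. The function name `build_doi_link` is hypothetical; the values are taken from the example above.

```python
import xml.etree.ElementTree as ET


def build_doi_link(kind, accession, doi):
    """Build a <link> for a DOI-only publication: <resource> plus <doi>, no <record>."""
    link = ET.Element("link", providerId="2167")
    resource = ET.SubElement(link, "resource")
    ET.SubElement(resource, "title").text = f"{kind} - {accession}"
    ET.SubElement(resource, "url").text = (
        f"https://www.glygen.org/publication/DOI/{doi}"
    )
    # <doi> is a direct child of <link>, not nested inside <record>.
    ET.SubElement(link, "doi").text = doi
    return link


doi_link = build_doi_link("Protein", "G02815KT", "10.1016/j.celrep.2021.109179")
doi_xml = ET.tostring(doi_link, encoding="unicode")
print(doi_xml)
```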
fixed
Once issues are collected, I will assign it to Robel.
<link providerId="2167">
<resource>
<title>Protein - G02815KT</title>
<url>https://www.glygen.org/publication/DOI/10.1016/j.celrep.2021.109179</url>
</resource>
<doi>10.1016/j.celrep.2021.109179</doi>
</link>
Based on the specs in #256 produce the input for Europe PMC. Keep in mind that this has to be an ongoing process with each data release.
Dependencies:
256
Blocker for:
258