Open AChatzigoulas opened 4 years ago
Dear @AChatzigoula, unfortunately 10am or 1pm EEST would not work for me, as I'm already booked with other telcos during that period, but I'm available on Thursday between 11am and 3:30pm Athens time. Would that work for you?
Kind regards, Harry
Hi Harry,
Unfortunately we are not available Thursday and Friday. Would any time work for you tomorrow? Best Zoe
Hi Zoe,
The only slot that I can cancel tomorrow is 3-4:30pm (attending the VLDB 2020 Round table discussion on "Intelligent Data Exploration") but that's ok if it suits you, as I want to know how things are going and what help you require from OpenAIRE.
Best, Harry
Hi Harry,
No, that's ok, please don't cancel it. I will be traveling on Thursday but I guess I could participate in a call at 11am. Will you send us a zoom link?
OK, great, I'll send a link for Thu 11am.
Dentica Meeting - OpenAIRE-Advance Open Innovation Call Thu, Sep 3, 2020 11:00 AM - 12:00 PM (EEST)
Please join my meeting from your computer, tablet or smartphone.
https://global.gotomeeting.com/join/830960165
You can also dial in using your phone. (For supported devices, tap a one-touch number below to join instantly.)
United States (Toll Free): 1 866 899 4679
United States: +1 (571) 317-3116
Access Code: 830-960-165
More phone numbers: (For supported devices, tap a one-touch number below to join instantly.)
Australia: +61 2 9091 7603
Austria: +43 7 2081 5337
Belgium: +32 28 93 7002
Canada: +1 (647) 497-9373
Denmark: +45 32 72 03 69
Finland: +358 923 17 0556
France: +33 187 210 241
Germany: +49 721 6059 6510
Ireland: +353 15 360 756
Italy: +39 0 230 57 81 80
Netherlands: +31 207 941 375
New Zealand: +64 9 913 2226
Norway: +47 21 93 37 37
Spain: +34 932 75 1230
Sweden: +46 853 527 818
Switzerland: +41 225 4599 60
United Kingdom: +44 20 3713 5011
New to GoToMeeting? Get the app now and be ready when your first meeting starts: https://global.gotomeeting.com/install/830960165
Dear Zoe, all,
Regarding the issues you've been having with the OpenAIRE dump files, I've managed to get a reply from Claudio at CNR: "Please ignore that dump files. We got a more recent one on OpenAIRE's Hadoop HDFS, represented in a simpler and better documented json-based data model".
So, he will get back to me with more details soon (he's on parental leave this week, but he might give me more info by tomorrow).
Dear Harry, This is excellent news! Looking forward to receiving the new files.
Dear Zoe, all,
Claudio said that the new graph dump is represented according to the following json schema https://code-repo.d4science.org/miriam.baglioni/dnet-hadoop/src/branch/dump/dhp-workflows/dhp-graph-mapper/src/main/resources/eu/dnetlib/dhp/oa/graph/dump_whole/schema (You might find some minor issues here and there as we're still finalising it, but it will be published soon as the official graph dump model v1.0)
I'm still waiting for instructions on how you can access this dump on HDFS.
Claudio now informed me that HDFS is not normally accessed by external users, so he's trying to get an answer from project admin if we can consider you as "extended tech team" and grant you access for the duration of phase 2, etc.
Dear Harry,
Thank you very much for your prompt actions. It would be great to have access to the file in this format. Also, we'd certainly need to know if this is a format that you plan to stick to, in order to adopt it. If it's possible to send us the feedback on our application, that would be great.
Dear Zoe,
I just had an update from Claudio:
" last week when we discussed the possibility to grant temporary access on HDFS to the OpenCall participants I didn’t keep in mind that the whole set of tools and web UIs we (tech team) use on a daily basis are reachable only through our VPN, so in my opinion this is a no-go for external users, I don’t think we can support them to configure the clients on their sides. So since Miriam won’t be back until next week we cannot expect the dump to be published on Zenodo until at least 10days/2weeks, therefore to cut the corner and save some time I’m trying to move the file containing the publications (plus the other result types) on some VM@CNR, where I’ll make it temporarily available for some time through an HTTP url. "
I think that is the best and easiest solution for you at the moment. Let's wait for Claudio to copy the data to a site you can access and download it.
Regarding the format, the official version will be published in about 10 days but it is very likely that it will be identical to this one.
All the best, Harry
Actually he's just done it:
_Harry the result.tar file is available from https://dev-openaire.d4science.org/dump/result.tar I downloaded ~4Gb of it, un-tarred and noticed the content is there, so assuming the transfer didn’t corrupt the rest of the file it should be OK to pass it over to the participants.
Regarding the data model, we’re going to use that JSON schema as reference model for these json dump files, but as I mentioned in skype with Thanasis &CO it still needs some adjustments before it can be officially published (edited)_
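For anyone processing the archive, here is a minimal Python sketch of how result.tar could be read without unpacking it to disk first. It assumes the archive holds files with one JSON record per line, optionally gzip-compressed; the thread does not describe the internal layout, so that layout and the function name `iter_records` are assumptions for illustration only.

```python
import gzip
import io
import json
import tarfile


def iter_records(tar_path):
    """Yield one parsed JSON object per non-empty line of every file in the archive.

    Assumption: each member is a JSON-lines file, plain or gzip-compressed
    (detected by a ".gz" suffix). Adjust if the real layout differs.
    """
    with tarfile.open(tar_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and links
            raw = archive.extractfile(member)
            if raw is None:
                continue
            if member.name.endswith(".gz"):
                text = gzip.open(raw, mode="rt", encoding="utf-8")
            else:
                text = io.TextIOWrapper(raw, encoding="utf-8")
            with text:
                for line in text:
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Because the function is a generator, records are handled one at a time, which matters for a multi-gigabyte dump. A truncated or corrupted download would typically surface as an exception partway through the iteration.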
Hi Harry, thanks a lot! I am forwarding this info to the team and we'll let you know if we have any questions!
Dear Harry,
We have now downloaded and processed the json file. I would like to confirm that you would like us to provide our new dump to the OpenAire Research Graph in this format. If yes, we have processed an example of five entries in the same json format, for you to check that this is indeed the desired format. How can I send you the files?
Best Zoe
Archive.zip OK I could upload the files here.
result_sample.json contains five entries from your file
ingredio_compounds_sample.json contains five entries from our data
The final json file that we will deliver to you can be in the format of ingredio_compounds_sample.json or are there any other changes needed?
Also, if you could send us the feedback we received from the reviewers, that would be great.
Dear Zoe, I'm glad that you managed to download and process the json graph file. Thank you for providing a sample of your output.
So if I understand this correctly, in each line you provide a chemical compound (pubchem id) linked to a number of PMC publications, giving PMID, Pubchem_ID, Article (title), Journal, Abstract, and DOI for each. This looks fine to me but I've shared your example with Claudio at CNR to be absolutely sure, so I'll let you know as soon as he replies.
Are you going to be processing only PubMed articles, or articles from other repositories too? The graph contains only abstracts, so let me know if you will later also require full texts for any subset of the publications. PDFs, when available (depending on the license, etc.), are converted to plaintext in OpenAIRE. So if you have a set of OpenAIRE IDs, DOIs or other publication IDs (like PMID), Marek from ICM could fetch the plaintexts for you.
I'll try to fetch the reviewers' comments from Phase 2 for you later today.
The comment I got from CNR was: so at least we need: sitename (e.g. PubChem); label (the title/label of the chemical); url (url to the chemical in PubChem); refidentifier (the id of the chemical).
CNR is waiting also for the opinion of ICM who deals with mining representation in OpenAIRE.
Sorry actually I truncated the full comment from Alessia at CNR, here it is:
Alessia Bardi 4:01 PM: I think it would be better to have also the openaire identifier of the publication in their output. Probably the pubchem identifier is not enough for us. Assuming we will include them as we do for PDB. Let me check what we need.
4:07 PM: ok, see here: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/ExternalReference.java
(My comment... this is how we represent PDB entries in OpenAIRE, as external references, so Alessia is suggesting we do the same with chemicals).
Alessia Bardi 4:12 PM: so at least we need: sitename (e.g. PubChem); label (the title/label of the chemical); url (url to the chemical in PubChem); refidentifier (the id of the chemical)
Marek from ICM added that he is fine with the output provided and Alessia's comments but wanted to know if you are going to provide the dumps periodically to make them "consumable" by OpenAIRE or is your codebase planned to be run as a part of IIS (Information Inference Service of OpenAIRE)?
As far as I remember from your proposal (paragraph on "Maintenance"), you will not be providing or integrating code with IIS but only providing updates. Am I correct?
Here are the comments of the consensus report for Phase 2:
Comments
The whole concept of DENTICA is very interesting and useful for OpenAIRE, while the workflow of Ingredio and its app is attractive for commercialisation. This is clearly a win-win collaboration that will enrich the OpenAIRE Research Graph, even if it is mostly relevant to the domain of chemical ingredients in food and cosmetics, while also delivering a new product for their app and benefiting their company.
In evaluating the initial Phase 1 proposal, there was almost a complete lack of information on how the text mining algorithms would work; however, this concern has now been adequately addressed both in the Phase 1 deliverable and even further in this Phase 2 Prototype template with step-by-step descriptions. Some details might still not be fully clear, but the overall solution design now looks very well-structured and reasonable.
Minor revision of the approach for the integration of the results into the OpenAIRE Research Graph may be required but this could be discussed with the OpenAIRE Technical team during the Phase 2 implementation. There were no updates to their Business Canvas model or the Cost Plan.
Dear Harry,
Thank you for your feedback on all matters and for the proposal feedback! Here are some responses.
1) Regarding the format, we looked at https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/ExternalReference.java and have two questions: a) what is expected in "qualifier"? b) what is expected in "query"?
We can fill in the rest of the entries with no problems. Today we plan to finalize our training set for the machine learning algorithm that we will write to explore the OpenAire dumps. The training set will have the form/entries of the link above.
2) Regarding the updates, indeed we had planned to provide dumps periodically to update the content. If you think it is useful to integrate code with IIS, we can certainly do it towards the end of the project or after the project, since it is not one of our deliverables.
Best Zoe
Dear Zoe,
1a) "qualifier" is meant to indicate the typology of the external reference. Currently the values supported by the relative vocabulary (dnet:externalReference_typologies) are the following:
So, depending on the type of the external reference you are going to provide us, you can pick one of those 4 values, or suggest new ones, so that we can extend our vocabulary definition. Here’s the json representation for two of them
{"classid":"url","classname":"url","schemeid":"dnet:externalReference_typologies","schemename":"dnet:externalReference_typologies"}
{"classid":"accessionNumber","classname":"accessionNumber","schemeid":"dnet:externalReference_typologies","schemename":"dnet:externalReference_typologies"}
1b) Regarding the field "query" at the moment it is not used. I suggest you keep it empty.
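To make the fields discussed above concrete, here is a minimal Python sketch of an external reference for one PubChem compound. The field names (sitename, label, url, qualifier, refidentifier, query) and the qualifier layout come from this thread; the helper name `make_external_reference`, the choice of "url" as the default qualifier, the empty string for "query", and the PubChem compound URL pattern are assumptions, not something CNR has confirmed.

```python
import json

# Vocabulary name quoted in the thread for the "qualifier" field.
VOCABULARY = "dnet:externalReference_typologies"


def make_external_reference(pubchem_id, label, class_id="url"):
    """Build one external-reference dict for a PubChem compound (hypothetical helper).

    class_id must be one of the vocabulary values; "url" is an assumed default.
    """
    qualifier = {
        "classid": class_id,
        "classname": class_id,
        "schemeid": VOCABULARY,
        "schemename": VOCABULARY,
    }
    return {
        "sitename": "PubChem",
        "label": label,
        # Assumed URL pattern for a compound page on PubChem.
        "url": f"https://pubchem.ncbi.nlm.nih.gov/compound/{pubchem_id}",
        "qualifier": qualifier,
        "refidentifier": str(pubchem_id),
        "query": "",  # not used at the moment, so left empty
    }


if __name__ == "__main__":
    print(json.dumps(make_external_reference(2244, "Aspirin"), indent=2))
```

Whether "url" or "accessionNumber" is the better qualifier for a PubChem identifier is still something to agree with CNR.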
2) OK great. Regarding integration with IIS we can discuss about it after the end of the project.
Best regards, Harry
Dear Harry,
Thanks very much. We have an updated json file. Could you and your colleagues take a look so that we can finalize the format? newstruct.txt
Best Zoe
Dear Harry,
Zoe uploaded an older version of the sample. I am attaching the updated sample here. updated_newstruct.txt
Best regards, Gerasimos
Dear Gerasimos, thanks. I've just shared it with my colleagues at CNR and ICM to verify the format. Best regards, Harry
Dear Gerasimos, There is something wrong with the file. It's not a valid JSON file, it seems to have the second entry expanded/repeated at the bottom of the file. I've attempted to fix it. Is the attached more like what you had in mind?
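A quick way to catch this kind of problem before sharing a sample is to check that every line parses as JSON on its own. The sketch below assumes the file is meant to hold one JSON record per line, as the samples in this thread appear to; the function name `find_invalid_lines` is made up for illustration.

```python
import json


def find_invalid_lines(path):
    """Return (line_number, error_message) for each non-empty line that is not valid JSON."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are ignored
            try:
                json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((number, error.msg))
    return problems
```

A record that has been pretty-printed over several lines is reported as a run of consecutive invalid lines, which is what an expanded copy of an entry at the bottom of a file would look like.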
Dear Harry
Indeed I did repeat the second line expanded for readability purposes, I am sorry I forgot to mention it beforehand. Yes the updated_newstruct_HD.txt file is the correct one.
Gerasimos
OK, thanks! This looks good to me but I'm still waiting for the confirmation from CNR.
Dear Zoe, Gerasimos, all,
Claudio (CNR) sent the following remarks today:
"Hi All, apologies for the delayed reply… I agree with Marek, I’d prefer avoiding to get even more data mapping tasks, so I’m preparing a full JSON example that hopefully Dentica can use as a reference to produce a record that we can directly map into our internal model.
mmh.. perhaps I miss a bit of context on what they are supposed to deliver. Each line in the JSON file (I’m looking at the one patched by Harry) represents the association between one chemical compound identified via its pubchemID and a list of articles identified by their PMIDs and DOIs, so I assume that the data they produced is grouped by pubchemID. From the discussion we had so far, pubchemIDs should be introduced in the OpenAIRE graph as externalreference(s), so as attributes of result entities that do exist in the graph that indicate an entity external to the graph itself. Therefore the model they should produce must have as main entity the publication (result), identified via OpenAIRE ID, to which the pubchemID must be added as an externalreference
I’m going to finalise the full example in the next few hours (sorry but I’m juggling on several things at the same time)"
So, as I understand it, the results should be inverted. Instead of having one chemical per entry/line associated with a number of publications, we need one publication per line associated with a number of chemicals. That is because in OpenAIRE a publication is the main entity, which is linked to a number of other objects (funded projects, datasets, software, patents, bioentities, etc.)
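The inversion described here can be sketched in a few lines of Python. The input shape (a "pubchem_id" plus a list of "articles", each carrying a "pmid") is a hypothetical stand-in for the real sample files, and grouping by PMID is only a placeholder: Claudio's note asks for the publication to be identified by its OpenAIRE ID, and that matching step is not shown.

```python
from collections import defaultdict


def invert(compound_records):
    """Turn 'one compound -> many articles' records into 'one article -> many compounds'.

    Assumed input shape (hypothetical):
        {"pubchem_id": "...", "articles": [{"pmid": "..."}, ...]}
    Returns a dict mapping each PMID to the list of PubChem IDs linked to it.
    """
    by_publication = defaultdict(list)
    for record in compound_records:
        for article in record["articles"]:
            by_publication[article["pmid"]].append(record["pubchem_id"])
    return dict(by_publication)
```

Each resulting entry would then become one output line: a publication record carrying its compounds as a list of external references.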
Let's also wait for the example that Claudio is going to provide.
Best regards, Harry
PS. We are negotiating a one-week extension of the deadlines for phase 2 because of OpenAIRE's virtual General Assembly between 12-17 Oct (demos are due on the 16th but nobody from OpenAIRE will be available). Coralia is currently considering it.
And here is the sample JSON file from Claudio: publication_record.txt
"Here’s the minimal JSON record representing a publication in the OpenAIRE graph with the information from the file you cleaned. The most crucial aspect is the id field that must contain the matched OpenAIRE identifier.
That record can be mapped directly into our internal Oaf model, specifically as a Publication https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/Publication.java"
Let me know if the above makes sense to you, else I'll give you Claudio's email to speak to him directly.
Dear Harry,
I tried to replicate Claudio's JSON structure, here is a sample. structSample.txt
I also added the Journal's title at the very end of the JSON which was missing from Claudio's example. Could you please take a look and let us know if there is something wrong/missing?
Best regards, Gerasimos
Thanks, Gerasime, I'm waiting for Claudio's OK. Best, Harry
Dear Gerasime,
Concerning the addition of “container”: { “name”: “Journal of pharmaceutical sciences” } since the class "Publication" already declares "journal" it's best to use "journal" instead of "container" in order to align with our current model.
Interestingly, we’re planning to rename that field as container to align with the Guidelines v4 but this will be done in the future.
Otherwise, all else is fine.
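In code terms the requested change is a key rename on each record. A minimal Python sketch, assuming records are plain dicts loaded from the JSON sample and that the value under "container" can be moved over unchanged (the helper name `use_journal_key` is made up):

```python
def use_journal_key(record):
    """Return a copy of the record with "container" renamed to "journal".

    Records without a "container" key are returned unchanged (as a copy).
    """
    updated = dict(record)
    if "container" in updated:
        updated["journal"] = updated.pop("container")
    return updated
```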
Best regards, Harry
Dear Harry,
We are ready to submit
D2) Project abstracts (summary of the actions to be completed during Phase 2) (deadline 15/9) Output and results: pdf document
D2.1) Documentation (deadline 28/9) Detailed presentation and documentation of code, API(s), licenses used, to build the prototype service(s) Output and results: Online document
Should we send to you or OpenAire directly?
Also, you had mentioned that we may get a small extension for the end of the project (26/10/2020): D2.2) Phase 2 report
Please do let us know if an extension will be granted, so that we plan accordingly.
Best Zoe
Dear Zoe,
Please send it both to me and Coralia, because they have been very slow to reply lately, so I'd like to have a copy (they might forward it to me with a long delay).
There has been complete silence from Coralia about the extension. Thanks for reminding me. I'll contact Nektaria there today to find out about it.
Best regards, Harry
Great, thanks. Can you share your email here? Alternatively write me an email to zcournia at bioacademy.gr.
Great, I've just sent you an email with my two email accounts (ΕΚΠΑ & ATHENA RC).
Dear Zoe,
I got a reply from Coralia saying that "the deliverables must be sent to openaire@corallia.org and we will make sure that all evaluators and respective supervisors receive them as well."
In addition, this evening all SMEs will receive an email from Coralia with a slightly revised timeplan for the deadlines.
Best regards, Harry
OK, thanks. I will be sending the deliverables to this address with you in cc.
Thanks, Zoe. Have a nice weekend.
Dear Harry,
Regarding the journal tag change, could you please verify that this sample is in the correct format? public_record.txt
Best regards, Gerasimos
Dear Gerasimos,
Sorry for the delayed reply, Claudio just replied that he tested the file "public_record.txt" and parsed it correctly as a Publication. So it all looks good.
Best regards, Harry
That is great, thanks a lot for letting me know.
Best regards, Gerasimos
Dear Harry,
We are all set for our conference call on October 26.
Are there any guidelines/template that we should follow?
Also, what are we expected to present in the prototype demonstration?
Best Zoe
Dear Harry,
we received an email to contact you as our supervisor to discuss details about the implementation of OPENAIRE project Phase 2.
Can we arrange a meeting for tomorrow? We are available at 10am or 1pm EEST.
Best, Ingredio team