openaire / open-innovation2020

OpenAIRE-Advance Open Innovation call
https://www.openaire.eu/open-innovation-in-openaire

Meeting with supervisor (Dentica) #2

Open AChatzigoulas opened 4 years ago

AChatzigoulas commented 4 years ago

Dear Harry,

we received an email to contact you as our supervisor to discuss details about the implementation of OPENAIRE project Phase 2.

Can we arrange a meeting for tomorrow? We are available at 10am or 1pm EEST.

Best, Ingredio team

harry-di commented 4 years ago

Dear @AChatzigoulas, unfortunately 10am or 1pm EEST would not work for me, as I'm already booked with other telcos during that period, but I'm available on Thursday between 11am and 3:30pm Athens time. Would that work for you?

Kind regards, Harry

zoecournia commented 4 years ago

Hi Harry,

Unfortunately we are not available Thursday and Friday. Would any time work for you tomorrow? Best Zoe

harry-di commented 4 years ago

Hi Zoe,

The only slot that I can cancel tomorrow is 3-4:30pm (attending the VLDB 2020 Round table discussion on "Intelligent Data Exploration") but that's ok if it suits you as I want to know how things are going on and what help you require from OpenAIRE.

Best, Harry

zoecournia commented 4 years ago

Hi Harry,

No, that's OK, please don't cancel it. I will be traveling on Thursday but I guess I could participate in a call at 11am. Will you send us a Zoom link?

harry-di commented 4 years ago

OK, great, I'll send a link for Thu 11am.

harry-di commented 4 years ago

Dentica Meeting - OpenAIRE-Advance Open Innovation Call Thu, Sep 3, 2020 11:00 AM - 12:00 PM (EEST)

Please join my meeting from your computer, tablet or smartphone.

https://global.gotomeeting.com/join/830960165

You can also dial in using your phone. (For supported devices, tap a one-touch number below to join instantly.)

United States (Toll Free): 1 866 899 4679

United States: +1 (571) 317-3116

Access Code: 830-960-165

More phone numbers: (For supported devices, tap a one-touch number below to join instantly.)

Australia: +61 2 9091 7603

Austria: +43 7 2081 5337

Belgium: +32 28 93 7002

Canada: +1 (647) 497-9373

Denmark: +45 32 72 03 69

Finland: +358 923 17 0556

France: +33 187 210 241

Germany: +49 721 6059 6510

Ireland: +353 15 360 756

Italy: +39 0 230 57 81 80

Netherlands: +31 207 941 375

New Zealand: +64 9 913 2226

Norway: +47 21 93 37 37

Spain: +34 932 75 1230

Sweden: +46 853 527 818

Switzerland: +41 225 4599 60

United Kingdom: +44 20 3713 5011

New to GoToMeeting? Get the app now and be ready when your first meeting starts: https://global.gotomeeting.com/install/830960165

harry-di commented 4 years ago

Dear Zoe, all,

Regarding the issues you've been having with the OpenAIRE dump files, I've managed to get a reply from Claudio at CNR: "Please ignore that dump files. We got a more recent one on OpenAIRE's Hadoop HDFS, represented in a simpler and better documented json-based data model".

harry-di commented 4 years ago

So, he will get back to me with more details soon (he's on parental leave this week, but he might give me more info by tomorrow).

zoecournia commented 4 years ago

Dear Harry, This is excellent news! Looking forward to receiving the new files.

harry-di commented 4 years ago

Dear Zoe, all,

Claudio said that the new graph dump is represented according to the following json schema https://code-repo.d4science.org/miriam.baglioni/dnet-hadoop/src/branch/dump/dhp-workflows/dhp-graph-mapper/src/main/resources/eu/dnetlib/dhp/oa/graph/dump_whole/schema (You might find some minor issues here and there as we're still finalising it, but it will be published soon as the official graph dump model v1.0)

I'm still waiting for instructions on how you can access this dump on HDFS.

harry-di commented 4 years ago

Claudio now informed me that HDFS is not normally accessed by external users, so he's trying to get an answer from project admin if we can consider you as "extended tech team" and grant you access for the duration of phase 2, etc.

zoecournia commented 4 years ago

Dear Harry,

Thank you very much for your prompt actions. It would be great to have access to the file in this format. Also, we'd certainly need to know if this is a format that you plan to stick to, so that we can adopt it. If it's possible to send us the feedback on our application, that would be great.

harry-di commented 4 years ago

Dear Zoe,

I just had an update from Claudio:

" last week when we discussed the possibility to grant temporary access on HDFS to the OpenCall participants I didn’t keep in mind that the whole set of tools and web UIs we (tech team) use on a daily basis are reachable only through our VPN, so in my opinion this is a no-go for external users, I don’t think we can support them to configure the clients on their sides. So since Miriam won’t be back until next week we cannot expect the dump to be published on Zenodo until at least 10days/2weeks, therefore to cut the corner and save some time I’m trying to move the file containing the publications (plus the other result types) on some VM@CNR, where I’ll make it temporarily available for some time through an HTTP url. "

I think that is the best and easiest solution for you at the moment. Let's wait for Claudio to copy the data to a site you can access and download it.

Regarding the format, the official version will be published in about 10 days but it is very likely that it will be identical to this one.

All the best, Harry

harry-di commented 4 years ago

Actually he's just done it:

"Harry, the result.tar file is available from https://dev-openaire.d4science.org/dump/result.tar

I downloaded ~4Gb of it, un-tarred and noticed the content is there, so assuming the transfer didn’t corrupt the rest of the file it should be OK to pass it over to the participants.

Regarding the data model, we’re going to use that JSON schema as reference model for these json dump files, but as I mentioned in skype with Thanasis &CO it still needs some adjustments before it can be officially published"

zoecournia commented 4 years ago

Hi Harry, thanks a lot! I am forwarding this info to the team and we'll let you know if we have any questions!

zoecournia commented 3 years ago

Dear Harry,

We have now downloaded and processed the json file. I would like to confirm that you would like us to provide our new dump to the OpenAIRE Research Graph in this format. If so, we have prepared an example of five entries in the same json format, for you to check that this is indeed the desired format. How can I send you the files?

Best Zoe

zoecournia commented 3 years ago

Archive.zip

OK, I could upload the files here.

result_sample.json contains five entries from your file

ingredio_compounds_sample.json contains five entries from our data

Can the final json file that we will deliver to you be in the format of ingredio_compounds_sample.json, or are there any other changes needed?

Also, if you could send us the feedback we received from the reviewers, that would be great.

harry-di commented 3 years ago

Dear Zoe, I'm glad that you managed to download and process the json graph file. Thank you for providing a sample of your output.

So if I understand this correctly, in each line you provide a chemical compound (pubchem id) linked to a number of PMC publications, giving PMID, Pubchem_ID, Article (title), Journal, Abstract, and DOI for each. This looks fine to me but I've shared your example with Claudio at CNR to be absolutely sure, so I'll let you know as soon as he replies.

Are you going to be processing only PubMed articles, or articles from other repositories too? The graph contains only abstracts, so let me know if you will later also require full texts for any subset of the publications. PDFs, when available (depending on the license, etc.), are converted to plaintext in OpenAIRE. So if you have a set of OpenAIRE IDs, DOIs or other publication IDs (like PMID), Marek from ICM could fetch the plaintexts for you.

I'll try to fetch the reviewers' comments from Phase 2 for you later today.

harry-di commented 3 years ago

The comment I got from CNR was: so at least we need:

sitename (e.g. PubChem)

label (the title/label of the chemical)

url (url to the chemical in PubChem)

refidentifier (the id of the chemical)

CNR is waiting also for the opinion of ICM who deals with mining representation in OpenAIRE.

harry-di commented 3 years ago

Sorry actually I truncated the full comment from Alessia at CNR, here it is:

Alessia Bardi 4:01 PM

I think it would be better to have also the openaire identifier of the publication in their output

Probably the pubchem identifier is not enough for us.

Assuming we will include them as we do for PDB. let me check what we need

4:07

ok, see here: https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/ExternalReference.java

(My comment... this is how we represent PDB entries in OpenAIRE, as external references, so Alessia is suggesting we do the same with chemicals).

Alessia Bardi 4:12 PM

so at least we need:

sitename (e.g. PubChem)

label (the title/label of the chemical)

url (url to the chemical in PubChem)

refidentifier (the id of the chemical)
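
A minimal sketch of the four fields Alessia lists, written as a Python dict builder. The function name, the example compound, and the PubChem URL pattern are assumptions made for illustration; only the four field names come from the comment above.

```python
def make_external_reference(compound_id, label):
    """Build the four fields CNR named as the minimum for one chemical."""
    compound_id = str(compound_id)
    return {
        "sitename": "PubChem",
        "label": label,
        # Assumed URL pattern for a PubChem compound page.
        "url": "https://pubchem.ncbi.nlm.nih.gov/compound/" + compound_id,
        "refidentifier": compound_id,
    }


# Hypothetical example entry (PubChem CID 2244 is aspirin).
ref = make_external_reference(2244, "Aspirin")
print(ref)
```

The remaining fields of the ExternalReference class linked above are left out here on purpose, since this comment only names these four.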

harry-di commented 3 years ago

Marek from ICM added that he is fine with the output provided and Alessia's comments but wanted to know if you are going to provide the dumps periodically to make them "consumable" by OpenAIRE or is your codebase planned to be run as a part of IIS (Information Inference Service of OpenAIRE)?

As far as I remember from your proposal (paragraph on "Maintenance"), you will not be providing or integrating code with IIS but only providing updates. Am I correct?

harry-di commented 3 years ago

Here are the comments of the consensus report for Phase 2:

Comments The whole concept of DENTICA is very interesting and useful for OpenAIRE, while the workflow of Ingredio and its app is attractive for commercialisation. This is clearly a win-win collaboration that will enrich the OpenAIRE Research Graph, even if it is mostly relevant to the domain of chemical ingredients in food and cosmetics, while also delivering a new product for their app and benefiting their company.
In evaluating the initial Phase 1 proposal, there was almost a complete lack of information on how the text mining algorithms would work; however, this concern has now been adequately addressed both in the Phase 1 deliverable and even further in this Phase 2 Prototype template with step-by-step descriptions. Some details might still not be fully clear, but the overall solution design now looks very well-structured and reasonable. Minor revision of the approach for the integration of the results into the OpenAIRE Research Graph may be required but this could be discussed with the OpenAIRE Technical team during the Phase 2 implementation. There were no updates to their Business Canvas model or the Cost Plan.

zoecournia commented 3 years ago

Dear Harry,

Thank you for your feedback on all matters and for the proposal feedback! Here are some responses.

1) Regarding the format, we looked at https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/ExternalReference.java and have two questions: a) what is expected in "qualifier"? b) what is expected in "query"?

We can fill in the rest of the entries with no problems. Today we plan to finalize our training set for the machine learning algorithm that we will write to explore the OpenAIRE dumps. The training set will have the form/entries of the link above.

2) Regarding the updates, indeed we had planned to provide dumps periodically to update the content. If you think it is useful to integrate code with IIS, we can certainly do it towards the end of the project or after the project, since it is not one of our deliverables.

Best Zoe

harry-di commented 3 years ago

Dear Zoe,

1a) "qualifier" is meant to indicate the typology of the external reference. Currently the values supported by the corresponding vocabulary (dnet:externalReference_typologies) are the following:

So, depending on the type of the external reference you are going to provide us, you can pick one of those 4 values, or suggest new ones, so that we can extend our vocabulary definition. Here’s the json representation for two of them

{"classid":"url","classname":"url","schemeid":"dnet:externalReference_typologies","schemename":"dnet:externalReference_typologies"}
{"classid":"accessionNumber","classname":"accessionNumber","schemeid":"dnet:externalReference_typologies","schemename":"dnet:externalReference_typologies"}

1b) Regarding the field "query" at the moment it is not used. I suggest you keep it empty.

2) OK great. Regarding integration with IIS, we can discuss it after the end of the project.

Best regards, Harry
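
A minimal sketch, assuming Python, of how the "qualifier" and "query" answers above could be applied to a reference. The helper names are hypothetical, the empty string for "query" is one possible reading of "keep it empty", and picking "url" as the typology for a chemical is only an example, not something decided in this thread.

```python
import json

SCHEME = "dnet:externalReference_typologies"


def make_qualifier(class_id):
    """Return a qualifier shaped like the two JSON examples above."""
    return {
        "classid": class_id,
        "classname": class_id,
        "schemeid": SCHEME,
        "schemename": SCHEME,
    }


def with_qualifier(reference, class_id):
    """Copy a reference dict and add 'qualifier' plus an empty 'query'."""
    enriched = dict(reference)
    enriched["qualifier"] = make_qualifier(class_id)
    enriched["query"] = ""  # not used at the moment, so kept empty (assumed as "")
    return enriched


qualifier_json = json.dumps(make_qualifier("url"), separators=(",", ":"))
print(qualifier_json)
```

With compact separators, the printed line has the same shape as the first of the two qualifier examples given in this comment.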

zoecournia commented 3 years ago

Dear Harry,

Thanks very much. We have an updated json file. Could you and your colleagues take a look so that we can finalize the format? newstruct.txt

Best Zoe

GerasimosKou commented 3 years ago

Dear Harry,

Zoe uploaded an older version of the sample. I am attaching the updated sample here. updated_newstruct.txt

Best regards, Gerasimos

harry-di commented 3 years ago

Dear Gerasimos, thanks. I've just shared it with my colleagues at CNR and ICM to verify the format. Best regards, Harry

harry-di commented 3 years ago

Dear Gerasimos, There is something wrong with the file. It's not a valid JSON file, it seems to have the second entry expanded/repeated at the bottom of the file. I've attempted to fix it. Is the attached more like what you had in mind?

updated_newstruct_HD.txt
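
A small sketch of a check that would catch this kind of problem, assuming the file is meant to hold one JSON object per line. The function name and the sample text are hypothetical; the sample mimics a second entry repeated in expanded, multi-line form at the bottom.

```python
import json


def find_invalid_lines(text):
    """Return the 1-based numbers of non-empty lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


# Two valid one-line entries, then the second entry repeated in expanded form.
sample = '{"a": 1}\n{"b": 2}\n{\n  "b": 2\n}\n'
invalid = find_invalid_lines(sample)
print(invalid)
```

An expanded entry is valid JSON as a whole, but none of its individual lines parses on its own, which is why a line-oriented reader rejects it.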

GerasimosKou commented 3 years ago

Dear Harry

Indeed I did repeat the second line expanded for readability purposes, I am sorry I forgot to mention it beforehand. Yes the updated_newstruct_HD.txt file is the correct one.

Gerasimos

harry-di commented 3 years ago

OK, thanks! This looks good to me but I'm still waiting for the confirmation from CNR.

harry-di commented 3 years ago

Dear Zoe, Gerasimos, all,

Claudio (CNR) sent the following remarks today:

"Hi All, apologies for the delayed reply… I agree with Marek, I’d prefer avoiding to get even more data mapping tasks, so I’m preparing a full JSON example that hopefully Dentica can use as a reference to produce a record that we can directly map into our internal model.

mmh.. perhaps I miss a bit of context on what they are supposed to deliver. Each line in the JSON file (I’m looking at the one patched by Harry) represents the association between one chemical compound identified via its pubchemID and a list of articles identified by their PMIDs and DOIs, so I assume that the data they produced is grouped by pubchemID. From the discussion we had so far, pubchemIDs should be introduced in the OpenAIRE graph as externalreference(s), so as attributes of result entities that do exist in the graph that indicate an entity external to the graph itself. Therefore the model they should produce must have as main entity the publication (result), identified via OpenAIRE ID, to which the pubchemID must be added as an externalreference.

I’m going to finalise the full example in the next few hours (sorry but I’m juggling on several things at the same time)"

So, as I understand it, the results should be inverted. Instead of having one chemical per entry/line associated with a number of publications, we require one publication per line associated with a number of chemicals. That is because in OpenAIRE a publication is the main entity, which is linked to a number of other objects (funded projects, datasets, software, patents, bioentities, etc.)
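
A minimal sketch of that inversion in Python. The keys "Pubchem_ID" and "PMID" follow the fields mentioned earlier in this thread, while the "publications" wrapper key, the output key "Pubchem_IDs", and the sample values are assumptions. The real deliverable would be keyed on the matched OpenAIRE ID rather than the PMID, and that matching step is not shown here.

```python
from collections import defaultdict


def invert(compound_records):
    """Regroup compound -> publications into publication -> compounds."""
    by_publication = defaultdict(set)
    for record in compound_records:
        for publication in record["publications"]:
            by_publication[publication["PMID"]].add(record["Pubchem_ID"])
    # One output entry per publication, in order of first appearance.
    return [
        {"PMID": pmid, "Pubchem_IDs": sorted(ids)}
        for pmid, ids in by_publication.items()
    ]


# Hypothetical input: one entry per chemical, as in the current sample.
compounds = [
    {"Pubchem_ID": "2244", "publications": [{"PMID": "111"}, {"PMID": "222"}]},
    {"Pubchem_ID": "702", "publications": [{"PMID": "222"}]},
]
publications = invert(compounds)
print(publications)
```

Using a set per publication means a chemical that is listed twice for the same article appears only once in the output.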

Let's also wait for the example that Claudio is going to provide.

Best regards, Harry

PS. We are negotiating a one-week extension of the deadlines for phase 2 because of OpenAIRE's virtual General Assembly on 12-17 Oct (demos are due on the 16th but nobody from OpenAIRE will be available). Corallia is currently considering it.

harry-di commented 3 years ago

And here is the sample JSON file from Claudio: publication_record.txt

"Here’s the minimal JSON record representing a publication in the OpenAIRE graph with the information from the file you cleaned. The most crucial aspect is the id field that must contain the matched OpenAIRE identifier

That record can be mapped directly into our internal Oaf model, specifically as a Publication https://code-repo.d4science.org/D-Net/dnet-hadoop/src/branch/master/dhp-schemas/src/main/java/eu/dnetlib/dhp/schema/oaf/Publication.java"

harry-di commented 3 years ago

Let me know if the above makes sense to you, else I'll give you Claudio's email to speak to him directly.

GerasimosKou commented 3 years ago

Dear Harry,

I tried to replicate Claudio's JSON structure, here is a sample. structSample.txt

I also added the Journal's title at the very end of the JSON which was missing from Claudio's example. Could you please take a look and let us know if there is something wrong/missing?

Best regards, Gerasimos

harry-di commented 3 years ago

Thanks, Gerasime, I'm waiting for Claudio's OK. Best, Harry

harry-di commented 3 years ago

Dear Gerasime,

Concerning the addition of "container": { "name": "Journal of pharmaceutical sciences" }: since the class "Publication" already declares "journal", it's best to use "journal" instead of "container" in order to align with our current model.

Interestingly, we’re planning to rename that field as container to align with the Guidelines v4 but this will be done in the future.

Otherwise, all else is fine.

Best regards, Harry
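
A minimal sketch of the rename described above, assuming each record is a Python dict and that the journal title stays nested under "name" exactly as in the "container" snippet. The function name and the placeholder id are hypothetical.

```python
def use_journal_field(record):
    """Return a copy of the record with 'container' moved to 'journal'."""
    fixed = dict(record)
    if "container" in fixed:
        container = fixed.pop("container")
        # Keep an existing 'journal' value if the record already has one.
        fixed.setdefault("journal", container)
    return fixed


sample_record = {
    "id": "placeholder-openaire-id",  # hypothetical identifier
    "container": {"name": "Journal of pharmaceutical sciences"},
}
renamed = use_journal_field(sample_record)
print(renamed)
```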

zoecournia commented 3 years ago

Dear Harry,

We are ready to submit

D2) Project abstracts (summary of the actions to be completed during Phase 2) (deadline 15/9) Output and results: pdf document

D2.1) Documentation (deadline 28/9) Detailed presentation and documentation of code, API(s), licenses used, to build the prototype service(s) Output and results: Online document

Should we send them to you or to OpenAIRE directly?

Also, you had mentioned that we may get a small extension for the end of the project (26/10/2020): D2.2) Phase 2 report

Please do let us know if an extension will be granted, so that we plan accordingly.

Best Zoe

harry-di commented 3 years ago

Dear Zoe,

Please send it both to me and to Corallia. They have been very slow to reply lately, so I'd like to have my own copy (they might forward it to me with a long delay).

There has been complete silence from Corallia about the extension. Thanks for reminding me. I'll contact Nektaria there today to find out about it.

Best regards, Harry

zoecournia commented 3 years ago

Great, thanks. Can you share your email here? Alternatively write me an email to zcournia at bioacademy.gr.

harry-di commented 3 years ago

Great, I've just sent you an email with my two email accounts (ΕΚΠΑ & ATHENA RC).

harry-di commented 3 years ago

Dear Zoe,

I got a reply from Corallia saying that "the deliverables must be sent to openaire@corallia.org and we will make sure that all evaluators and respective supervisors receive them as well."

In addition, this evening all SMEs will receive an email from Corallia with a slightly revised timeplan for the deadlines.

Best regards, Harry

zoecournia commented 3 years ago

OK, thanks. I will be sending the deliverables to this address with you on cc.

harry-di commented 3 years ago

Thanks, Zoe. Have a nice weekend.

GerasimosKou commented 3 years ago

Dear Harry,

Regarding the journal tag change, could you please verify that this sample is in the correct format? public_record.txt

Best regards, Gerasimos

harry-di commented 3 years ago

Dear Gerasimos,

Sorry for the delayed reply, Claudio just replied that he tested the file "public_record.txt" and parsed it correctly as a Publication. So it all looks good.

Best regards, Harry

GerasimosKou commented 3 years ago

That is great, thanks a lot for letting me know.

Best regards, Gerasimos

zoecournia commented 3 years ago

Dear Harry,

We are all set for our conference call on October 26.

Are there any guidelines or a template that we should follow?

Also, what are we expected to present in the prototype demonstration?

Best Zoe